Utilizing Metadata Search
Introduction
Version 1.13 of Layar introduced the ability to attach metadata to documents and tables. Filtering on that metadata narrows retrieval to the relevant chunks, which makes RAG more accurate and helps the model respond accurately. This guide explains how to embed documents with metadata tags, then use the generate and search endpoints to retrieve chunks that match metadata filters.
Embedding Metadata Tags
Let's go over how to embed documents and tables with metadata tags.
Considerations Before Embedding
Once embedded, metadata tags become immutable. For example, if you embed a document with the tag Section_Number and give it the string value '1.2.3.4', that tag's data type becomes permanent. If you later try to give Section_Number a value of a different data type, such as an integer, you will get errors when re-embedding, because Section_Number must have a string value. The workaround is to name the tag Section_Number_1. You can then change the tag name slightly, to Section_Number_2, in the future, which allows you to change the data type. We highly suggest confirming your metadata tag names and values before embedding.
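Because a tag's data type is fixed at first embedding, it can help to check your planned tags for mixed types before sending anything to Layar. The following is a minimal local sketch, not part of the Layar API; the helper name `find_type_conflicts` and the sample tag sets are hypothetical.

```python
def find_type_conflicts(tag_sets):
    """Return {tag_name: sorted type names} for tags seen with more than one data type."""
    seen = {}
    for tags in tag_sets:
        for name, value in tags.items():
            seen.setdefault(name, set()).add(type(value).__name__)
    return {name: sorted(types) for name, types in seen.items() if len(types) > 1}


# Hypothetical tag dictionaries you plan to embed.
planned = [
    {"Section_Number_1": "1.2.3.4", "Section": "abstract"},
    {"Section_Number_1": 12, "Section": "methods"},
]
conflicts = find_type_conflicts(planned)
print(conflicts)  # {'Section_Number_1': ['int', 'str']}
```

Any tag reported here would need a single data type, or a new tag name, before embedding.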
Preparing Table Tags
Let's look at a simple table.
| Takeaways | Disease |
|---|---|
| Causes inflammation of the lower intestines. | Crohn's Disease |
Before uploading the document to Layar, add a new column that contains a JSON dictionary with the tags and values. Let's look at a simple JSON structure.
{'Pubmed_Author' : 'D Schindler',
'Section' : 'abstract',
'Methodology' : 'double blind'}
Tag Names
You can't use spaces in a JSON key name. Replace any space with an underscore, e.g. Pubmed_Author.
Now add the JSON to a new column. The column can be named anything, but it's best to choose a name that describes what the column contains.
| Takeaways | Disease | metadata |
|---|---|---|
| Causes inflammation of the lower intestines. | Crohn's Disease | {'Pubmed_Author' : 'D Schindler', 'Section' : 'abstract', 'Methodology' : 'double blind'} |
This document can now be uploaded to Layar via the API or the front end. See Upload Documents.
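The table preparation described above could be scripted before upload. This is a rough sketch under several assumptions: the table is a CSV file, the same tags apply to every row, and the metadata cell is serialized with `json.dumps` (standard double-quoted JSON, whereas the example above shows single quotes, so check which form your Layar version expects). The helper names are hypothetical.

```python
import csv
import io
import json


def normalize_tag_name(name):
    """Replace spaces with underscores, since tag names cannot contain spaces."""
    return name.strip().replace(" ", "_")


def add_metadata_column(csv_text, tags, column_name="metadata"):
    """Return CSV text with one extra column holding the tags as a JSON string."""
    clean_tags = {normalize_tag_name(k): v for k, v in tags.items()}
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = list(reader.fieldnames) + [column_name]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row[column_name] = json.dumps(clean_tags)  # assumption: standard JSON is accepted
        writer.writerow(row)
    return out.getvalue()


table = "Takeaways,Disease\nCauses inflammation of the lower intestines.,Crohn's Disease\n"
tags = {"Pubmed Author": "D Schindler", "Section": "abstract", "Methodology": "double blind"}
result = add_metadata_column(table, tags)
print(result)
```

The column name passed as `column_name` is the value you would later supply as metadataColumn when embedding.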
Embed Table With Tags
Once the document is uploaded you need to embed the document using the metadataColumn parameter in the JSON payload of the request. The value needs to be the column name containing the metadata. The request would look like this.
import requests

# envUrl, documentId, token and dataFabric are placeholders for your own values.
response = requests.post(
    f'{envUrl}/layar/sourceDocument/{documentId}/createEmbeddings',
    headers={'Accept': 'application/json',
             'Content-Type': 'application/json',
             'Authorization': f"Bearer {token}",
             'X-Vyasa-Data-Fabric': dataFabric},
    json={
        "metadataColumn": "metadata"  # The column we added to the table
    })

You have now embedded the table with the metadata filters attached to the text chunks.
Embed Document With Tags
As with tables, documents can have a JSON dictionary of metadata tags attached to text chunks. The JSON dictionary is supplied in the body of the request with the tags parameter.
import requests

# envUrl, documentId, token and dataFabric are placeholders for your own values.
response = requests.post(
    f'{envUrl}/layar/sourceDocument/{documentId}/createEmbeddings',
    headers={'Accept': 'application/json',
             'Content-Type': 'application/json',
             'Authorization': f"Bearer {token}",
             'X-Vyasa-Data-Fabric': dataFabric},
    json={
        # Tags attached to the document's text chunks
        'tags': {
            'Pubmed_Author': 'D Schindler',
            'Section': 'abstract',
            'Methodology': 'double blind'
        }})

Searching For Tagged Chunks
Both /layar/gpt/generate and /layar/gpt/retrieval/search let you use metadata searches to further influence which chunks are returned from the vector database. Both endpoints take the searchFilters and searchOperator parameters along with a query parameter, which together dictate how chunks are retrieved by metadata tag. Let's look at an example and break down each parameter.
{
    "query": "What conclusion was reached in this research paper?",
    "searchOperator": "AND",
    "searchFilters": [{
        "documentId": "AZWvMlBe-OBvyKYpiNE-",
        "provider": "dldev02.vyasa.com",
        "documentType": "TABLE",
        "searchConditions": [
            {"field": "Pubmed_Author", "operator": "==", "value": "D Schindler"},
            {"field": "Section", "operator": "==", "value": "abstract"}
        ]
    }]
}

query (required)
A string value of the RAG query that will be used along with the metadata filters. You must provide this in order to get results.
rag_query vs. query
The generate and search endpoints use different parameter names for the query. The generate endpoint uses rag_query. The search endpoint uses query.
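One way to avoid mixing up the two parameter names is to build the request body through a small helper. The sketch below is illustrative only: `build_request_body` is a hypothetical function, and the generate endpoint may require additional parameters that this guide does not cover.

```python
def build_request_body(endpoint, question, search_filters, search_operator="AND"):
    """Build a request body, using the query key that matches the endpoint."""
    if search_operator not in ("AND", "OR"):
        raise ValueError("searchOperator must be 'AND' or 'OR'")
    if endpoint.endswith("/generate"):
        query_key = "rag_query"  # the generate endpoint uses rag_query
    elif endpoint.endswith("/retrieval/search"):
        query_key = "query"  # the search endpoint uses query
    else:
        raise ValueError(f"unsupported endpoint: {endpoint}")
    return {
        query_key: question,
        "searchOperator": search_operator,
        "searchFilters": search_filters,
    }


search_body = build_request_body("/layar/gpt/retrieval/search", "What conclusion was reached?", [])
generate_body = build_request_body("/layar/gpt/generate", "What conclusion was reached?", [])
print(sorted(search_body))
print(sorted(generate_body))
```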
searchOperator (required)
A string that takes one of two values, AND or OR, and dictates how the search conditions are combined. The example above uses AND, so a chunk is returned only if its metadata satisfies both the Pubmed_Author and the Section conditions; a chunk that fails either one is not returned. Currently, this search operator applies to ALL filters.
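To make the difference between AND and OR concrete, here is a small local simulation of how the two operators combine conditions. It only illustrates the logic; it is not how Layar evaluates filters internally. It models just the == and != operators and treats a missing field as a non-match, which is an assumption.

```python
import operator

# Only two comparison operators are modeled here, to keep the sketch short.
COMPARATORS = {"==": operator.eq, "!=": operator.ne}


def condition_matches(metadata, condition):
    """True if the chunk metadata has the field and the comparison holds."""
    if condition["field"] not in metadata:
        return False  # assumption: a missing field never matches
    compare = COMPARATORS[condition["operator"]]
    return compare(metadata[condition["field"]], condition["value"])


def chunk_matches(metadata, conditions, search_operator):
    """Combine all conditions with AND (all must hold) or OR (any may hold)."""
    results = [condition_matches(metadata, c) for c in conditions]
    return all(results) if search_operator == "AND" else any(results)


conditions = [
    {"field": "Pubmed_Author", "operator": "==", "value": "D Schindler"},
    {"field": "Section", "operator": "==", "value": "abstract"},
]
chunk = {"Pubmed_Author": "D Schindler", "Section": "methods"}
print(chunk_matches(chunk, conditions, "AND"))  # False: the Section condition fails
print(chunk_matches(chunk, conditions, "OR"))   # True: the Pubmed_Author condition holds
```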
searchFilters (required)
A list of nested JSON containing the document or set you wish to search along with the metadata search conditions.
documentId (required)
A string containing the documentId you want to search on. Required if savedListId is not provided.
provider (required)
A string containing the provider the documents are contained in. For the most part this is going to be the URL of the environment.
documentType (required)
A string containing the document type the search will target. The value must be either TABLE or DOCUMENT.
documentType With Mixed Sets
Sets you create may contain a mix of TABLE and DOCUMENT items. In this case, you must provide a search filter for both TABLE and DOCUMENT to ensure you get chunks back from both types.
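For a mixed set, the two filters differ only in documentType, so they can be generated together. This sketch assumes that savedListId is supplied as a key inside each search filter in place of documentId; the guide mentions savedListId but does not show where it goes, so verify this against your environment. The function name is hypothetical.

```python
def filters_for_mixed_set(saved_list_id, provider, search_conditions):
    """Return one search filter per document type for a set that mixes both."""
    return [
        {
            "savedListId": saved_list_id,  # assumption: key placement is not shown in this guide
            "provider": provider,
            "documentType": document_type,
            "searchConditions": list(search_conditions),
        }
        for document_type in ("TABLE", "DOCUMENT")
    ]


mixed_filters = filters_for_mixed_set(
    "<YOUR_SAVED_LIST_ID>",
    "<YOUR_DATA_PROVIDER>",
    [{"field": "Section", "operator": "==", "value": "abstract"}],
)
print([f["documentType"] for f in mixed_filters])  # ['TABLE', 'DOCUMENT']
```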
searchConditions (required)
A list of JSON dictionaries that contain the metadata filters.
field (required)
A string containing the field you want to filter on.
Nested JSON
Metadata can be nested JSON. Let's look at an example metadata JSON.
{'Pubmed_Author' : 'D Schindler', 'Section' : 'abstract', 'Methodology' : {'approach': 'double blind', 'centers': 5, 'locations': ['North America', 'Europe']}}
The general structure for searching nested JSON is parentkey_childkey. If you wanted to filter for locations, the field would need to be Methodology_locations.
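The parentkey_childkey naming can be derived mechanically from a nested metadata dictionary. The helper below is a local sketch for working out field names. It assumes that deeper levels of nesting follow the same underscore pattern, which this guide does not state explicitly.

```python
def flatten_field_names(metadata, parent=""):
    """Map nested metadata to flat {field_name: value} using parentkey_childkey."""
    flat = {}
    for key, value in metadata.items():
        name = f"{parent}_{key}" if parent else key
        if isinstance(value, dict):
            # Recurse into nested dictionaries, carrying the joined name down.
            flat.update(flatten_field_names(value, name))
        else:
            flat[name] = value
    return flat


metadata = {
    "Pubmed_Author": "D Schindler",
    "Section": "abstract",
    "Methodology": {
        "approach": "double blind",
        "centers": 5,
        "locations": ["North America", "Europe"],
    },
}
fields = flatten_field_names(metadata)
print(list(fields))
```

Here `fields["Methodology_locations"]` holds the list of locations, matching the field name used in the example above.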
operator (required)
A string value specifying which operator to use for the specific field. The valid operators are ==, !=, <, <=, >, >=, in, not in, and like. It's important to keep in mind which data type is associated with the field.
For example, if the data type for the field is a list, you can use not in and in, but you can't use == or the greater than / less than operators. Conversely, if the data type is an integer, you can't use in or not in.
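These operator rules can be checked on the client before a request is sent. The sketch below encodes only the two restrictions stated above (lists and integers); other type and operator combinations are passed through unchecked because this guide does not describe them. The function name is hypothetical.

```python
VALID_OPERATORS = {"==", "!=", "<", "<=", ">", ">=", "in", "not in", "like"}
NOT_FOR_LISTS = {"==", "<", "<=", ">", ">="}
NOT_FOR_INTEGERS = {"in", "not in"}


def check_operator(op, stored_value):
    """Raise ValueError if the operator does not suit the stored value's type."""
    if op not in VALID_OPERATORS:
        raise ValueError(f"unknown operator: {op!r}")
    if isinstance(stored_value, list) and op in NOT_FOR_LISTS:
        raise ValueError(f"{op!r} cannot be used on a list field")
    if isinstance(stored_value, int) and op in NOT_FOR_INTEGERS:
        raise ValueError(f"{op!r} cannot be used on an integer field")
    return True


print(check_operator("in", ["North America", "Europe"]))  # True
print(check_operator(">=", 5))                            # True
```

`stored_value` is a sample of the value embedded for the tag, used only to determine the field's data type.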
value (required)
The specific value you are looking for in the specified field.
Tips and Tricks
- searchConditions is a list of dictionaries. The searchOperator applies to every condition supplied. It's very important to decide which searchOperator you want to use, since that determines how you set up your searchConditions JSON.
- The query being used is important because it takes precedence over the metadata filtering. For example, if the query has no relation to the chunk the metadata tags are attached to, the vector database may not return the chunk. This is because the search still relies on semantic and keyword similarity; if the vector database doesn't find the chunk relevant, it will not return it. Retriever settings also play a part: if the similarity threshold is too high or the number of chunks is too low, you may still miss chunks even though the metadata tags are present.
Searching For All Metadata
You can use the searchFilters payload to look up the metadata schema for a specific document by sending a POST to /layar/gpt/retrieval/getDocumentProperties.
# header is assumed to hold the same request headers used in the earlier examples.
response = requests.post(
    f'{envUrl}/layar/gpt/retrieval/getDocumentProperties',
    headers=header,
    json={
        "searchFilters": [{
            "documentId": "<YOUR_DOCUMENT_ID>",
            "provider": "<YOUR_DATA_PROVIDER>",
            "documentType": "TABLE",
            "searchConditions": [{"field": "player",
                                  "operator": "==",
                                  "value": "Babe Ruth"}]
        }]
    }).json()

documentType can be changed to DOCUMENT if you are looking for metadata attached to a document.
Supplemental Recipes
Here are some recipes that go over how to use metadata search as a standalone retrieval or via the generate endpoint.
