Utilizing Metadata Search
Introduction
Version 1.13 of Layar introduced the ability to attach metadata to documents and tables. Filtering on that metadata makes retrieval-augmented generation (RAG) more precise, which helps the model respond accurately. This guide details how to embed documents with metadata tags, then use the generate and search endpoints to retrieve chunks that match the metadata filters.
Embedding Metadata Tags
Let's go over how to embed documents and tables with metadata tags.
Considerations Before Embedding
Once embedded, metadata tags become immutable. For example, if you embed a document with the tag `Section_Number` and give it the string value `'1.2.3.4'`, that tag's data type becomes permanent. If you later try to change `Section_Number` to an integer value, you will get errors when re-embedding, because `Section_Number` must have a string value. The workaround is to name the tag `Section_Number_1`. You can then change the tag name slightly, to `Section_Number_2`, in the future, which allows you to change the data type. We highly suggest confirming your metadata tag names and values before embedding.
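Because a tag's data type cannot change after embedding, it can help to check tag values locally before sending a request. The following is a minimal sketch of such a check; `check_tag_types` and the `known` registry are hypothetical helpers kept on your side, not part of the Layar API.

```python
# Hypothetical local registry of tag types; Layar does not provide this helper.
def check_tag_types(tags, known_types):
    """Return a list of (tag, expected_type, actual_type) mismatches.

    known_types maps tag names to the Python type used when the tag was
    first embedded. Tags not yet in known_types are registered as new.
    """
    mismatches = []
    for name, value in tags.items():
        expected = known_types.setdefault(name, type(value))
        if not isinstance(value, expected):
            mismatches.append((name, expected.__name__, type(value).__name__))
    return mismatches


known = {}
# First embedding: Section_Number is registered as a string.
first = check_tag_types({"Section_Number": "1.2.3.4"}, known)
# A later attempt with an integer value is flagged before any request is sent.
second = check_tag_types({"Section_Number": 1234}, known)
print(first)   # []
print(second)  # [('Section_Number', 'str', 'int')]
```

In practice the registry would need to be persisted (for example in a small JSON file) so that it survives between embedding runs.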
Preparing Table Tags
Let's look at a simple table.
Takeaways | Disease |
---|---|
Causes inflammation of the lower intestines. | Crohn's Disease |
Before uploading the document to Layar, add a new column that contains a JSON dictionary with the tags and values. Let's look at a simple JSON structure.
    {'Pubmed_Author' : 'D Schindler',
     'Section' : 'abstract',
     'Methodology' : 'double blind'}
Tag Names
You can't use spaces in a JSON key name; replace any space with an underscore, for example `Pubmed_Author`.
Now add the JSON to a new column. The column can be named anything, but it's best to pick a name that describes what the column contains.
Takeaways | Disease | metadata |
---|---|---|
Causes inflammation of the lower intestines. | Crohn's Disease | {'Pubmed_Author' : 'D Schindler', 'Section' : 'abstract', 'Methodology' : 'double blind'} |
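The metadata column can also be added programmatically. The sketch below uses only the Python standard library; `normalize_tag_name` and `add_metadata_column` are hypothetical helpers, and the CSV output format is an assumption. Note that `json.dumps` emits double-quoted JSON, while the example above uses single quotes, so confirm which form your Layar environment accepts.

```python
import csv
import io
import json


def normalize_tag_name(name):
    """Replace spaces with underscores so the key is a valid tag name."""
    return name.strip().replace(" ", "_")


def add_metadata_column(rows, tags_per_row, column="metadata"):
    """Return copies of rows with a serialized tag dictionary in `column`."""
    out = []
    for row, tags in zip(rows, tags_per_row):
        clean = {normalize_tag_name(k): v for k, v in tags.items()}
        out.append({**row, column: json.dumps(clean)})
    return out


rows = [{"Takeaways": "Causes inflammation of the lower intestines.",
         "Disease": "Crohn's Disease"}]
tags = [{"Pubmed Author": "D Schindler", "Section": "abstract",
         "Methodology": "double blind"}]

tagged = add_metadata_column(rows, tags)

# Write the result as CSV text; in practice this would go to a file for upload.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(tagged[0].keys()))
writer.writeheader()
writer.writerows(tagged)
csv_text = buffer.getvalue()
```

Normalizing the key names in one place keeps the "no spaces" rule from being violated by hand-typed tags.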
This document can now be uploaded to Layar via the API or the front end (see Upload Documents).
Embed Table With Tags
Once the document is uploaded, embed it using the `metadataColumn` parameter in the JSON payload of the request. The value must be the name of the column containing the metadata. The request would look like this.
    import requests

    # envUrl, documentId, token and dataFabric must already be defined
    response = requests.post(
        f'{envUrl}/layar/sourceDocument/{documentId}/createEmbeddings',
        headers={'Accept': 'application/json',
                 'Content-Type': 'application/json',
                 'Authorization': f"Bearer {token}",
                 'X-Vyasa-Data-Fabric': dataFabric},
        json={
            "metadataColumn": "metadata"  # The column we added to the table
        })
You have now embedded the table with the metadata tags attached to the text chunks.
Embed Document With Tags
As with tables, documents can have a JSON dictionary of metadata tags attached to their text chunks. The JSON dictionary is supplied in the body of the request with the `tags` parameter.
    response = requests.post(
        f'{envUrl}/layar/sourceDocument/{documentId}/createEmbeddings',
        headers={'Accept': 'application/json',
                 'Content-Type': 'application/json',
                 'Authorization': f"Bearer {token}",
                 'X-Vyasa-Data-Fabric': dataFabric},
        json={
            'tags': {
                'Pubmed_Author': 'D Schindler',
                'Section': 'abstract',
                'Methodology': 'double blind'
            }})
Searching For Tagged Chunks
Both `/layar/gpt/generate` and `/layar/gpt/retrieval/search` let you use metadata searches to further influence the chunks returned from the vector database. Both endpoints take the `searchFilters` and `searchOperator` parameters plus a query parameter (named `query` for the search endpoint and `rag_query` for the generate endpoint), which let you dictate how metadata tags are retrieved. Let's look at an example and break down each parameter.
    {
        "query": "What conclusion was reached in this research paper?",
        "searchOperator": "AND",
        "searchFilters": [{
            "documentId": "AZWvMlBe-OBvyKYpiNE-",
            "provider": "dldev02.vyasa.com",
            "documentType": "TABLE",
            "searchConditions": [
                {"field": "Pubmed_Author",
                 "operator": "==",
                 "value": "D Schindler"},
                {"field": "Section",
                 "operator": "==",
                 "value": "abstract"}
            ]
        }]
    }
query (required)
A string value of the RAG query that will be used along with the metadata filters. You must provide this in order to get results.
rag_query VS query
The `generate` and `search` endpoints use different parameter names for the query. The `generate` endpoint uses `rag_query`. The `search` endpoint uses `query`.
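To avoid mixing up the two query keys, the request body can be built by a small helper that picks the key from the endpoint path. This is a sketch only: `build_payload` and `QUERY_KEY` are hypothetical names, and the generate endpoint may require further fields that are not covered here.

```python
# Query key names are taken from the note above; everything else is illustrative.
QUERY_KEY = {
    "/layar/gpt/generate": "rag_query",
    "/layar/gpt/retrieval/search": "query",
}


def build_payload(endpoint, question, search_filters, search_operator="AND"):
    """Build a request body using the query key that `endpoint` expects."""
    if endpoint not in QUERY_KEY:
        raise ValueError(f"unsupported endpoint: {endpoint}")
    if search_operator not in ("AND", "OR"):
        raise ValueError("searchOperator must be 'AND' or 'OR'")
    return {
        QUERY_KEY[endpoint]: question,
        "searchOperator": search_operator,
        "searchFilters": search_filters,
    }


filters = [{
    "documentId": "<YOUR_DOCUMENT_ID>",
    "provider": "<YOUR_DATA_PROVIDER>",
    "documentType": "TABLE",
    "searchConditions": [
        {"field": "Section", "operator": "==", "value": "abstract"},
    ],
}]
question = "What conclusion was reached in this research paper?"

search_body = build_payload("/layar/gpt/retrieval/search", question, filters)
generate_body = build_payload("/layar/gpt/generate", question, filters)
```

The resulting dictionaries can be passed as the `json` argument of `requests.post`, as in the embedding requests earlier in this guide.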
searchOperator (required)
A string that takes one of two values, `AND` or `OR`, and dictates which results are returned. In the example above, `AND` is given, so a chunk is returned only if its metadata satisfies both the `Pubmed_Author` and the `Section` condition. Currently, this search operator applies to ALL filters.
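The difference between `AND` and `OR` can be illustrated locally. The sketch below is not how Layar implements filtering; `chunk_matches` is a hypothetical function that models only the `==` operator to show how the per-condition results are combined.

```python
def chunk_matches(metadata, conditions, search_operator):
    """Combine per-condition results the way AND / OR are described above.

    Only the '==' operator is modelled, to keep the illustration short.
    """
    results = [metadata.get(c["field"]) == c["value"] for c in conditions]
    return all(results) if search_operator == "AND" else any(results)


conditions = [
    {"field": "Pubmed_Author", "operator": "==", "value": "D Schindler"},
    {"field": "Section", "operator": "==", "value": "abstract"},
]
chunk_a = {"Pubmed_Author": "D Schindler", "Section": "abstract"}
chunk_b = {"Pubmed_Author": "D Schindler", "Section": "methods"}

print(chunk_matches(chunk_a, conditions, "AND"))  # True
print(chunk_matches(chunk_b, conditions, "AND"))  # False
print(chunk_matches(chunk_b, conditions, "OR"))   # True
```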
searchFilters (required)
A list of nested JSON objects, each containing the document or set you wish to search along with the metadata search conditions.
documentId (required)
A string containing the ID of the document you want to search. Required if `savedListId` is not provided.
provider (required)
A string containing the provider the documents are contained in. For the most part this is going to be the URL of the environment.
documentType (required)
A string containing the document type the search will target. The value is either `TABLE` or `DOCUMENT`.
documentType With Mixed Sets
Sets you create may contain a mix of `TABLE` and `DOCUMENT` items. In this case you must provide a search filter for both `TABLE` and `DOCUMENT` to ensure you get chunks back from both types.
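For a mixed set, the two filters differ only in `documentType`, so they can be generated together. This sketch assumes that `savedListId` is placed in the filter where `documentId` would otherwise go; `filters_for_mixed_set` is a hypothetical helper, so check the filter shape against your environment.

```python
def filters_for_mixed_set(saved_list_id, provider, search_conditions):
    """Return one search filter per document type for the same set."""
    return [
        {
            "savedListId": saved_list_id,  # assumed to replace documentId
            "provider": provider,
            "documentType": document_type,
            # Copy the list so the two filters do not share one object.
            "searchConditions": list(search_conditions),
        }
        for document_type in ("TABLE", "DOCUMENT")
    ]


mixed_filters = filters_for_mixed_set(
    "<YOUR_SAVED_LIST_ID>",
    "<YOUR_DATA_PROVIDER>",
    [{"field": "Section", "operator": "==", "value": "abstract"}],
)
```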
searchConditions (required)
A list of JSON dictionaries that contain the metadata filters.
field (required)
A string containing the field you want to filter on.
Nested JSON
Metadata can be nested JSON. Let's look at an example metadata JSON.
    {'Pubmed_Author' : 'D Schindler', 'Section' : 'abstract', 'Methodology' : {'approach': 'double blind', 'centers': 5, 'locations': ['North America', 'Europe']}}
The general structure for searching nested JSON is `parentkey_childkey`. If you wanted to filter on `locations`, the `field` would need to be `Methodology_locations`.
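To see which `field` names a nested metadata dictionary produces under the `parentkey_childkey` rule, the keys can be flattened locally. This is an illustrative sketch; `flatten_metadata` is a hypothetical helper, and applying the same rule to deeper nesting levels is an assumption.

```python
def flatten_metadata(metadata, parent=""):
    """Flatten nested dictionaries into parentkey_childkey field names."""
    flat = {}
    for key, value in metadata.items():
        name = f"{parent}_{key}" if parent else key
        if isinstance(value, dict):
            # Recurse so that child keys are prefixed with the parent key.
            flat.update(flatten_metadata(value, name))
        else:
            flat[name] = value
    return flat


metadata = {
    "Pubmed_Author": "D Schindler",
    "Section": "abstract",
    "Methodology": {
        "approach": "double blind",
        "centers": 5,
        "locations": ["North America", "Europe"],
    },
}
fields = flatten_metadata(metadata)
print(sorted(fields))
# ['Methodology_approach', 'Methodology_centers', 'Methodology_locations',
#  'Pubmed_Author', 'Section']
```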
operator (required)
A string detailing which operator should be used for the specified `field`. The valid operators are `<`, `!=`, `>=`, `not in`, `in`, `like`, `>`, `==`, and `<=`. Keep in mind the data type associated with the `field`.
For example, if the field's data type is a list, you can use `not in` and `in`, but you can't use `==` or the greater than / less than operators. Conversely, if the data type is an integer, you can't use `in` or `not in`.
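The operator rules above can be checked before a request is sent. The sketch below encodes only the rules stated in this section (lists and integers); `check_operator` is a hypothetical helper, and operators whose compatibility is not described here, such as `like`, are simply allowed through.

```python
VALID_OPERATORS = {"<", "!=", ">=", "not in", "in", "like", ">", "==", "<="}
LIST_ONLY = {"in", "not in"}
NOT_FOR_LISTS = {"==", "<", "<=", ">", ">="}


def check_operator(operator, field_value):
    """Return None if the operator suits the field's data type, else a message.

    field_value is a sample of the value stored under the field, used only
    to determine its data type.
    """
    if operator not in VALID_OPERATORS:
        return f"unknown operator: {operator!r}"
    if isinstance(field_value, list) and operator in NOT_FOR_LISTS:
        return f"{operator!r} cannot be used on a list field"
    if isinstance(field_value, int) and operator in LIST_ONLY:
        return f"{operator!r} cannot be used on an integer field"
    return None


print(check_operator("in", ["North America", "Europe"]))  # None
print(check_operator("==", ["North America", "Europe"]))
# '==' cannot be used on a list field
print(check_operator("in", 5))
# 'in' cannot be used on an integer field
```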
value (required)
The specific value you are looking for in the specified `field`.
Tips and Tricks
- `searchConditions` is a list of dictionaries. The `searchOperator` applies to every condition supplied. It's very important to decide which `searchOperator` you want to use, since that determines how you set up your `searchConditions` JSON.
- The `query` being used is important because it trumps the metadata filtering. For example, if the query has no relation to the chunk the metadata tags are attached to, the vector database may not return the chunk. This is because the search still relies on semantic and keyword similarity. If the vector database doesn't find the chunk relevant, it will not return it. Retriever settings also play a part: if the similarity threshold is too high or the number of chunks is too low, you may still miss chunks even though the metadata tags are present.
Searching For All Metadata
You can use the `searchFilters` payload to look up the metadata schema for a specific document by sending a POST request to `/layar/gpt/retrieval/getDocumentProperties`.
    # header is the same headers dictionary used in the embedding requests
    response = requests.post(
        f'{envUrl}/layar/gpt/retrieval/getDocumentProperties',
        headers=header,
        json={
            "searchFilters": [{
                "documentId": "<YOUR_DOCUMENT_ID>",
                "provider": "<YOUR_DATA_PROVIDER>",
                "documentType": "TABLE",
                "searchConditions": [{"field": "player",
                                      "operator": "==",
                                      "value": "Babe Ruth"}]
            }]
        }).json()
`documentType` can be changed to `DOCUMENT` if you are looking for metadata attached to a document.
Supplemental Recipes
Here are some recipes that go over how to use metadata search as a standalone retrieval or via the generate endpoint.