Utilizing Metadata Search

Introduction

Version 1.13 of Layar introduced the ability to attach metadata to documents and tables. Filtering on this metadata narrows retrieval to the relevant chunks, which makes RAG more accurate and improves the model's responses. This guide details how to embed documents with metadata tags, then use the generate and search endpoints to retrieve chunks that match metadata filters.

Embedding Metadata Tags

Let's go over how to embed documents and tables with metadata tags.

❗️
Considerations Before Embedding
Once embedded, metadata tags become immutable. For example, if you embed a document with the tag Section_Number and give it the string value '1.2.3.4', that tag's data type becomes permanent. If you later try to change Section_Number to an integer value, you will experience errors when re-embedding, because Section_Number must have a string value. The workaround is to name the tag Section_Number_1. You can then change the tag name slightly, to Section_Number_2, in the future, which lets you use a different data type.
We highly suggest confirming your metadata tag names and values before embedding.
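
One way to catch a type change before re-embedding is to keep your own record of the type each tag was first embedded with and compare new tags against it. The sketch below is a minimal illustration of that idea. The `known_types` registry and the `check_tag_types` helper are hypothetical local bookkeeping, not part of Layar, and how Layar itself classifies types (for example integers versus floats) is not covered by this guide.

```python
def check_tag_types(tags, known_types):
    """Return a list of conflicts between new tag values and recorded types.

    `known_types` maps a tag name to the Python type used when the tag was
    first embedded. It is a local, hypothetical record, not a Layar feature.
    """
    conflicts = []
    for name, value in tags.items():
        expected = known_types.get(name)
        # Tags that were never embedded before have no recorded type to violate.
        if expected is not None and type(value) is not expected:
            conflicts.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return conflicts


# Types recorded from an earlier embedding run (assumed example data).
known_types = {"Section_Number": str}

print(check_tag_types({"Section_Number": "1.2.3.4"}, known_types))
print(check_tag_types({"Section_Number": 1234}, known_types))
```

An empty list means the tags are consistent with what was embedded before. Any entry in the list points at a tag that would need a new name, such as Section_Number_2.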

Preparing Table Tags

Let's look at a simple table.

Takeaways	Disease
Causes inflammation of the lower intestines.	Crohn's Disease

Before uploading the document to Layar, a new column must be added that contains a JSON dictionary with the tags and values. Let's look at a simple JSON structure.

{'Pubmed_Author' : 'D Schindler',
 'Section' : 'abstract',
 'Methodology' : 'double blind'}

📘
Tag Names
You can't use spaces in the JSON key name. Any space must be replaced with an underscore, e.g. Pubmed_Author.
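
If tag names come from a spreadsheet or user input, they can be cleaned up before embedding. The following is a minimal sketch; `normalize_tag_names` is a hypothetical helper, and the duplicate check is an extra safeguard added here, not a documented Layar rule.

```python
def normalize_tag_names(tags):
    """Return a copy of `tags` with spaces in key names replaced by underscores."""
    normalized = {}
    for name, value in tags.items():
        new_name = name.strip().replace(" ", "_")
        # Two different names could collapse into one after replacement.
        if new_name in normalized:
            raise ValueError(f"duplicate tag name after normalization: {new_name}")
        normalized[new_name] = value
    return normalized


tags = normalize_tag_names({"Pubmed Author": "D Schindler", "Section": "abstract"})
print(tags)  # {'Pubmed_Author': 'D Schindler', 'Section': 'abstract'}
```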

Now we add the JSON to a new column. This column can be named anything, but it's best to choose a name that describes what the column contains.

Takeaways	Disease	metadata
Causes inflammation of the lower intestines.	Crohn's Disease	{'Pubmed_Author' : 'D Schindler', 'Section' : 'abstract', 'Methodology' : 'double blind'}
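
For larger tables, the metadata column can be added programmatically. The sketch below uses only the Python standard library and rests on several assumptions: `add_metadata_column` is a hypothetical helper, the tab-separated layout simply mirrors the table shown here, and the tags are serialized with `repr()` to reproduce the single-quoted style used in this guide. Whether Layar also accepts strict double-quoted JSON in this column is not confirmed by this guide.

```python
import csv
import io


def add_metadata_column(rows, tags_per_row, column_name="metadata"):
    """Return new row dicts with each tag dictionary serialized into `column_name`."""
    if len(rows) != len(tags_per_row):
        raise ValueError("rows and tags_per_row must have the same length")
    tagged = []
    for row, tags in zip(rows, tags_per_row):
        new_row = dict(row)
        # repr() of a dict of plain strings yields the single-quoted form.
        new_row[column_name] = repr(tags)
        tagged.append(new_row)
    return tagged


rows = [{"Takeaways": "Causes inflammation of the lower intestines.",
         "Disease": "Crohn's Disease"}]
tags = [{"Pubmed_Author": "D Schindler", "Section": "abstract",
         "Methodology": "double blind"}]

tagged_rows = add_metadata_column(rows, tags)

# Write a tab-separated table to an in-memory buffer.
# Replace the buffer with an open file handle to save the table to disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(tagged_rows[0].keys()), delimiter="\t")
writer.writeheader()
writer.writerows(tagged_rows)
print(buffer.getvalue())
```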

This document can now be uploaded to Layar via the API or the front end. See Upload Documents for details.

Embed Table With Tags

Once the document is uploaded, you need to embed it using the metadataColumn parameter in the JSON payload of the request. The value must be the name of the column containing the metadata. The request would look like this.

import requests

# envUrl, documentId, token, and dataFabric are values for your environment.
response = requests.post(
    f'{envUrl}/layar/sourceDocument/{documentId}/createEmbeddings',
    headers={
        'Accept': 'application/json',
        'Content-Type': 'application/json',
        'Authorization': f'Bearer {token}',
        'X-Vyasa-Data-Fabric': dataFabric,
    },
    json={
        'metadataColumn': 'metadata',  # The column we added to the table
    },
)

You have now embedded the table with the metadata filters attached to the text chunks.

Embed Document With Tags

As with tables, documents can have a JSON dictionary of metadata tags attached to text chunks. The JSON dictionary is supplied in the body of the request with the tags parameter.

response = requests.post(
    f'{envUrl}/layar/sourceDocument/{documentId}/createEmbeddings',
    headers={
        'Accept': 'application/json',
        'Content-Type': 'application/json',
        'Authorization': f'Bearer {token}',
        'X-Vyasa-Data-Fabric': dataFabric,
    },
    json={
        'tags': {
            'Pubmed_Author': 'D Schindler',
            'Section': 'abstract',
            'Methodology': 'double blind',
        },
    },
)
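
Since the table and document requests differ only in their JSON body, they can share one helper. The sketch below builds the request arguments without sending anything. `build_embedding_request` is a hypothetical helper, the example URL and IDs are placeholders, and requiring exactly one of `metadata_column` or `tags` is a choice made in this sketch, as this guide does not say how the API treats a request containing both.

```python
def build_embedding_request(env_url, document_id, token, data_fabric,
                            metadata_column=None, tags=None):
    """Build keyword arguments for a createEmbeddings POST request.

    Pass `metadata_column` for a table or `tags` for a document.
    Requiring exactly one of the two is a choice made in this sketch.
    """
    if (metadata_column is None) == (tags is None):
        raise ValueError("provide exactly one of metadata_column or tags")

    if metadata_column is not None:
        body = {"metadataColumn": metadata_column}
    else:
        body = {"tags": dict(tags)}

    return {
        "url": f"{env_url}/layar/sourceDocument/{document_id}/createEmbeddings",
        "headers": {
            "Accept": "application/json",
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
            "X-Vyasa-Data-Fabric": data_fabric,
        },
        "json": body,
    }


# Placeholder values; substitute the ones for your environment.
table_request = build_embedding_request(
    "https://layar.example.com", "doc-123", "my-token", "my-fabric",
    metadata_column="metadata",
)
document_request = build_embedding_request(
    "https://layar.example.com", "doc-456", "my-token", "my-fabric",
    tags={"Pubmed_Author": "D Schindler", "Section": "abstract"},
)
print(table_request["json"])
print(document_request["json"])
```

Either dictionary can then be sent with `requests.post(**table_request)`.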

Searching For Tagged Chunks

Both /layar/gpt/generate and /layar/gpt/retrieval/search allow you to use metadata searches to further influence the chunks returned from the vector database. Both endpoints take the searchFilters, searchOperator, and query parameters, which let you dictate how metadata tags are retrieved. Let's look at an example and break down each parameter.

{
  "query": "What conclusion was reached in this research paper?",
  "searchOperator": "AND",
  "searchFilters": [{
    "documentId": "AZWvMlBe-OBvyKYpiNE-",
    "provider": "dldev02.vyasa.com",
    "documentType": "TABLE",
    "searchConditions": [{"field": "Pubmed_Author",
                          "operator": "==",
                          "value": "D Schindler"},
                         {"field": "Section",
                          "operator": "==",
                          "value": "abstract"}
                        ]
  }]
}

query (required)

A string value of the RAG query that will be used along with the metadata filters. You must provide this in order to get results.

📘
rag_query VS query
The generate and search endpoints use different parameter names for the query. The generate endpoint uses rag_query. The search endpoint uses query.
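
The difference in parameter names can be hidden behind a small helper so the same filters work for both endpoints. This is a sketch under assumptions: `build_search_payload` and `QUERY_KEYS` are hypothetical names, and the generate endpoint may expect further parameters that this guide does not cover.

```python
# Query parameter name per endpoint, as described in this guide.
QUERY_KEYS = {"search": "query", "generate": "rag_query"}


def build_search_payload(question, search_filters, search_operator="AND",
                         endpoint="search"):
    """Build a metadata search payload using the right query key for `endpoint`."""
    if endpoint not in QUERY_KEYS:
        raise ValueError(f"endpoint must be one of {sorted(QUERY_KEYS)}")
    return {
        QUERY_KEYS[endpoint]: question,
        "searchOperator": search_operator,
        "searchFilters": list(search_filters),
    }


filters = [{
    "documentId": "<YOUR_DOCUMENT_ID>",
    "provider": "<YOUR_DATA_PROVIDER>",
    "documentType": "TABLE",
    "searchConditions": [
        {"field": "Pubmed_Author", "operator": "==", "value": "D Schindler"},
    ],
}]

question = "What conclusion was reached in this research paper?"
search_payload = build_search_payload(question, filters, endpoint="search")
generate_payload = build_search_payload(question, filters, endpoint="generate")
print(sorted(search_payload))
print(sorted(generate_payload))
```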

searchOperator (required)

A string value that can take two values, AND or OR. This dictates which results are returned. In the example above, AND is given, so a chunk is returned only if its metadata satisfies both the Pubmed_Author and the Section conditions. With OR, a chunk that satisfies either condition can be returned. Currently, this search operator applies to ALL filters.
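
The difference between AND and OR can be illustrated locally. The sketch below mimics the combining logic on a single chunk's metadata for the == and != operators only. It is an illustration of the semantics described here, not Layar's implementation, and `chunk_matches` is a hypothetical helper.

```python
import operator

# Only two comparison operators are modeled in this illustration.
COMPARATORS = {"==": operator.eq, "!=": operator.ne}


def chunk_matches(metadata, conditions, search_operator):
    """Return True if `metadata` passes `conditions` combined with AND or OR."""
    if search_operator not in ("AND", "OR"):
        raise ValueError("search_operator must be 'AND' or 'OR'")
    results = [
        # A condition on a tag the chunk does not carry counts as not satisfied.
        cond["field"] in metadata
        and COMPARATORS[cond["operator"]](metadata[cond["field"]], cond["value"])
        for cond in conditions
    ]
    return all(results) if search_operator == "AND" else any(results)


conditions = [
    {"field": "Pubmed_Author", "operator": "==", "value": "D Schindler"},
    {"field": "Section", "operator": "==", "value": "abstract"},
]
chunk = {"Pubmed_Author": "D Schindler", "Section": "methods"}

print(chunk_matches(chunk, conditions, "AND"))  # False: Section does not match
print(chunk_matches(chunk, conditions, "OR"))   # True: Pubmed_Author matches
```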

searchFilters (required)

A list of nested JSON objects, each containing the document or set you wish to search along with the metadata search conditions.

documentId (required)

A string containing the documentId you want to search on. Required if savedListId is not provided.

provider (required)

A string containing the provider the documents are contained in. For the most part this is going to be the URL of the environment.

documentType (required)

A string containing the document type the search will target. The value must be either TABLE or DOCUMENT.

📘
documentType With Mixed Sets
Sets you create may contain a mix of TABLE and DOCUMENT items. In this case, you must provide one search filter for TABLE and another for DOCUMENT to ensure you get chunks back from both types.
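
For a mixed set, the two filters can be generated from the same conditions. The sketch below assumes that savedListId is supplied inside each filter in place of documentId. This guide mentions savedListId only as an alternative to documentId, so the exact placement is an assumption, and `filters_for_mixed_set` is a hypothetical helper.

```python
def filters_for_mixed_set(saved_list_id, provider, conditions):
    """Return one search filter per document type for a set holding both types."""
    return [
        {
            # Assumption: savedListId takes the place of documentId in the filter.
            "savedListId": saved_list_id,
            "provider": provider,
            "documentType": document_type,
            "searchConditions": [dict(cond) for cond in conditions],
        }
        for document_type in ("TABLE", "DOCUMENT")
    ]


mixed_filters = filters_for_mixed_set(
    "<YOUR_SAVED_LIST_ID>",
    "<YOUR_DATA_PROVIDER>",
    [{"field": "Section", "operator": "==", "value": "abstract"}],
)
print([f["documentType"] for f in mixed_filters])  # ['TABLE', 'DOCUMENT']
```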

searchConditions (required)

A list of JSON dictionaries that contain the metadata filters.

field (required)

A string containing the field you want to filter on.

📘
Nested JSON
Metadata can be nested JSON. Let's look at an example metadata JSON.
{'Pubmed_Author' : 'D Schindler',
 'Section' : 'abstract',
 'Methodology' : {
   'approach': 'double blind',
   'centers': 5,
   'locations': ['North America', 'Europe']}}
The general structure for searching nested JSON is parentkey_childkey.
If you wanted to filter for locations, the field would need to be Methodology_locations.
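
To see which field names a nested metadata dictionary produces under the parentkey_childkey pattern, the keys can be flattened locally. This is a sketch of the naming convention only, not of how Layar stores metadata. `flatten_field_names` is a hypothetical helper, and applying the same pattern to more than two levels of nesting is an assumption.

```python
def flatten_field_names(metadata, parent=""):
    """Flatten nested dictionaries into parentkey_childkey field names."""
    flat = {}
    for key, value in metadata.items():
        name = f"{parent}_{key}" if parent else key
        if isinstance(value, dict):
            # Recurse into nested dictionaries, carrying the parent name along.
            flat.update(flatten_field_names(value, name))
        else:
            flat[name] = value
    return flat


metadata = {
    "Pubmed_Author": "D Schindler",
    "Section": "abstract",
    "Methodology": {
        "approach": "double blind",
        "centers": 5,
        "locations": ["North America", "Europe"],
    },
}

for field in flatten_field_names(metadata):
    print(field)
```

The printed names include Methodology_approach, Methodology_centers, and Methodology_locations, which are the values to use in the field parameter.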

operator (required)

A string value detailing which operator should be used for the specific field. The valid operators are <, <=, >, >=, ==, !=, in, not in, and like. It's important to keep in mind what data type is associated with the field.

For example, if the data type for the field is a list, you can use not in and in, but you can't use == or the greater than / less than operators. Conversely, if the data type is an integer, you can't use in or not in.
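
These rules can be checked before a request is sent. The sketch below encodes only the restrictions stated here for list and integer fields. Combinations this guide does not mention, such as like or != on a list, are left unchecked. `check_operator` is a hypothetical helper that takes a sample of the value stored under the tag.

```python
VALID_OPERATORS = {"<", "<=", ">", ">=", "==", "!=", "in", "not in", "like"}
MEMBERSHIP_OPERATORS = {"in", "not in"}
DISALLOWED_FOR_LISTS = {"==", "<", "<=", ">", ">="}


def check_operator(op, stored_value):
    """Raise ValueError if `op` is ruled out for the type of `stored_value`."""
    if op not in VALID_OPERATORS:
        raise ValueError(f"unknown operator: {op!r}")
    if isinstance(stored_value, list) and op in DISALLOWED_FOR_LISTS:
        raise ValueError(f"{op!r} cannot be used on a list field; use 'in' or 'not in'")
    # bool is a subclass of int in Python, so exclude it explicitly.
    is_integer = isinstance(stored_value, int) and not isinstance(stored_value, bool)
    if is_integer and op in MEMBERSHIP_OPERATORS:
        raise ValueError(f"{op!r} cannot be used on an integer field")
    return True


print(check_operator("in", ["North America", "Europe"]))  # True
print(check_operator(">=", 5))                            # True
```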

value (required)

The specific value you are looking for in the specified field.

📘
Tips and Tricks

searchConditions is a list of dictionaries. The searchOperator applies to every condition supplied. It's very important to decide which searchOperator you want to use, since that determines how you set up your searchConditions JSON.

The query being used is important because it can override the metadata filtering. For example, if the query has no relation to the chunk the metadata tags are attached to, the vector database may not return that chunk, because the search still relies on semantic and keyword similarity. If the vector database doesn't find the chunk relevant, it will not return it. Retriever settings also play a part: if the similarity threshold is too high or the number of chunks is too low, you may still miss chunks even though the metadata tags are present.

Searching For All Metadata

You can use the searchFilters payload to look up the metadata schema for specific documents by sending a POST request to /layar/gpt/retrieval/getDocumentProperties.

response = requests.post(f'{envUrl}/layar/gpt/retrieval/getDocumentProperties',
                         headers = header,
                         json = {
                            "searchFilters": [{
                                "documentId": "<YOUR_DOCUMENT_ID>",
                                "provider": "<YOUR_DATA_PROVIDER>",
                                "documentType": "TABLE",
                                "searchConditions" : [{"field" : "player",
                                                        "operator": '==',
                                                        "value": 'Babe Ruth'}
                                                       ]
                            }]
                         }).json()

documentType can be changed to DOCUMENT if you are looking for metadata attached to a document.

Supplemental Recipes

Here are some recipes that go over how to use metadata search as a standalone retrieval or via the generate endpoint.