Text Clustering with Document Search
Text clustering organizes documents based on the sentences and phrases used in them. Layar lets you run text clustering on the results of a document search, which limits the set of documents that are clustered.
Pre-Reqs
API requests must be authenticated before a document search can be run. Make sure you have already followed the instructions for importing dependencies and authentication from the Getting Started Guide.
Check Your Imported Modules
Make sure you have imported the requests and json modules before proceeding with this guide.
Create Header
The header for this request uses X-Vyasa-Advanced-Parameters. This parameter takes a dictionary of values and determines the method of text clustering, either bertopic or umaphdbscan. Each task requires different advanced parameters. This example uses the bertopic task.
Advanced Parameters Details
You can find details on the various parameters here: Clustering Parameters
header = {'Accept': 'application/json',
          'Content-Type': 'application/json',
          'Authorization': f"Bearer {token}",
          'X-Vyasa-Client': 'layar',
          'X-Vyasa-Data-Providers': 'sandbox.certara.ai',
          'X-Vyasa-Data-Fabric': 'YOUR_FABRIC_ID',
          # requests only accepts str or bytes header values, so the dictionary
          # is serialized here (assumes the API expects a JSON-encoded value).
          'X-Vyasa-Advanced-Parameters': json.dumps(
              {'task': 'bertopic',
               'remove_stops': True,
               'stopwords': [],
               'ngram_range': (1, 1),  # serialized as the JSON array [1, 1]
               'min_topic_size': 20,
               'num_keywords': 1,
               'ngram_weight': 4})
          }
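If you run several clustering jobs with different parameters, it can help to build the header in one place. The following is a minimal sketch, not part of the Layar API: the helper name build_cluster_header is hypothetical, and it assumes the server accepts a JSON-encoded string for X-Vyasa-Advanced-Parameters.

```python
import json

def build_cluster_header(token, fabric_id, advanced_params,
                         provider="sandbox.certara.ai"):
    """Return request headers with the clustering parameters JSON-encoded.

    Hypothetical helper: assumes the API parses the header value as JSON.
    """
    return {
        "Accept": "application/json",
        "Content-Type": "application/json",
        "Authorization": f"Bearer {token}",
        "X-Vyasa-Client": "layar",
        "X-Vyasa-Data-Providers": provider,
        "X-Vyasa-Data-Fabric": fabric_id,
        # json.dumps turns True into true and tuples into JSON arrays,
        # producing a plain string that is valid as an HTTP header value.
        "X-Vyasa-Advanced-Parameters": json.dumps(advanced_params),
    }

# Placeholder token and fabric ID; substitute your own values.
example_header = build_cluster_header(
    token="YOUR_TOKEN",
    fabric_id="YOUR_FABRIC_ID",
    advanced_params={"task": "bertopic",
                     "ngram_range": (1, 1),
                     "min_topic_size": 20},
)
```

Keeping the parameters as a normal dictionary until the last moment makes it easy to switch between bertopic and umaphdbscan without rewriting the rest of the header.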
Create Request Body
The request body will mimic the body used when performing a document search. For more details on possible values you can use, review Search Documents. We will be using a simple query search to find documents for clustering.
body = {
'q': "JAK",
'rows' : 20
}
Clustering Request
We can use the /sourceDocument/startClusterDocsJob endpoint to create the project from a document search and get the ID of the project.
textClusterUrl = f'{envUrl}/sourceDocument/startClusterDocsJob'
projectId = requests.post(textClusterUrl,
headers = header,
json = body
).json().get('id')
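The one-line request above returns None for the ID if the call fails or the response has no id field. A slightly more defensive variant might look like the sketch below. The function name start_cluster_job is hypothetical, and the post argument is assumed to be requests.post or any callable with the same signature.

```python
def start_cluster_job(post, url, headers, body):
    """Submit the clustering job and return the project ID.

    `post` is expected to behave like requests.post (assumption).
    """
    response = post(url, headers=headers, json=body)
    # Raise on 4xx/5xx responses instead of silently returning a None ID.
    response.raise_for_status()
    project_id = response.json().get("id")
    if project_id is None:
        raise ValueError("Response did not contain an 'id' field")
    return project_id

# Usage, with the variables defined earlier in this guide:
# projectId = start_cluster_job(requests.post, textClusterUrl, header, body)
```

Passing the post callable in as an argument keeps the function easy to exercise without a live Layar instance.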
Project Status Request
Now that we have the projectId, we can use /projectComputation/{projectId} to get the status of the project. X-Vyasa-Advanced-Parameters and a request body are not needed for this request.
projectStatusUrl = f'{envUrl}/projectComputation/{projectId}'
header = {'Accept': 'application/json',
'Content-Type': 'application/json',
'Authorization': f"Bearer {token}",
'X-Vyasa-Client': 'layar',
'X-Vyasa-Data-Providers' : 'sandbox.certara.ai',
'X-Vyasa-Data-Fabric' : 'YOUR_FABRIC_ID'
}
status = requests.get(projectStatusUrl,
headers = header
).json().get('status')
print(status)
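Clustering jobs may take a while, so a single status check is often not enough. The sketch below polls until the project reaches a finished state. The function name wait_for_project is hypothetical, and the status strings in done_states are placeholders: this guide does not list the values the endpoint returns, so replace them with the ones your instance reports.

```python
import time

def wait_for_project(fetch_status, done_states=("COMPLETED", "FAILED"),
                     interval=5.0, max_attempts=60, sleep=time.sleep):
    """Call fetch_status() repeatedly until it returns a value in done_states.

    done_states holds placeholder values (assumption), not documented statuses.
    """
    for _ in range(max_attempts):
        status = fetch_status()
        if status in done_states:
            return status
        sleep(interval)  # wait before the next check
    raise TimeoutError(f"Project not finished after {max_attempts} checks")

# Usage, with the variables defined earlier in this guide:
# final_status = wait_for_project(
#     lambda: requests.get(projectStatusUrl, headers=header).json().get('status'))
```

The max_attempts limit keeps the loop from running forever if the job stalls or the status strings do not match.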