Text Clustering Parameters
The clustering endpoint /sourceDocument/startClusterDocsJob requires several parameters, all of which are supplied in the request header. This guide goes over what those values are and what they do.
Create Header
The header for this request uses X-Vyasa-Advanced-Parameters, which takes a dictionary of values. Its task value determines the method of text clustering, either bertopic or umaphdbscan. Each task takes its own set of advanced parameters.
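HTTP header values are strings, so a dictionary generally has to be serialized before the request is sent. The following is a minimal sketch, assuming the service accepts the advanced parameters as a JSON-encoded string (this guide does not confirm the encoding). The helper name build_headers is hypothetical.

```python
import json


def build_headers(token, fabric_id, advanced_params):
    """Return request headers with the advanced parameters JSON-encoded.

    Assumption: the API parses X-Vyasa-Advanced-Parameters as JSON.
    """
    return {
        "Accept": "application/json",
        "Content-Type": "application/json",
        "Authorization": f"Bearer {token}",
        "X-Vyasa-Client": "layar",
        "X-Vyasa-Data-Providers": "sandbox.certara.ai",
        "X-Vyasa-Data-Fabric": fabric_id,
        # Header values must be strings, so the dictionary is serialized.
        "X-Vyasa-Advanced-Parameters": json.dumps(advanced_params),
    }


headers = build_headers("YOUR_TOKEN", "YOUR_FABRIC_ID", {"task": "bertopic"})
print(headers["X-Vyasa-Advanced-Parameters"])
```

Keeping the serialization in one place means the same helper can be reused for either task; only the advanced_params dictionary changes.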
bertopic
When using the bertopic task, there are a variety of advanced parameters that can be used.
# token holds your API access token
header = {'Accept': 'application/json',
          'Content-Type': 'application/json',
          'Authorization': f"Bearer {token}",
          'X-Vyasa-Client': 'layar',
          'X-Vyasa-Data-Providers': 'sandbox.certara.ai',
          'X-Vyasa-Data-Fabric': 'YOUR_FABRIC_ID',
          'X-Vyasa-Advanced-Parameters':
              {'task': 'bertopic',
               'remove_stops': True,
               'stopwords': [],
               'ngram_range': [1, 1],
               'min_topic_size': 20,
               'num_keywords': 1,
               'ngram_weight': 4}
          }
remove_stops
A boolean value that determines if stop words should be removed when clustering. This defaults to True if not provided.
What Is A Stop Word?
A stop word is an inconsequential/filler word. Examples would be conjunctions like "and" or "but".
stopwords
A list of strings that sets specific words to be treated as stop words. This value is only used if remove_stops is set to False.
Should You Use Custom Stopwords?
It's suggested not to use custom stop words initially. Setting remove_stops to True causes the API to use generic stop words commonly found in the English language.
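To make the effect of stop word removal concrete, here is a small illustrative sketch. The word set and the remove_stop_words function are hypothetical and do not reflect the list or the logic the API actually uses.

```python
# Illustrative only: a tiny generic stop word set, not the API's actual list.
GENERIC_STOP_WORDS = {"a", "an", "and", "but", "of", "or", "the", "to"}


def remove_stop_words(text, stop_words=GENERIC_STOP_WORDS):
    """Drop stop words from whitespace-separated text (case-insensitive)."""
    return [word for word in text.split() if word.lower() not in stop_words]


# Generic stop words are filtered out, leaving the meaningful terms.
print(remove_stop_words("The drug and the placebo"))  # ['drug', 'placebo']

# A custom list treats only the supplied words as stop words.
print(remove_stop_words("patients received the drug", {"patients"}))
```

A custom list is mostly useful when a word is so common in your documents that it carries no signal, even though it would not appear in a generic English list.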
ngram_range
A range that determines how the model associates topics with words. For example, "New York" is two words but represents a city. If an ngram_range of (1,1) is given, "New" and "York" would be looked at as separate words. Using a range of (1,2) would cause it to see "New York" as one word.
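The "New York" example can be sketched in a few lines. This is a simplified illustration of n-gram extraction in general, not the model's own tokenizer; extract_ngrams is a hypothetical helper.

```python
def extract_ngrams(text, ngram_range):
    """Return every n-gram of text whose length falls within ngram_range."""
    words = text.split()
    low, high = ngram_range
    ngrams = []
    for n in range(low, high + 1):
        # Slide a window of n words across the text.
        for start in range(len(words) - n + 1):
            ngrams.append(" ".join(words[start:start + n]))
    return ngrams


print(extract_ngrams("New York", (1, 1)))  # ['New', 'York']
print(extract_ngrams("New York", (1, 2)))  # ['New', 'York', 'New York']
```

Note that in this sketch a range of (1, 2) keeps the single words alongside the two-word phrase, so "New York" becomes available as a candidate term without discarding "New" and "York".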
min_topic_size
An integer that determines the minimum size of a topic. Giving a value that is too large can result in no clusters being generated, while giving a value that is too small can result in small, irrelevant clusters. This value defaults to 20 if not supplied.
num_keywords
An integer that determines the number of keywords used for clustering. Defaults to 1 if not provided and requires a value of 1 or higher. Raising the value causes more keywords to be used for clustering but can result in irrelevant clusters.
ngram_weight
An integer that determines how the model chooses N-Grams that are relevant to one another. The value dictates how many adjacent N-Grams are needed in order to determine if text should be in the same cluster. Defaults to 4 if not provided. Increasing the value will result in tighter clustering, while a lower value results in looser clustering.
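The bertopic parameters and their defaults can be collected in a small helper so that only the values you want to change need to be spelled out. This is a sketch under stated assumptions: bertopic_params is a hypothetical function, and the stopwords and ngram_range defaults are taken from the example header rather than from a documented default.

```python
import copy

# Defaults stated in this guide. The stopwords and ngram_range entries come
# from the example header and are assumptions, not documented defaults.
BERTOPIC_DEFAULTS = {
    "remove_stops": True,
    "stopwords": [],
    "ngram_range": [1, 1],
    "min_topic_size": 20,
    "num_keywords": 1,
    "ngram_weight": 4,
}


def bertopic_params(**overrides):
    """Merge overrides into the defaults and run basic sanity checks."""
    unknown = set(overrides) - set(BERTOPIC_DEFAULTS)
    if unknown:
        raise ValueError(f"Unknown parameters: {sorted(unknown)}")
    # Deep copy so callers never share the mutable default lists.
    params = {"task": "bertopic", **copy.deepcopy(BERTOPIC_DEFAULTS), **overrides}
    if params["num_keywords"] < 1:
        raise ValueError("num_keywords must be 1 or higher")
    low, high = params["ngram_range"]
    if not 1 <= low <= high:
        raise ValueError("ngram_range must satisfy 1 <= low <= high")
    return params


# Treat two-word phrases as candidates and allow smaller topics.
print(bertopic_params(ngram_range=[1, 2], min_topic_size=10))
```

Validating locally catches simple mistakes, such as a num_keywords of 0, before a clustering job is submitted.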
umaphdbscan
Along with the task, hyperparameters, max_iters and autotune need to be provided in the header.
# token holds your API access token
header = {'Accept': 'application/json',
          'Content-Type': 'application/json',
          'Authorization': f"Bearer {token}",
          'X-Vyasa-Client': 'layar',
          'X-Vyasa-Data-Providers': 'sandbox.certara.ai',
          'X-Vyasa-Data-Fabric': 'YOUR_FABRIC_ID',
          'X-Vyasa-Advanced-Parameters':
              {"task": "umaphdbscan",
               "max_iters": None,
               "autotune": True,
               "hyperparameters": {"n_neighbors": 2, "n_components": 1, "min_cluster_size": 1}}
          }
max_iters
An integer that determines how many evaluations are done before concluding the clustering project. Defaults to None if not supplied.
autotune
A boolean value that determines if the preset hyperparameters are used. Defaults to True if not provided.
hyperparameters
A dictionary that contains three values: n_neighbors, n_components and min_cluster_size.
n_neighbors
An integer greater than one. Increasing this value will result in larger cluster sizes.
n_components
An integer value that reduces the dimensionality of the embedding. Defaults to 5 if no value is given. Giving too low a value will result in improper clustering.
min_cluster_size
An integer value, defaults to 10 if no value is given. A smaller value causes small clusters to be created, resulting in more clusters. It's better to increase this value than to decrease it.
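The umaphdbscan parameters can be assembled the same way. The sketch below uses a hypothetical helper, umaphdbscan_params, that applies the defaults described above and enforces the one constraint this guide states (n_neighbors greater than one). Since no default is documented for n_neighbors, it is a required argument here.

```python
def umaphdbscan_params(n_neighbors, n_components=5, min_cluster_size=10,
                       max_iters=None, autotune=True):
    """Build the advanced parameters dictionary for the umaphdbscan task."""
    if n_neighbors <= 1:
        raise ValueError("n_neighbors must be greater than one")
    return {
        "task": "umaphdbscan",
        "max_iters": max_iters,
        "autotune": autotune,
        "hyperparameters": {
            "n_neighbors": n_neighbors,
            "n_components": n_components,
            "min_cluster_size": min_cluster_size,
        },
    }


# Larger neighborhoods and a higher minimum cluster size favor fewer,
# larger clusters.
print(umaphdbscan_params(n_neighbors=15, min_cluster_size=25))
```

The returned dictionary has the same shape as the X-Vyasa-Advanced-Parameters value in the example header.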
Putting It All Together!
To see how these parameters come together to create a text clustering project using document search, view the guide Text Clustering Project with Document Search.
