Text Clustering Parameters
Using the clustering endpoint /sourceDocument/startClusterDocsJob requires several parameters. These values are supplied in the request header. This guide covers what those values are and what they do.
Create Header
The header for this request uses X-Vyasa-Advanced-Parameters. This parameter takes a dictionary of values, including the task, which selects the text clustering method: either bertopic or umaphdbscan. Each task takes a different set of advanced parameters.
bertopic
When using the bertopic task, a variety of advanced parameters can be used.
# token holds the bearer token obtained when authenticating
header = {'Accept': 'application/json',
          'Content-Type': 'application/json',
          'Authorization': f"Bearer {token}",
          'X-Vyasa-Client': 'layar',
          'X-Vyasa-Data-Providers': 'sandbox.certara.ai',
          'X-Vyasa-Data-Fabric': 'YOUR_FABRIC_ID',
          'X-Vyasa-Advanced-Parameters':
              {'task': 'bertopic',
               'remove_stops': True,
               'stopwords': [],
               'ngram_range': [1, 1],
               'min_topic_size': 20,
               'num_keywords': 1,
               'ngram_weight': 4}
          }
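HTTP header values are strings, and Python HTTP clients generally expect each header value to be a string rather than a nested dictionary. The sketch below shows one way to build the header with the advanced parameters encoded as a JSON string. The helper name build_headers is hypothetical, and whether the server expects JSON encoding specifically is an assumption; check how your client library sends this header.

```python
import json


def build_headers(token: str, fabric_id: str, advanced_params: dict) -> dict:
    """Return request headers with the advanced parameters encoded as a string."""
    return {
        "Accept": "application/json",
        "Content-Type": "application/json",
        "Authorization": f"Bearer {token}",
        "X-Vyasa-Client": "layar",
        "X-Vyasa-Data-Providers": "sandbox.certara.ai",
        "X-Vyasa-Data-Fabric": fabric_id,
        # Assumption: the dictionary is sent as a JSON string.
        "X-Vyasa-Advanced-Parameters": json.dumps(advanced_params),
    }


headers = build_headers(
    "YOUR_TOKEN",
    "YOUR_FABRIC_ID",
    {"task": "bertopic", "remove_stops": True, "ngram_range": [1, 1]},
)
```

Every value in the resulting dictionary is a plain string, so it can be passed to an HTTP client as-is.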
remove_stops
A boolean value that determines if stop words should be removed when clustering. This defaults to True if not provided.
What Is A Stop Word?
A stop word is an inconsequential/filler word. Examples would be conjunctions like "and" or "but".
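To make the idea concrete, here is a minimal sketch of stop-word removal on a single sentence. The word list and the function remove_stop_words are illustrative only; this is not the API's implementation, and the generic English list the API uses is not shown in this guide.

```python
# Hypothetical, tiny stop-word list for illustration only.
STOP_WORDS = {"and", "but", "the", "a", "of"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in stop_words]


filtered = remove_stop_words("The cause and the effect of a drug")
print(filtered)  # ['cause', 'effect', 'drug']
```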
stopwords
A list of strings that sets specific words to be treated as stop words. This value is only used if remove_stops is set to False.
Should You Use Custom Stopwords?
We suggest not supplying custom stop words at first. Setting remove_stops to True causes the API to use generic stop words commonly found in the English language.
ngram_range
A range that determines how the model associates topics with words. For example, "New York" is two words but represents a single city. If an ngram_range of (1,1) is given, "New" and "York" are treated as separate words. A range of (1,2) also lets the model treat "New York" as a single term.
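The effect of the range can be sketched with a small n-gram extractor. The function extract_ngrams is hypothetical and uses simple whitespace tokenization; the tokenization the service applies may differ.

```python
def extract_ngrams(text, ngram_range):
    """Return all word n-grams whose length falls inside ngram_range (inclusive)."""
    low, high = ngram_range
    tokens = text.split()
    ngrams = []
    for n in range(low, high + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


unigrams = extract_ngrams("New York subway", (1, 1))
uni_and_bigrams = extract_ngrams("New York subway", (1, 2))
print(unigrams)         # ['New', 'York', 'subway']
print(uni_and_bigrams)  # ['New', 'York', 'subway', 'New York', 'York subway']
```

With (1,2), "New York" appears as a candidate term alongside the individual words.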
min_topic_size
An integer that sets the minimum size of a topic. A value that is too large can result in no clusters being generated, while a value that is too small can result in small, irrelevant clusters. Defaults to 20 if not supplied.
num_keywords
An integer that determines the number of keywords used for clustering. Defaults to 1 if not provided and requires a value of 1 or higher. Raising the value causes more keywords to be used for clustering but can result in irrelevant clusters.
ngram_weight
An integer that determines how the model chooses N-Grams that are relevant to one another. The value dictates how many adjacent N-Grams are needed in order to determine whether text belongs in the same cluster. Defaults to 4 if not provided. Increasing the value results in tighter clustering, while a lower value results in looser clustering.
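The bertopic parameters and their documented defaults can be collected in a small helper, sketched below. The function bertopic_params is hypothetical. The defaults for remove_stops, min_topic_size, num_keywords, and ngram_weight come from the descriptions above; the defaults for stopwords and ngram_range are assumptions taken from the example header, and the ngram_range check is a local sanity check rather than a documented rule.

```python
def bertopic_params(remove_stops=True, stopwords=None, ngram_range=(1, 1),
                    min_topic_size=20, num_keywords=1, ngram_weight=4):
    """Build the advanced-parameters dictionary for the bertopic task."""
    if num_keywords < 1:
        # The guide states num_keywords requires a value of 1 or higher.
        raise ValueError("num_keywords must be 1 or higher")
    low, high = ngram_range
    if low < 1 or high < low:
        # Local sanity check (assumption): a range must be ordered and positive.
        raise ValueError("ngram_range must satisfy 1 <= low <= high")
    return {
        "task": "bertopic",
        "remove_stops": remove_stops,
        "stopwords": list(stopwords or []),
        "ngram_range": [low, high],
        "min_topic_size": min_topic_size,
        "num_keywords": num_keywords,
        "ngram_weight": ngram_weight,
    }


default_params = bertopic_params()
bigram_params = bertopic_params(ngram_range=(1, 2), min_topic_size=10)
```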
umaphdbscan
Along with the task, hyperparameters, max_iters, and autotune need to be provided in the header.
# token holds the bearer token obtained when authenticating
header = {'Accept': 'application/json',
          'Content-Type': 'application/json',
          'Authorization': f"Bearer {token}",
          'X-Vyasa-Client': 'layar',
          'X-Vyasa-Data-Providers': 'sandbox.certara.ai',
          'X-Vyasa-Data-Fabric': 'YOUR_FABRIC_ID',
          'X-Vyasa-Advanced-Parameters':
              {"task": "umaphdbscan",
               "max_iters": None,
               "autotune": True,
               "hyperparameters": {"n_neighbors": 2, "n_components": 1, "min_cluster_size": 1}}
          }
max_iters
An integer that determines how many evaluations are done before concluding the clustering project. Defaults to None if not supplied.
autotune
A boolean value that determines whether the preset hyperparameters are used. Defaults to True if not provided.
hyperparameters
A dictionary that contains three values: n_neighbors, n_components, and min_cluster_size.
n_neighbors
An integer greater than one. Increasing this value results in larger clusters.
n_components
An integer value, defaults to 5 if no value is given. Reduces the dimensionality of the embedding. Giving too low a value will result in improper clustering.
min_cluster_size
An integer value, defaults to 10 if no value is given. A smaller value causes small clusters to be created, resulting in more clusters. It's better to increase this value than to decrease it.
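The umaphdbscan parameters can be assembled the same way. The helper umaphdbscan_params below is a hypothetical sketch that applies the defaults stated above (n_components of 5, min_cluster_size of 10, autotune of True, max_iters of None) and checks that n_neighbors is greater than one. No default for n_neighbors is documented here, so it is a required argument in this sketch.

```python
def umaphdbscan_params(n_neighbors, n_components=5, min_cluster_size=10,
                       autotune=True, max_iters=None):
    """Build the advanced-parameters dictionary for the umaphdbscan task."""
    if n_neighbors <= 1:
        # The guide describes n_neighbors as an integer greater than one.
        raise ValueError("n_neighbors must be greater than one")
    return {
        "task": "umaphdbscan",
        "max_iters": max_iters,
        "autotune": autotune,
        "hyperparameters": {
            "n_neighbors": n_neighbors,
            "n_components": n_components,
            "min_cluster_size": min_cluster_size,
        },
    }


umap_params = umaphdbscan_params(n_neighbors=15)
```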
Putting It All Together!
To see how these parameters come together to create a text clustering project using document search, view the guide Text Clustering Project with Document Search.