Text Clustering Parameters

Several parameters are needed to use the clustering endpoint /sourceDocument/startClusterDocsJob. These values must be supplied in the request header. This guide covers what those values are and what they do.

Create Header

The header for this request uses X-Vyasa-Advanced-Parameters, which takes a dictionary of values. Its task value selects the method of text clustering, either bertopic or umaphdbscan. Each task takes its own set of advanced parameters.

`bertopic`

When using the bertopic task, there are a variety of advanced parameters that can be used.

header = {'Accept': 'application/json',
          'Content-Type': 'application/json',
          'Authorization': f"Bearer {token}",
          'X-Vyasa-Client': 'layar',
          'X-Vyasa-Data-Providers': 'sandbox.certara.ai',
          'X-Vyasa-Data-Fabric': 'YOUR_FABRIC_ID',
          'X-Vyasa-Advanced-Parameters':
              {'task': 'bertopic',
               'remove_stops': True,
               'stopwords': [],
               'ngram_range': [1, 1],
               'min_topic_size': 20,
               'num_keywords': 1,
               'ngram_weight': 4}
          }
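HTTP header values are sent as strings, so a nested dictionary usually has to be serialized before a client library will transmit it. The following is a minimal sketch, assuming the service accepts the advanced parameters as a JSON string (this guide does not confirm the encoding). `build_headers` is a hypothetical helper, and the token and fabric ID are placeholders.

```python
import json


def build_headers(token, fabric_id, advanced_params,
                  provider="sandbox.certara.ai"):
    """Return request headers with the advanced parameters JSON-encoded.

    Assumption: the API accepts X-Vyasa-Advanced-Parameters as a JSON string.
    """
    return {
        "Accept": "application/json",
        "Content-Type": "application/json",
        "Authorization": f"Bearer {token}",
        "X-Vyasa-Client": "layar",
        "X-Vyasa-Data-Providers": provider,
        "X-Vyasa-Data-Fabric": fabric_id,
        # Serialize the nested dictionary so the header value is a string.
        "X-Vyasa-Advanced-Parameters": json.dumps(advanced_params),
    }


bertopic_params = {
    "task": "bertopic",
    "remove_stops": True,
    "stopwords": [],
    "ngram_range": [1, 1],
    "min_topic_size": 20,
    "num_keywords": 1,
    "ngram_weight": 4,
}

header = build_headers("YOUR_TOKEN", "YOUR_FABRIC_ID", bertopic_params)
```

Keeping the parameters in their own dictionary makes it easy to switch between the bertopic and umaphdbscan tasks without rewriting the rest of the header.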

`remove_stops`

A boolean value that determines if stop words should be removed when clustering. This defaults to True if not provided.

📘
What Is A Stop Word?
A stop word is an inconsequential/filler word. Examples would be conjunctions like "and" or "but".

`stopwords`

A list of strings that sets specific words to be treated as stop words. This value will only be used if remove_stops is set to False.

📘
Should You Use Custom Stopwords?
It's suggested not to use custom stop words initially. Setting remove_stops to True causes the API to use generic stop words commonly found in the English language.
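To illustrate what stop word removal does to text before clustering, here is a small sketch. The word list and the `remove_stop_words` function are hypothetical and only stand in for whatever list the API applies internally.

```python
# Illustrative only: a tiny hand-picked list, not the list the API uses.
GENERIC_STOP_WORDS = {"and", "but", "or", "the", "a", "an", "of", "to"}


def remove_stop_words(text, stop_words=GENERIC_STOP_WORDS):
    """Drop stop words from whitespace-separated text (case-insensitive)."""
    return [word for word in text.split() if word.lower() not in stop_words]


# Generic list, comparable in spirit to remove_stops = True.
generic = remove_stop_words("The drug was safe and effective but costly")
print(generic)  # ['drug', 'was', 'safe', 'effective', 'costly']

# Custom list, comparable in spirit to supplying stopwords.
custom = remove_stop_words("patients in the study", {"patients", "study"})
print(custom)  # ['in', 'the']
```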

`ngram_range`

A range that determines how the model associates topics with words by setting the minimum and maximum number of words in a term. For example, "New York" is two words but represents a city. If an ngram_range of (1,1) is given, "New" and "York" would be looked at as separate words. Using a range of (1,2) would allow it to see "New York" as one term.
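The effect of the range can be sketched with a few lines of plain Python. `extract_ngrams` is a hypothetical helper that lists every word sequence whose length falls inside the range; it only illustrates the idea and is not the API's tokenizer.

```python
def extract_ngrams(text, ngram_range):
    """Return all word n-grams with a length inside ngram_range (inclusive)."""
    words = text.split()
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        # Slide a window of n words across the text.
        for start in range(len(words) - n + 1):
            grams.append(" ".join(words[start:start + n]))
    return grams


print(extract_ngrams("New York", (1, 1)))  # ['New', 'York']
print(extract_ngrams("New York", (1, 2)))  # ['New', 'York', 'New York']
```

With (1,2), the single words are still present alongside the two-word term, so widening the range adds candidate terms rather than replacing them.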

`min_topic_size`

An integer that determines the minimum size of a topic. Giving a value that is too large can result in no clusters being generated while giving a value too small can result in small irrelevant clusters. This value defaults to 20 if the value is not supplied.

`num_keywords`

An integer that determines the number of keywords used for clustering. Defaults to 1 if not provided and requires a value of 1 or higher. Raising the value causes more keywords to be used for clustering but can result in irrelevant clusters.

`ngram_weight`

An integer that determines how the model chooses N-Grams that are relevant to one another. The value dictates how many adjacent N-Grams are needed in order to determine if text should be in the same cluster. Defaults to 4 if not provided. Increasing the value will result in tighter clustering while a lower value results in looser clustering.
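The bertopic parameters and their documented defaults can be collected in one place. The sketch below uses a hypothetical helper, `make_bertopic_params`, that fills in the defaults stated in this guide and applies the one documented constraint (num_keywords of 1 or higher). No default is documented for stopwords or ngram_range, so those values are assumptions copied from the example header.

```python
import copy

# Defaults as stated in this guide. stopwords and ngram_range have no
# documented default; the values here mirror the example header (assumption).
BERTOPIC_DEFAULTS = {
    "task": "bertopic",
    "remove_stops": True,
    "stopwords": [],
    "ngram_range": [1, 1],
    "min_topic_size": 20,
    "num_keywords": 1,
    "ngram_weight": 4,
}


def make_bertopic_params(**overrides):
    """Merge overrides into the defaults and run basic sanity checks."""
    unknown = set(overrides) - set(BERTOPIC_DEFAULTS)
    if unknown:
        raise ValueError(f"Unknown parameter(s): {sorted(unknown)}")
    params = copy.deepcopy(BERTOPIC_DEFAULTS)
    params.update(overrides)
    if params["num_keywords"] < 1:
        raise ValueError("num_keywords must be 1 or higher")
    low, high = params["ngram_range"]
    if low < 1 or low > high:
        raise ValueError("ngram_range must satisfy 1 <= min <= max")
    return params


params = make_bertopic_params(ngram_range=[1, 2], min_topic_size=50)
```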

`umaphdbscan`

Along with the task, hyperparameters, max_iters and autotune need to be provided in the header.

header = {'Accept': 'application/json',
          'Content-Type': 'application/json',
          'Authorization': f"Bearer {token}",
          'X-Vyasa-Client': 'layar',
          'X-Vyasa-Data-Providers': 'sandbox.certara.ai',
          'X-Vyasa-Data-Fabric': 'YOUR_FABRIC_ID',
          'X-Vyasa-Advanced-Parameters':
              {"task": "umaphdbscan",
               "max_iters": None,
               "autotune": True,
               "hyperparameters": {"n_neighbors": 2, "n_components": 1, "min_cluster_size": 1}}
          }

`max_iters`

An integer, defaults to None if not supplied. Determines how many evaluations are done before concluding the clustering project.

`autotune`

A boolean value that determines if the preset hyperparameters are used. Defaults to True if not provided.

`hyperparameters`

A dictionary that contains three values: n_neighbors, n_components and min_cluster_size.

`n_neighbors`

An integer greater than one. Increasing this will result in larger cluster sizes.

`n_components`

An integer value, defaults to 5 if no value is given. Reduces the dimensionality of the embedding. Giving too low a value will result in improper clustering.

`min_cluster_size`

An integer value, defaults to 10 if no value is given. A smaller value causes small clusters to be created, resulting in more clusters. It's better to increase this value than to decrease it.
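The umaphdbscan parameters can be assembled the same way. The sketch below uses a hypothetical helper, `make_umaphdbscan_params`, with the defaults stated in this guide. No default is documented for n_neighbors, so it is a required argument here, and the value 15 in the usage line is an arbitrary illustration.

```python
def make_umaphdbscan_params(n_neighbors, n_components=5, min_cluster_size=10,
                            max_iters=None, autotune=True):
    """Build the umaphdbscan advanced parameters using documented defaults."""
    if n_neighbors <= 1:
        raise ValueError("n_neighbors must be an integer greater than one")
    return {
        "task": "umaphdbscan",
        "max_iters": max_iters,
        "autotune": autotune,
        "hyperparameters": {
            "n_neighbors": n_neighbors,
            "n_components": n_components,
            "min_cluster_size": min_cluster_size,
        },
    }


# n_neighbors=15 is an arbitrary example value, not a recommendation.
umap_params = make_umaphdbscan_params(n_neighbors=15, autotune=False)
```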

📘
Putting It All Together!
If you are interested in seeing how these parameters come together to create a text clustering project using document search, view the guide Text Clustering Project with Document Search.