Training a Classification Model

Introduction

Before a model can be used, it must be trained. The endpoint differs depending on the type of model you want to train.

The header for the request is as follows.

header = {'Accept': 'application/json',
          'Content-Type': 'application/json',
          'Authorization': f"Bearer {token}",
          'X-Vyasa-Client': 'curate',
          'X-Vyasa-Data-Fabric': 'YOUR_FABRIC_ID'
         }
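As a minimal sketch, the header can be wrapped in a small helper so the token and fabric ID are supplied in one place. The function name `build_header` and the environment variable names `LAYAR_TOKEN` and `LAYAR_FABRIC_ID` are assumptions made for illustration; they are not part of the API.

```python
import os


def build_header(token: str, fabric_id: str) -> dict:
    """Return the request headers shown above for a given token and fabric ID."""
    if not token:
        raise ValueError("token must not be empty")
    return {
        'Accept': 'application/json',
        'Content-Type': 'application/json',
        'Authorization': f"Bearer {token}",
        'X-Vyasa-Client': 'curate',
        'X-Vyasa-Data-Fabric': fabric_id,
    }


# LAYAR_TOKEN and LAYAR_FABRIC_ID are hypothetical variable names (assumption).
header = build_header(os.environ.get("LAYAR_TOKEN", "example-token"),
                      os.environ.get("LAYAR_FABRIC_ID", "YOUR_FABRIC_ID"))
```

Keeping the token out of the source file in this way avoids committing credentials by accident.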

General Parameters

The body of the training request must provide the parameters used to train the model.

`bucket`

An integer denoting the number of buckets the training material is split into. A higher number results in faster training. Defaults to 2000000.

`dim`

An integer denoting the number of dimensions of the word vectors. A greater value results in longer training time but offers greater depth of classification if the objects being classified have many properties. Defaults to 100.

`epoch`

An integer denoting how many epochs the training material goes through. Defaults to 5.

`loss`

A string that determines the loss function used when tuning the model. Valid values are ns, hs, softmax, and ova.

`lr`

A float denoting the learning rate. A higher value results in faster training but may affect accuracy. Experimenting with this value is recommended. Defaults to 0.1.

`lrUpdateRate`

An integer that determines the rate at which the learning rate is updated. A lower value results in faster training but can cause incorrect results. Defaults to 100.

`maxn`

An integer that sets the maximum character n-gram length, which determines the maximum length of the output string. Defaults to 0.

`minn`

An integer that sets the minimum character n-gram length, which determines the minimum length of the output string. Defaults to 0.

`minCount`

An integer that determines the minimum length of a word for the model to consider it relevant. Defaults to 1.

`minCountLabel`

An integer that determines the minimum number of labels to use when training the model. Defaults to 1.

`neg`

An integer that determines how many negatives are sampled when training the model. A higher value results in longer training time but can improve accuracy when the model is given incorrect data. Defaults to 5.

`rebalance`

A boolean that determines whether the model is rebalanced during training. This can result in greater accuracy but longer training time. Defaults to false.

`t`

A float that sets the threshold used during training to decide whether an output is relevant. A higher threshold can result in greater accuracy, but too high a value can result in incorrect outputs. Defaults to 0.0001.

`testRatio`

A float that determines the ratio of test data to training data. This value can be used to measure how efficient the model is. Defaults to 0.1.

`wordNgrams`

An integer that determines the maximum word n-gram length. A value that is too large or too small can cause incorrect output. Defaults to 1.

`ws`

An integer that determines the size of the context window. A value that is too large or too small can cause incorrect output. Defaults to 5.

`modelName`

A string that sets the name of the model. This value must be provided by the user making the API request.

`trainingFile`

A dictionary that contains id, labelKey, and textKey.

`id`

The document ID string of the spreadsheet containing the training data.

`labelKey`

A string denoting the label column in the desired training data.

`textKey`

A string denoting the text column in the desired training data.

📘
Column Information
The textKey and labelKey values will look similar to "column_1_string". To find the proper column keys, use the statement search endpoint to find the column key guide.

Model Training Request Body

The body for the training request looks as follows.

body = {
    'bucket': 2000000,
    'dim': 100,
    'epoch': 5,
    'loss': "softmax",
    'lr': 0.1,
    'lrUpdateRate': 100,
    'maxn': 0,
    'minCount': 1,
    'minCountLabel': 1,
    'minn': 0,
    'modelName': "Test model",
    'neg': 5,
    'rebalance': False,  # Python boolean; serialized to JSON false
    't': 0.0001,
    'testRatio': 0.1,
    # The spreadsheet holding the training data, with its label and text columns
    'trainingFile': {
        'labelKey': "column_2_long", 'id': "AZK6XnKaisXHr_SV5Vgv", 'textKey': "column_1_string"
    },
    'wordNgrams': 1,
    'ws': 5
}
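Because most parameters have defaults, the body can also be assembled from a table of defaults plus a few overrides. The sketch below is one possible way to do that; `build_training_body`, `DEFAULTS`, and `VALID_LOSSES` are hypothetical names introduced here, and since no default is documented for `loss`, the value from the example body (softmax) is assumed.

```python
# Defaults copied from the parameter descriptions in this guide.
# 'loss' has no documented default; "softmax" from the example body is an assumption.
DEFAULTS = {
    'bucket': 2000000, 'dim': 100, 'epoch': 5, 'loss': 'softmax',
    'lr': 0.1, 'lrUpdateRate': 100, 'maxn': 0, 'minn': 0,
    'minCount': 1, 'minCountLabel': 1, 'neg': 5, 'rebalance': False,
    't': 0.0001, 'testRatio': 0.1, 'wordNgrams': 1, 'ws': 5,
}
VALID_LOSSES = {'ns', 'hs', 'softmax', 'ova'}


def build_training_body(model_name, file_id, text_key, label_key, **overrides):
    """Merge overrides into the defaults and attach modelName and trainingFile."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        # Catches casing mistakes such as 'mincount' instead of 'minCount'.
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    body = {**DEFAULTS, **overrides}
    if body['loss'] not in VALID_LOSSES:
        raise ValueError(f"loss must be one of {sorted(VALID_LOSSES)}")
    body['modelName'] = model_name
    body['trainingFile'] = {'id': file_id, 'textKey': text_key, 'labelKey': label_key}
    return body


body = build_training_body("Test model", "AZK6XnKaisXHr_SV5Vgv",
                           "column_1_string", "column_2_long", epoch=10)
```

Rejecting unknown keys on the client side is a design choice of this sketch, not documented behavior of the endpoint.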

POST Train Model Request

Now that we have the body, we can post the request and pull the modelId and projectId from the response.

import requests

# envUrl is the base URL of your Layar environment
modelTrainingUrl = f'{envUrl}/layar/text/classify/train'

response = requests.post(modelTrainingUrl,
                         headers = header,
                         json = body).json()

projectId = response.get("projectId")
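Since `response.get("projectId")` silently returns None when the key is absent, it can help to fail early instead. The following is a minimal sketch; `extract_ids` is a hypothetical helper, and the `modelId` key name is assumed from the prose above rather than confirmed by a sample response.

```python
def extract_ids(payload: dict) -> tuple:
    """Return (projectId, modelId) from the parsed training response.

    Raises ValueError when projectId is missing, for example when the
    server returned an error body instead of a training result.
    """
    project_id = payload.get("projectId")
    if project_id is None:
        raise ValueError(f"no projectId in response: {payload}")
    # 'modelId' is an assumed key name; it is returned as None when absent.
    return project_id, payload.get("modelId")


# Example with a made-up payload; a real one comes from requests.post(...).json()
projectId, modelId = extract_ids({"projectId": "example-project", "modelId": "example-model"})
```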

Publish The Model

Once we have the projectId, we can use it to publish the model through the layar/model/{projectId} endpoint.

modelUrl = f'{envUrl}/layar/model/{projectId}'

# requests sends a JSON payload through the json keyword argument
response = requests.put(modelUrl,
                        headers = header,
                        json = {'state': 'PUBLISHED'})

print(response)
# A successful response prints <Response [200]>
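The publish step can be wrapped so that any status other than 200 raises an error rather than passing unnoticed. This is a sketch under assumptions: `publish_model` and its `put` parameter are hypothetical names, and the `put` argument exists only so the helper can be tried without a live server.

```python
def publish_model(env_url, project_id, header, put=None):
    """PUT the PUBLISHED state to layar/model/{projectId} and check for status 200."""
    if put is None:
        import requests  # imported lazily so a stand-in can be supplied instead
        put = requests.put
    response = put(f'{env_url}/layar/model/{project_id}',
                   headers=header,
                   json={'state': 'PUBLISHED'})
    if response.status_code != 200:
        raise RuntimeError(f"publish failed with status {response.status_code}")
    return response
```

With a live environment, the call would be `publish_model(envUrl, projectId, header)`.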
