Training a Classification Model
Introduction
A model must be trained before it can be used. The endpoint you call depends on the type of model you want to train; this guide covers classification models.
The requests in this guide use the following header.
header = {'Accept': 'application/json',
'Content-Type': 'application/json',
'Authorization': f"Bearer {token}",
'X-Vyasa-Client': 'curate',
'X-Vyasa-Data-Fabric' : 'YOUR_FABRIC_ID'
}
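Because the same header is sent with every request in this guide, it can be convenient to build it in one place. The following is a minimal sketch; `build_header` is a hypothetical helper name (not part of the Layar API), and the token and fabric ID values are placeholders you would supply yourself.

```python
def build_header(token: str, fabric_id: str) -> dict:
    """Return the request header used throughout this guide.

    build_header is a hypothetical convenience helper, not a Layar function.
    """
    if not token:
        raise ValueError("token must be a non-empty string")
    return {
        "Accept": "application/json",
        "Content-Type": "application/json",
        "Authorization": f"Bearer {token}",
        "X-Vyasa-Client": "curate",
        "X-Vyasa-Data-Fabric": fabric_id,
    }


# Placeholder values; replace them with your own token and fabric ID.
header = build_header("YOUR_TOKEN", "YOUR_FABRIC_ID")
```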
General Parameters
The body of the training request must include the parameters used to train the model.
bucket
An integer that sets the number of buckets the training material is split into. A higher number results in faster training. Defaults to 2000000.
dim
An integer that sets the size (number of dimensions) of the word vectors. A greater value lengthens training but offers greater depth of classification when the objects being classified have many properties. Defaults to 100.
epoch
An integer that sets how many epochs (passes over the training material) are run. Defaults to 5.
loss
A string that selects the loss function used when tuning the model. Must be one of ns, hs, softmax, or ova.
lr
A float that sets the learning rate. A higher value results in faster training but may affect accuracy. It is recommended to experiment with this value. Defaults to 0.1.
lrUpdateRate
An integer that determines the rate at which the learning rate is updated. A lower value results in faster training but can cause incorrect results. Defaults to 100.
maxn
An integer that sets the maximum character n-gram length, that is, the longest character sequence taken from a word. Defaults to 0.
minn
An integer that sets the minimum character n-gram length, that is, the shortest character sequence taken from a word. Defaults to 0.
minCount
An integer that sets the minimum number of times a word must occur for the model to consider it relevant. Defaults to 1.
minCountLabel
An integer that sets the minimum number of labels to use when training the model. Defaults to 1.
neg
An integer that sets how many negatives are sampled when training the model. A higher value lengthens training but can improve accuracy when the model is given incorrect data. Defaults to 5.
rebalance
A boolean that determines whether the model is rebalanced during training. Rebalancing can improve accuracy but lengthens training. Defaults to false.
t
A float that sets the threshold used during training to decide whether an output is relevant. A higher threshold can improve accuracy, but too high a value can produce incorrect outputs. Defaults to 0.0001.
testRatio
A float that sets the ratio of test data to training data. The test data can be used to measure how well the model performs. Defaults to 0.1.
wordNgrams
An integer that sets the maximum word n-gram length. A value that is too large or too small can cause incorrect output. Defaults to 1.
ws
An integer that sets the size of the context window. A value that is too large or too small can cause incorrect output. Defaults to 5.
modelName
A string that sets the name of the model. It must be provided by the user making the API request.
trainingFile
A dictionary that contains id, labelKey, and textKey.
id
The document ID string of the spreadsheet containing the training data.
labelKey
A string denoting the label column in the desired training data.
textKey
A string denoting the text column in the desired training data.
multiLabelView
A boolean that determines whether a row of training data can have multiple classes.
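A few of the constraints described above can be checked locally before a request is sent. The sketch below is illustrative only: `check_training_params` is a hypothetical helper, the allowed loss values and the mandatory modelName come from the descriptions above, and treating every trainingFile key as mandatory and testRatio as a value strictly between 0 and 1 are assumptions rather than documented rules.

```python
ALLOWED_LOSSES = {"ns", "hs", "softmax", "ova"}


def check_training_params(params: dict) -> list:
    """Return a list of problems found in a training body (empty if none)."""
    problems = []
    # modelName must be provided by the caller.
    if not params.get("modelName"):
        problems.append("modelName is required")
    # loss, when given, must be one of the documented values.
    loss = params.get("loss")
    if loss is not None and loss not in ALLOWED_LOSSES:
        problems.append(f"loss must be one of {sorted(ALLOWED_LOSSES)}")
    # Assumption: all three trainingFile keys are mandatory.
    training_file = params.get("trainingFile") or {}
    for key in ("id", "labelKey", "textKey"):
        if not training_file.get(key):
            problems.append(f"trainingFile.{key} is required")
    # Assumption: a ratio only makes sense strictly between 0 and 1.
    test_ratio = params.get("testRatio")
    if test_ratio is not None and not 0 < test_ratio < 1:
        problems.append("testRatio should be between 0 and 1")
    return problems
```

An empty list means none of these checks failed; it does not guarantee that the service will accept the request.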
Multi-Class Training
In the column that contains the classes for the training data, separate each class with a pipe (|).
For example, if you want a row of training data to have 2 classes, the cell in the class column should look like this: class1|class2
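When the training spreadsheet is generated programmatically, the pipe-separated cell can be built and read back with two small functions. This is a sketch under stated assumptions: `join_classes` and `split_classes` are hypothetical names, and trimming surrounding whitespace is a choice made here, not documented service behaviour.

```python
def join_classes(classes):
    """Build the cell value for a row that belongs to several classes."""
    cleaned = [c.strip() for c in classes if c.strip()]
    for name in cleaned:
        # A pipe inside a class name would be read as a separator.
        if "|" in name:
            raise ValueError(f"class name may not contain '|': {name!r}")
    return "|".join(cleaned)


def split_classes(cell):
    """Split a pipe-separated cell back into its class names."""
    return [part.strip() for part in cell.split("|") if part.strip()]
```

For example, `join_classes(["class1", "class2"])` produces the cell value class1|class2 shown above.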
Column Information
The textKey and labelKey values will look similar to "column_1_string". To find the proper column keys, use the statement search endpoint to find the column key guide.
Model Training Request Body
The body for the training request would look as follows.
body = {
    'multiLabelView': True,
    'bucket': 2000000,
    'dim': 100,
    'epoch': 5,
    'loss': "softmax",
    'lr': 0.1,
    'lrUpdateRate': 100,
    'maxn': 0,
    'minCount': 1,
    'minCountLabel': 1,
    'minn': 0,
    'modelName': "Test model",
    'neg': 5,
    'rebalance': False,  # Python boolean; serialized as false in the JSON body
    't': 0.0001,
    'testRatio': 0.1,
    'trainingFile': {
        'labelKey': "column_2_long",
        'id': "AZK6XnKaisXHr_SV5Vgv",
        'textKey': "column_1_string"
    },
    'wordNgrams': 1,
    'ws': 5
}
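If you train several models, repeating the full body each time is error-prone. One possible approach, sketched below, is to start from the documented defaults and override only what changes. `build_training_body` is a hypothetical helper; the key names follow the example body, `loss="softmax"` is taken from that example (no default is documented for it), and `multi_label=False` is an assumption.

```python
# Defaults as listed under General Parameters; key names follow the example body.
DEFAULTS = {
    "bucket": 2000000, "dim": 100, "epoch": 5, "lr": 0.1,
    "lrUpdateRate": 100, "maxn": 0, "minn": 0, "minCount": 1,
    "minCountLabel": 1, "neg": 5, "rebalance": False, "t": 0.0001,
    "testRatio": 0.1, "wordNgrams": 1, "ws": 5,
}


def build_training_body(model_name, file_id, label_key, text_key,
                        loss="softmax", multi_label=False, **overrides):
    """Merge caller overrides into the documented defaults."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        # Catch misspelled keys such as 'mincount' before sending anything.
        raise KeyError(f"unknown training parameters: {sorted(unknown)}")
    body = dict(DEFAULTS)
    body.update(overrides)
    body.update({
        "modelName": model_name,
        "loss": loss,
        "multiLabelView": multi_label,
        "trainingFile": {
            "id": file_id,
            "labelKey": label_key,
            "textKey": text_key,
        },
    })
    return body
```

Calling `build_training_body("Test model", "AZK6XnKaisXHr_SV5Vgv", "column_2_long", "column_1_string", multi_label=True)` reproduces the example body, and passing `epoch=10` would change only that value.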
POST Train Model Request
Now that we have the body, we can post the request and read the projectId from the response. The response also includes a modelId.
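import requests

# envUrl is the base URL of your environment
modelTrainingUrl = f'{envUrl}/layar/text/classify/train'
response = requests.post(modelTrainingUrl,
                         headers = header,
                         json = body).json()
# The projectId is needed to publish the model in the next step
projectId = response.get("projectId")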
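Because `.get("projectId")` returns None when the key is missing, a failed training request can go unnoticed until the publish step. A small guard such as the one below makes the failure explicit. `extract_project_id` is a hypothetical helper, and what the service actually returns on failure is not documented here.

```python
def extract_project_id(payload: dict) -> str:
    """Return the projectId from a training response, or raise if absent."""
    project_id = payload.get("projectId")
    if not project_id:
        # Include the payload so the caller can see what came back instead.
        raise ValueError(f"no projectId in training response: {payload!r}")
    return project_id
```

With the requests library, calling `raise_for_status()` on the response object before `.json()` surfaces HTTP errors in a similar way.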
Publish The Model
Once we have the projectId, we can use it to publish the model using the layar/model/{projectId} endpoint.
modelUrl = f'{envUrl}/layar/model/{projectId}'
response = requests.put(modelUrl,
                        headers = header,
                        json = {'state': 'PUBLISHED'})
print(response)
# A successful response prints <Response [200]>
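The publish step can also be wrapped in a function that checks the status code rather than printing it. The sketch below is one way to do that under stated assumptions: `publish_model` is a hypothetical helper, and `put` is any callable with the same signature as `requests.put`, passed in so the function can be exercised without a live environment.

```python
def publish_model(env_url, project_id, header, put):
    """Set a trained model's state to PUBLISHED.

    put is expected to behave like requests.put.
    """
    url = f"{env_url.rstrip('/')}/layar/model/{project_id}"
    response = put(url, headers=header, json={"state": "PUBLISHED"})
    # The guide treats status 200 as success.
    if response.status_code != 200:
        raise RuntimeError(f"publish failed with status {response.status_code}")
    return response
```

In practice this would be called as `publish_model(envUrl, projectId, header, requests.put)`.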
