Training a Classification Model
Introduction
Before a model can be used, it must be trained. The endpoint you call depends on the type of model you want to train.
Every request in this guide uses the following header.
header = {'Accept': 'application/json',
          'Content-Type': 'application/json',
          'Authorization': f"Bearer {token}",
          'X-Vyasa-Client': 'curate',
          'X-Vyasa-Data-Fabric': 'YOUR_FABRIC_ID'
          }
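If several requests share these headers, it can help to build them in one place. The following is a minimal sketch: `build_header` is a hypothetical helper name (not part of the API), and the token and fabric ID passed in are placeholder values.

```python
def build_header(token, fabric_id, client="curate"):
    """Return the request headers shown above as a dictionary.

    build_header is an illustrative helper, not part of the API.
    """
    if not token:
        raise ValueError("an access token is required")
    return {
        "Accept": "application/json",
        "Content-Type": "application/json",
        "Authorization": f"Bearer {token}",
        "X-Vyasa-Client": client,
        "X-Vyasa-Data-Fabric": fabric_id,
    }


# Placeholder values; substitute your own token and fabric ID.
header = build_header("example-token", "YOUR_FABRIC_ID")
```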
General Parameters
The body of the training request contains the parameters used to train the model.
bucket
An integer that sets the number of buckets the training material is split into. A higher number results in faster training. Defaults to 2000000.
dim
An integer that sets the size (number of dimensions) of the word vectors. A larger value lengthens training but can give greater classification depth when the objects being classified have many properties. Defaults to 100.
epoch
An integer that sets how many epochs the training material goes through. Defaults to 5.
loss
A string that selects the loss function used when training the model. The value must be one of ns (negative sampling), hs (hierarchical softmax), softmax, or ova (one-vs-all).
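Because only four loss values are accepted, it can be worth checking the value before sending the request. This is a minimal sketch; `check_loss` is a hypothetical helper name, and how the service itself reports an invalid value is not covered here.

```python
# The four loss values listed above.
ALLOWED_LOSS = {"ns", "hs", "softmax", "ova"}


def check_loss(loss):
    """Return loss unchanged if it is an accepted value, else raise ValueError."""
    if loss not in ALLOWED_LOSS:
        allowed = ", ".join(sorted(ALLOWED_LOSS))
        raise ValueError(f"loss must be one of: {allowed} (got {loss!r})")
    return loss


loss = check_loss("softmax")
```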
lr
A float that sets the learning rate. A higher value results in faster training but may affect accuracy. Experimenting with this value is recommended. Defaults to 0.1.
lrUpdateRate
An integer that sets how often the learning rate is updated during training. A lower value results in faster training but can cause incorrect results. Defaults to 100.
maxn
An integer that sets the maximum character n-gram length. Character n-grams longer than this are not generated. Defaults to 0.
minn
An integer that sets the minimum character n-gram length. Character n-grams shorter than this are not generated. Defaults to 0.
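To make the two character n-gram bounds concrete, here is a small illustration of which substrings fall between a minimum and maximum length. It is a sketch only: `char_ngrams` is a hypothetical function, the treatment of 0 as "no character n-grams" is an assumption, and the exact extraction the service performs (for example, any word-boundary markers) is not described in this guide.

```python
def char_ngrams(word, minn, maxn):
    """List every substring of word whose length is between minn and maxn.

    Assumption: a bound of 0 means no character n-grams are produced.
    """
    if minn <= 0 or maxn <= 0:
        return []
    grams = []
    for n in range(minn, maxn + 1):
        for start in range(len(word) - n + 1):
            grams.append(word[start:start + n])
    return grams


print(char_ngrams("model", 2, 3))
# ['mo', 'od', 'de', 'el', 'mod', 'ode', 'del']
print(char_ngrams("model", 0, 0))
# []
```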
minCount
An integer that sets the minimum number of times a word must occur for the model to consider it relevant. Defaults to 1.
minCountLabel
An integer that sets the minimum number of labels to use when training the model. Defaults to 1.
neg
An integer that sets how many negatives are sampled when training the model. A higher value lengthens training but can improve accuracy when the model is given incorrect data. Defaults to 5.
rebalance
A boolean that determines whether the model is rebalanced during training. Rebalancing can improve accuracy but lengthens training. Defaults to false.
t
A float that sets the threshold used during training to decide whether an output is relevant. A higher threshold can improve accuracy, but too high a value can produce incorrect outputs. Defaults to 0.0001.
testRatio
A float that sets the ratio of test data to training data. The test data is used to measure how well the trained model performs. Defaults to 0.1.
wordNgrams
An integer that sets the maximum word n-gram length. A value that is too large or too small can cause incorrect output. Defaults to 1.
ws
An integer that sets the size of the context window. A value that is too large or too small can cause incorrect output. Defaults to 5.
modelName
A string that sets the name of the model. The user making the API request must provide this value.
trainingFile
A dictionary that contains the keys id, labelKey, and textKey.
id
The document ID string of the spreadsheet containing the training data.
labelKey
A string denoting the label column in the desired training data.
textKey
A string denoting the text column in the desired training data.
Column Information
The textKey and labelKey values will look similar to "column_1_string". To find the proper column keys, use the statement search endpoint to look up the column key guide.
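A misspelled key in trainingFile is easy to miss, so a quick check before sending the request may help. The sketch below assumes the keys are matched case-sensitively; `missing_training_file_keys` is a hypothetical helper name, and the ID and column keys are the sample values used in this guide.

```python
# The three keys that trainingFile must contain.
REQUIRED_KEYS = ("id", "labelKey", "textKey")


def missing_training_file_keys(training_file):
    """Return the required keys that are absent or empty in training_file."""
    return [key for key in REQUIRED_KEYS if not training_file.get(key)]


complete = {
    "id": "AZK6XnKaisXHr_SV5Vgv",
    "labelKey": "column_2_long",
    "textKey": "column_1_string",
}
print(missing_training_file_keys(complete))
# []

# A lowercase "textkey" does not satisfy the required "textKey".
incomplete = {"id": "AZK6XnKaisXHr_SV5Vgv", "textkey": "column_1_string"}
print(missing_training_file_keys(incomplete))
# ['labelKey', 'textKey']
```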
Model Training Request Body
The body for the training request would look as follows.
body = {
    'bucket': 2000000,
    'dim': 100,
    'epoch': 5,
    'loss': "softmax",
    'lr': 0.1,
    'lrUpdateRate': 100,
    'maxn': 0,
    'minCount': 1,
    'minCountLabel': 1,
    'minn': 0,
    'modelName': "Test model",
    'neg': 5,
    'rebalance': False,  # Python boolean; sent as JSON false
    't': 0.0001,
    'testRatio': 0.1,
    'trainingFile': {
        'labelKey': "column_2_long",
        'id': "AZK6XnKaisXHr_SV5Vgv",
        'textKey': "column_1_string"
    },
    'wordNgrams': 1,
    'ws': 5
}
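When only a few parameters differ from the defaults, the body can be assembled from a defaults dictionary plus overrides. This is a minimal sketch: `build_training_body` is a hypothetical helper, the default values are the ones listed in this guide, and `softmax` is taken from the example body rather than from a documented default.

```python
import json

# Defaults as listed in this guide ('loss' is taken from the example body).
DEFAULTS = {
    "bucket": 2000000, "dim": 100, "epoch": 5, "loss": "softmax",
    "lr": 0.1, "lrUpdateRate": 100, "maxn": 0, "minCount": 1,
    "minCountLabel": 1, "minn": 0, "neg": 5, "rebalance": False,
    "t": 0.0001, "testRatio": 0.1, "wordNgrams": 1, "ws": 5,
}


def build_training_body(model_name, training_file, **overrides):
    """Merge overrides into the defaults and attach the required fields."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    body = {**DEFAULTS, **overrides}
    body["modelName"] = model_name
    body["trainingFile"] = training_file
    return body


body = build_training_body(
    "Test model",
    {"id": "AZK6XnKaisXHr_SV5Vgv",
     "labelKey": "column_2_long",
     "textKey": "column_1_string"},
    epoch=10,
)
# json.dumps turns the Python False into the JSON literal false.
print(json.dumps(body, indent=2))
```

Rejecting unknown names catches casing mistakes such as `wordngrams` before the request is sent.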
POST Train Model Request
Now that we have the body, we can post the request and pull the modelId and projectId from the response. The example below keeps only the projectId, which is used in the publish URL.
import requests

# envUrl is the base URL of your environment, defined elsewhere
modelTrainingUrl = f'{envUrl}/layar/text/classify/train'
response = requests.post(modelTrainingUrl,
                         headers = header,
                         json = body).json()
projectId = response.get("projectId")
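Since `response.get("projectId")` returns `None` when the key is missing, a failed training request can go unnoticed until the publish step. The sketch below fails early instead. `extract_project_id` is a hypothetical helper, and the sample response dictionary is made up for illustration; the real response may contain other fields.

```python
def extract_project_id(response_json):
    """Return the projectId from a parsed training response, or raise."""
    project_id = response_json.get("projectId")
    if not project_id:
        raise ValueError(f"no projectId in training response: {response_json}")
    return project_id


# Made-up response payload, for illustration only.
sample_response = {"projectId": "abc123"}
projectId = extract_project_id(sample_response)
print(projectId)
# abc123
```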
Publish The Model
Once we have the projectId, we can use it to publish the model using the layar/model/{projectId} endpoint.
modelUrl = f'{envUrl}/layar/model/{projectId}'
response = requests.put(modelUrl,
                        headers = header,
                        json = {'state': 'PUBLISHED'})
print(response)
# A successful response prints <Response [200]>
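The publish URL and payload can also be assembled separately from the network call, which makes them easy to inspect before sending. This is a sketch under assumptions: `build_publish_request` is a hypothetical helper, and `https://example.com` and `abc123` are placeholder values.

```python
def build_publish_request(env_url, project_id):
    """Return the URL and JSON payload for the publish request."""
    if not project_id:
        raise ValueError("a projectId is required to publish a model")
    # Strip a trailing slash so the URL never contains a double slash.
    url = f"{env_url.rstrip('/')}/layar/model/{project_id}"
    payload = {"state": "PUBLISHED"}
    return url, payload


# Placeholder base URL and ID, for illustration only.
url, payload = build_publish_request("https://example.com/", "abc123")
print(url)
# https://example.com/layar/model/abc123
```

The returned values can then be passed to `requests.put(url, headers=header, json=payload)`.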