Submitting a BulkDoc Model Job
Question Answer Jobs
Find the QA Module ID
Find the module ID for the QA models in your instance (the module type is layar_question_answering). Look under the moduleType param for layar_question_answering, then pull the id from that module's object. In this example call, the ID we found was AX9183vTOBLJbFhFZLqn.
import json
import requests
import pandas as pd

# envUrl (instance URL) and token (bearer token) are assumed to be defined already
response = requests.get(f"{envUrl}/layar/module",
    data = json.dumps({
        "q":{
            "moduleType":["layar_question_answering"]}, # QA Module Type (q param is optional)
        "rows":1000,
        "start":0}),
    headers = {
        'accept':'application/json',
        'content-Type':'application/json',
        'authorization':f"Bearer {token}"
    }
)
#response.json() #Uncomment if looking for response only (not in pandas dataframe)
data = response.json() # Comment if not using dataframe
modules = pd.DataFrame.from_dict(data) # Comment if not using dataframe
modules # Comment if not using dataframe
{'moduleType': 'layar_question_answering',
'description': 'Information extraction from unstructured text using natural language question answering',
'enabled': False,
'canEdit': False,
'jobConfig': {'dateIndexed': '2022-03-10T22:29:22.239+0000',
'jobAttributes': {},
'executor': 'VYASA_EXECUTOR_QUEUE'},
'inputTypes': [],
'id': 'AX9183vTOBLJbFhFZLqn',
'name': 'Layar Natural Language Question Answering',
'dateIndexed': '2022-03-10T22:29:25.585+0000',
'datePublished': '2022-03-10T22:29:25.585+0000'}
Below is an example dataframe you can expect using the above call:
| | name | moduleType | id | dateIndexed | datePublished |
|---:|:----------------------------------------------------------|:--------------------------|:---------------------|:-----------------------------|:-----------------------------|
| 0 | ADME | nan | AWK6cp_aSspz4MTBePPu | 2018-04-12T15:21:36.863+0000 | 2018-04-12T15:21:36.863+0000 |
| 1 | BACE Classification | nan | AX9183UfOBLJbFhFZLqj | 2022-03-10T22:29:23.870+0000 | 2022-03-10T22:29:23.870+0000 |
| 2 | BACE Regression | nan | AWK6cp_-Sspz4MTBePPw | 2018-04-12T15:21:36.863+0000 | 2018-04-12T15:21:36.863+0000 |
| 3 | CHEMBL multitask-regression | nan | AWK6cqAPSspz4MTBePPx | 2018-04-12T15:21:36.863+0000 | 2018-04-12T15:21:36.863+0000 |
| 4 | ChemFormer Molecule Generation | megamolbart_generate | AYJsSMGyQ_VWiIwFEMzW | 2022-08-05T04:34:43.741+0000 | 2022-08-05T04:34:43.741+0000 |
| 5 | ChemProp Molecular Property Prediction | chem_prop | AX9183uyOBLJbFhFZLqm | 2022-03-10T22:29:25.552+0000 | 2022-03-10T22:29:25.552+0000 |
| 6 | ChemVector | chem_vector | AWK6cp_ISspz4MTBePPt | 2018-04-12T15:21:36.863+0000 | 2018-04-12T15:21:36.863+0000 |
| 7 | ChemVector 2.0 | tdrug_molecule_generation | AYCVMaGgEjSwCRsYhyyf | 2022-05-05T17:08:19.486+0000 | 2022-05-05T17:08:19.486+0000 |
| 8 | ClinTox | nan | AWK6cqAZSspz4MTBePPy | 2018-04-12T15:21:36.863+0000 | 2018-04-12T15:21:36.863+0000 |
| 9 | CustomScript | custom_script | AWYbxyAJwMCmTxcnnM5k | 2018-09-27T16:05:26.611+0000 | 2018-09-27T16:05:26.611+0000 |
| 10 | ESOL (delaney dataset) | nan | AWTwZ8L5kSKN6xm9zAj2 | 2018-07-31T12:54:49.214+0000 | 2018-07-31T12:54:49.214+0000 |
| 11 | Free Solvation (FreeSolv) regression module | nan | AWK6cqDRSspz4MTBePP_ | 2018-04-12T15:21:36.865+0000 | 2018-04-12T15:21:36.865+0000 |
| 12 | HIV | nan | AWK6cqA7Sspz4MTBePP0 | 2018-04-12T15:21:36.863+0000 | 2018-04-12T15:21:36.863+0000 |
| 13 | HOPV | nan | AWK6cqBMSspz4MTBePP1 | 2018-04-12T15:21:36.863+0000 | 2018-04-12T15:21:36.863+0000 |
| 14 | ImageVec | image_vec | AWNz5-Py6rdBZlV6D1-7 | 2018-05-18T15:39:27.067+0000 | 2018-05-18T15:39:27.067+0000 |
| 15 | ImageVec 2.0 | image_vec_v2 | AWY6fgfjfEokcWOJyYLA | 2018-10-03T15:13:49.590+0000 | 2018-10-03T15:13:49.590+0000 |
| 16 | Kaggle | nan | AWK6cqBWSspz4MTBePP2 | 2018-04-12T15:21:36.863+0000 | 2018-04-12T15:21:36.863+0000 |
| 17 | Layar Natural Language Question Answering | layar_question_answering | AX9183vTOBLJbFhFZLqn | 2022-03-10T22:29:25.585+0000 | 2022-03-10T22:29:25.585+0000 |
| 18 | Layar Text Classification | layar_text_classification | AYKnBVJu7_KUc2Z62iTx | 2022-08-16T14:18:40.087+0000 | 2022-08-16T14:18:40.087+0000 |
| 19 | Layar Text Clustering | TextClustering | AYHXS4c-h7kwf0S5cmUr | 2022-07-07T06:14:20.203+0000 | 2022-07-07T06:14:20.203+0000 |
| 20 | Maximum Unbiased Validation | nan | AWK6cqBiSspz4MTBePP3 | 2018-04-12T15:21:36.863+0000 | 2018-04-12T15:21:36.863+0000 |
| 21 | MedVec | med_vec | AWYbj5OhwMCmTxcnnM4v | 2018-09-27T15:04:46.153+0000 | 2018-09-27T15:04:46.153+0000 |
| 22 | Molecular Property Prediction | tdrug_prop_prediction | AYCVMaDhEjSwCRsYhyye | 2022-05-05T17:08:19.279+0000 | 2022-05-05T17:08:19.279+0000 |
| 23 | Molecule Autoencoder | nan | AWTwZsrlkSKN6xm9zAjZ | 2018-07-31T12:53:45.754+0000 | 2018-07-31T12:53:45.754+0000 |
| 24 | Named Entity Recognition | named_entity_recognition | AXToSk-Kskkh33qDz7vC | 2020-10-02T07:50:18.248+0000 | 2020-10-02T07:50:18.248+0000 |
| 25 | National Cancer Institute (NCI) multitask-regression | nan | AWK6cqB1Sspz4MTBePP4 | 2018-04-12T15:21:36.863+0000 | 2018-04-12T15:21:36.863+0000 |
| 26 | PDBbind Pockets Classification (bbbp) | nan | AWK6cqCPSspz4MTBePP6 | 2018-04-12T15:21:36.864+0000 | 2018-04-12T15:21:36.864+0000 |
| 27 | PDBbind Regression | nan | AWK6cqCcSspz4MTBePP7 | 2018-04-12T15:21:36.865+0000 | 2018-04-12T15:21:36.865+0000 |
| 28 | PubChem BioAssay (PCBA) multitask-classification | nan | AWK6cqCCSspz4MTBePP5 | 2018-04-12T15:21:36.863+0000 | 2018-04-12T15:21:36.863+0000 |
| 29 | QM7 & QM7b multitask-regression | nan | AWK6cqCnSspz4MTBePP8 | 2018-04-12T15:21:36.865+0000 | 2018-04-12T15:21:36.865+0000 |
| 30 | QM8 | nan | AX9183nzOBLJbFhFZLqk | 2022-03-10T22:29:25.106+0000 | 2022-03-10T22:29:25.106+0000 |
| 31 | QM9 | nan | AWK6cqDGSspz4MTBePP- | 2018-04-12T15:21:36.865+0000 | 2018-04-12T15:21:36.865+0000 |
| 32 | Relation Extraction | RelationExtractor | AYC_AzxuW-fZE5hXgsWd | 2022-05-13T20:01:41.974+0000 | 2022-05-13T20:01:41.974+0000 |
| 33 | The Side Effect Resource (SIDER) multitask-classification | nan | AWK6cqDcSspz4MTBePQA | 2018-04-12T15:21:36.865+0000 | 2018-04-12T15:21:36.865+0000 |
| 34 | Tox21 Graph Convolutional Neural Network | tox_21 | AX9183MoOBLJbFhFZLqi | 2022-03-10T22:29:23.341+0000 | 2022-03-10T22:29:23.341+0000 |
| 35 | ToxCast | nan | AX9183r0OBLJbFhFZLql | 2022-03-10T22:29:25.363+0000 | 2022-03-10T22:29:25.363+0000 |
Based on the response above, the "Layar Natural Language Question Answering" module has the module ID AX9183vTOBLJbFhFZLqn.
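Rather than reading the ID off the table by eye, the lookup can be scripted. The sketch below assumes the parsed response is a list of module objects shaped like the example above; find_module_id is a hypothetical helper, not part of the Layar API, and the sample rows are copied from the example table.

```python
def find_module_id(modules, module_type):
    """Return the id of the first module whose moduleType matches, else None."""
    for module in modules:
        if module.get("moduleType") == module_type:
            return module.get("id")
    return None


# Two rows copied from the example response above.
sample_modules = [
    {"name": "Layar Natural Language Question Answering",
     "moduleType": "layar_question_answering",
     "id": "AX9183vTOBLJbFhFZLqn"},
    {"name": "Layar Text Classification",
     "moduleType": "layar_text_classification",
     "id": "AYKnBVJu7_KUc2Z62iTx"},
]

qa_module_id = find_module_id(sample_modules, "layar_question_answering")
print(qa_module_id)
```

With a live response, you would pass response.json() in place of sample_modules, provided it has the list shape assumed here.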
Find the QA Model ID
Get all published models under this module (moduleId: AX9183vTOBLJbFhFZLqn). Unless you have trained a QA model in Curate, you should expect to see only one result. As mentioned before, if you already know the model ID you wish to apply, skip to the Submit BulkDocs QA Job section.
response = requests.get(f"{envUrl}/layar/module/AX9183vTOBLJbFhFZLqn/models", # Specify Module ID
data = json.dumps({
"moduleId":"AX9183vTOBLJbFhFZLqn"}), # Specify Module ID
headers = {
'accept':'application/json',
'content-Type':'application/json',
'authorization':f"Bearer {token}"
}
)
data = response.json()
models = pd.DataFrame.from_dict(data)
models
[{'description': 'SQUADv2 Question Answering Deep Learning Model',
'moduleId': 'AX9183vTOBLJbFhFZLqn',
'displayName': 'BioMegatron SQuAD v2',
'diagnostics': {},
'state': 'PUBLISHED',
'id': 'AX9184SyOBLJbFhFZLqp',
'name': 'biomegatron_squad_onnx',
'dateIndexed': '2022-03-10T22:29:27.856+0000',
'datePublished': '2022-03-10T22:29:27.856+0000'}]
Below is an example dataframe for the response:
| | description | moduleId | displayName | diagnostics | state | id | name | dateIndexed | datePublished |
|---:|:-----------------------------------------------|:---------------------|:---------------------|:--------------|:----------|:---------------------|:-----------------------|:-----------------------------|:-----------------------------|
| 0 | SQUADv2 Question Answering Deep Learning Model | AX9183vTOBLJbFhFZLqn | BioMegatron SQuAD v2 | {} | PUBLISHED | AX9184SyOBLJbFhFZLqp | biomegatron_squad_onnx | 2022-03-10T22:29:27.856+0000 | 2022-03-10T22:29:27.856+0000 |
Look for the id in the response objects. This is each model's modelId, and it is what we use when we submit a QA job for Curate. In this example, the BioMegatron SQuAD v2 model (the default extractive QA model in Curate) has an ID of AX9184SyOBLJbFhFZLqp.
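If the module has several models, it may help to filter on the state field before picking an ID. This is a minimal sketch that assumes the response is a list of model objects like the one shown above; published_model_ids is a hypothetical helper name.

```python
def published_model_ids(models):
    """Map each PUBLISHED model's name to its id."""
    return {m["name"]: m["id"] for m in models if m.get("state") == "PUBLISHED"}


# The single model object from the example response above (trimmed to the
# fields this helper reads).
sample_models = [
    {"displayName": "BioMegatron SQuAD v2",
     "state": "PUBLISHED",
     "id": "AX9184SyOBLJbFhFZLqp",
     "name": "biomegatron_squad_onnx"},
]

qa_model_id = published_model_ids(sample_models)["biomegatron_squad_onnx"]
print(qa_model_id)
```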
Submit BulkDocs QA Job
Ask the question on the Curate Document Set using the /question/startBulkQuestionAnswerJob POST method. We'll ask two questions (be sure to submit two separate calls, one for each question): "What is the disease or indication being studied?" and "Where is the osteonecrosis located on the body?"
response = requests.post(f"{envUrl}/layar/question/startBulkQuestionAnswerJob",
data = json.dumps({
"questionGroupingKey":"AYXweEvEEs8gbQkuEVT1", # Specify the Curate Document Set by Set ID
"bulkQuestions":[{
"questionKey":"Disease", # Column Header that will be visible in Curate
"questionStringVariations":["What is the disease or indication being studied?"], # Question Asked
"fillStrategy": "BEST_CURATION_THEN_BEST_PREDICTIONS", # rank answers in Curate UI: first by curator's answers, followed by probability score
"search":{"paragraphSearchCommand":{}}, # Optional: for QA on Specific Sections e.g. Abstracts, Introduction, Inclusion/Exclusion, etc.
"deepLearningModelId":"AX9184SyOBLJbFhFZLqp", # QA Model to Use (default is Biomegatron SqUAD v2)
"advancedParams":{"null_threshold":0,"chunk_size":5}, # Optional: Advanced QA Inferencing Parameters
"typeOfSearch":"DOCUMENT_AS_PARAGRAPHS","truncateTextToLength":20000}],
"numberOfBestAnswers":9, # Number of Model Predictions to Save to Question Object, ranked by probability (Default: 9)
"sourceDocumentSearchCommand":{
"savedListIds":["AYXweEvEEs8gbQkuEVT1"], # Specify the Curate Document Set by Set ID
"rows":500}}), # Update if more than 500 documents in the set. Limited to 10K rows.
headers = {
'accept':'application/json',
'content-Type':'application/json',
'authorization':f"Bearer {token}",
'X-Vyasa-Client': 'curate'
}
)
jid = response.json()
print(jid)
{'jobId': '179d931a-acff-4756-8938-dbe407585490'}
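Because each question needs its own call, it can be convenient to build the request body from a small function and loop over the questions. The sketch below only builds the payloads and does not send them; build_bulk_qa_payload is a hypothetical helper, and the "Location" question key is an illustrative choice that does not appear in the original tutorial. All other field values are copied from the request above.

```python
import json


def build_bulk_qa_payload(set_id, question_key, question, model_id, rows=500):
    """Build the request body for one startBulkQuestionAnswerJob call."""
    return {
        "questionGroupingKey": set_id,
        "bulkQuestions": [{
            "questionKey": question_key,
            "questionStringVariations": [question],
            "fillStrategy": "BEST_CURATION_THEN_BEST_PREDICTIONS",
            "search": {"paragraphSearchCommand": {}},
            "deepLearningModelId": model_id,
            "advancedParams": {"null_threshold": 0, "chunk_size": 5},
            "typeOfSearch": "DOCUMENT_AS_PARAGRAPHS",
            "truncateTextToLength": 20000,
        }],
        "numberOfBestAnswers": 9,
        "sourceDocumentSearchCommand": {"savedListIds": [set_id], "rows": rows},
    }


# "Location" is a made-up column header for the second question.
questions = {
    "Disease": "What is the disease or indication being studied?",
    "Location": "Where is the osteonecrosis located on the body?",
}

payloads = [
    build_bulk_qa_payload("AYXweEvEEs8gbQkuEVT1", key, text, "AX9184SyOBLJbFhFZLqp")
    for key, text in questions.items()
]

for payload in payloads:
    # Each body would be passed as data=json.dumps(payload) in its own POST.
    print(json.dumps(payload)[:60])
```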
Chained QA Jobs
You can also create "chained" questions using this call, using ${Question_Key} in the questionStringVariations parameter to refer to the primary question(s) your chained question builds on.
As an example, the question above asks "What is the disease or indication being studied?" with a question key of "Disease". This will be your primary question (you can have multiple).
You then feed the answers from that question into a second question using a template. The chained question would be: "What drugs were studied as treatment for ${Disease}?". In this second question, ${Disease} triggers the QA job to populate that placeholder with the answers from the first question:
response = requests.post(f"{envUrl}/layar/question/startBulkQuestionAnswerJob",
data = json.dumps({
"questionGroupingKey":"AYXweEvEEs8gbQkuEVT1", # Specify the Curate Document Set by Set ID
"bulkQuestions":[{
"questionKey":"Chained Question Test", # Column Header that will be visible in Curate
"questionStringVariations":["What drugs were studied as treatment for ${Disease}?"], # Question Asked
"search":{"paragraphSearchCommand":{}}, # Optional: for QA on Specific Sections e.g. Abstracts, Introduction, Inclusion/Exclusion, etc.
"fillStrategy": "BEST_CURATION_THEN_BEST_PREDICTIONS", # rank order for answers in Curate UI: first by curator's answers, followed by probability score
"deepLearningModelId":"AX9184SyOBLJbFhFZLqp", # QA Model to Use (default is Biomegatron SqUAD v2)
"advancedParams":{"null_threshold":0,"chunk_size":5}, # Optional: Advanced QA Inferencing Parameters
"typeOfSearch":"DOCUMENT_AS_PARAGRAPHS","truncateTextToLength":20000}],
"numberOfBestAnswers":9, # Number of Model Predictions to Save to Question Object, ranked by probability (Default: 9)
"sourceDocumentSearchCommand":{
"savedListIds":["AYXweEvEEs8gbQkuEVT1"], # Specify the Curate Document Set by Set ID
"rows":500}}), # Update if more than 500 documents in the set. Limited to 10K rows.
headers = {
'accept':'application/json',
'content-Type':'application/json',
'authorization':f"Bearer {token}",
'X-Vyasa-Client': 'curate'
}
)
response.json()
Note: If users have already curated answers for the primary question when you make the chained question call, those curated answers are used as the inputs for the chained question, in lieu of all of the model's predictions.
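To make the placeholder idea concrete, the sketch below mimics the substitution locally with Python's string.Template, which happens to use the same ${name} syntax. This only illustrates the concept; the actual substitution happens server-side and its implementation is not described here. The sample answers are made up.

```python
from string import Template


def expand_chained_question(template, question_key, answers):
    """Return one question string per answer to the primary question."""
    return [Template(template).substitute({question_key: answer}) for answer in answers]


# Hypothetical answers to the primary "Disease" question.
disease_answers = ["osteonecrosis", "multiple myeloma"]

chained = expand_chained_question(
    "What drugs were studied as treatment for ${Disease}?",
    "Disease",
    disease_answers,
)
for question in chained:
    print(question)
```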
Retrieve Job Status
Job Queueing Status
If you have a lot of documents (over 100), you may want to log the queueing events for the job. Otherwise, you can skip this and proceed to the next job status call.
response = requests.get(f"{envUrl}/layar/event",
data = json.dumps({
"subjectId": "c22b2f98-baed-411f-ae6d-dc0aa44a7610",
"rows": "50",
"start": "0"}),
headers = {
'accept':'application/json',
'content-Type':'application/json',
'authorization':f"Bearer {token}",
'X-Vyasa-Client': 'layar'
}
)
pd.set_option('display.max_rows', 100) # Change the setting for pandas df if you want to view more rows than what's hidden in the notebook!
status = response.json()
job_status = pd.DataFrame.from_dict(status)
job_status
Below is an example dataframe of the response:
| | subjectId | userId | message | attributes | id | dateIndexed | datePublished | projectId |
|---:|:-------------------------------------|---------:|:--------------------------------------------------------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:---------------------|:-----------------------------|:-----------------------------|:-------------------------------------------------------------------------|
| 0 | 179d931a-acff-4756-8938-dbe407585490 | 25036 | done queueing bulk qa job 179d931a-acff-4756-8938-dbe407585490 | {} | AYXwh2r3Es8gbQkuEVZM | 2023-01-27T00:01:24.719+0000 | 2023-01-27T00:01:24.719+0000 | nan |
| 1 | 179d931a-acff-4756-8938-dbe407585490 | 25036 | collecting source docs - found totalDocsMatchingQuery 94 for master-pubmed.vyasa.com, page: 94, totalExpectedQuestions: 94 | {'totalDocsMatchingQuery': 94, 'dataProvider': 'master-pubmed.vyasa.com', 'sizeOfPage': 94, 'totalExpectedQuestion': 94, 'countOfQuestionsToAsk': 1} | AYXwh2OpEs8gbQkuEVZB | 2023-01-27T00:01:22.855+0000 | 2023-01-27T00:01:22.855+0000 | nan |
| 2 | 179d931a-acff-4756-8938-dbe407585490 | 25036 | collecting source docs - found totalDocsMatchingQuery 7 for master-clinicaltrials.vyasa.com, page: 7, totalExpectedQuestions: 7 | {'totalDocsMatchingQuery': 7, 'dataProvider': 'master-clinicaltrials.vyasa.com', 'sizeOfPage': 7, 'totalExpectedQuestion': 7, 'countOfQuestionsToAsk': 1} | AYXwh2LAEs8gbQkuEVY_ | 2023-01-27T00:01:22.622+0000 | 2023-01-27T00:01:22.622+0000 | nan |
| 3 | 179d931a-acff-4756-8938-dbe407585490 | 25036 | collecting source docs - found totalDocsMatchingQuery 11 for master-pmc-oa.vyasa.com, page: 11, totalExpectedQuestions: 11 | {'totalDocsMatchingQuery': 11, 'dataProvider': 'master-pmc-oa.vyasa.com', 'sizeOfPage': 11, 'totalExpectedQuestion': 11, 'countOfQuestionsToAsk': 1} | AYXwh2EmEs8gbQkuEVY9 | 2023-01-27T00:01:22.212+0000 | 2023-01-27T00:01:22.212+0000 | nan |
| 4 | AYXweEvEEs8gbQkuEVT1 | 25036 | Adding What is the disease or indication being studied? to bulk QA run | {'jobId': '179d931a-acff-4756-8938-dbe407585490', 'jobGroupingKey': 'AYXweEvEEs8gbQkuEVT1', 'questionKey': 'What is the disease or indication being studied?'} | AYXwh2B5Es8gbQkuEVY8 | 2023-01-27T00:01:22.039+0000 | 2023-01-27T00:01:22.039+0000 | AYXweEvEEs8gbQkuEVT1::::What is the disease or indication being studied? |
| 5 | 179d931a-acff-4756-8938-dbe407585490 | 25036 | starting to collect source docs - found totalDocsMatchingQuery 112, totalExpectedQuestions: 112 | {'totalDocsMatchingQuery': 112, 'totalExpectedQuestion': 112, 'countOfQuestionsToAsk': 1} | AYXwh2BlEs8gbQkuEVY7 | 2023-01-27T00:01:22.019+0000 | 2023-01-27T00:01:22.019+0000 | nan |
| 6 | 179d931a-acff-4756-8938-dbe407585490 | 25036 | Starting bulk QA run | {'jobGroupingKey': 'AYXweEvEEs8gbQkuEVT1', 'sourceDocumentSearchCommand': {'filterOp': 'AND', 'highlight': False, 'highlightPreTag': '<em>', 'highlightPostTag': '</em>', 'logSearch': True, 'randomize': False, 'savedListIds': ['AYXweEvEEs8gbQkuEVT1'], 'rows': 500, 'start': 0, 'sortOrder': 'asc'}, 'limitToAnnotationPrefix': '', 'hosts': ['master-pubmed.vyasa.com', 'master-pmc-oa.vyasa.com', 'master-clinicaltrials.vyasa.com']} | AYXwh1-xEs8gbQkuEVY5 | 2023-01-27T00:01:21.839+0000 | 2023-01-27T00:01:21.839+0000 | AYXweEvEEs8gbQkuEVT1 |
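The attributes column carries the per-provider document counts, so a short helper can turn the event log into a summary. This sketch assumes the events parse to a list of dicts with an attributes dict, as in the example rows; docs_per_provider is a hypothetical name, and the sample events are trimmed copies of rows 1 to 3 above.

```python
def docs_per_provider(events):
    """Collect totalDocsMatchingQuery for every event that names a dataProvider."""
    counts = {}
    for event in events:
        attrs = event.get("attributes") or {}
        provider = attrs.get("dataProvider")
        if provider is not None:
            counts[provider] = attrs.get("totalDocsMatchingQuery", 0)
    return counts


sample_events = [
    {"message": "done queueing bulk qa job", "attributes": {}},
    {"attributes": {"totalDocsMatchingQuery": 94,
                    "dataProvider": "master-pubmed.vyasa.com"}},
    {"attributes": {"totalDocsMatchingQuery": 7,
                    "dataProvider": "master-clinicaltrials.vyasa.com"}},
    {"attributes": {"totalDocsMatchingQuery": 11,
                    "dataProvider": "master-pmc-oa.vyasa.com"}},
]

per_provider = docs_per_provider(sample_events)
print(per_provider)
print(sum(per_provider.values()))
```

In the example log, the per-provider counts add up to the totalDocsMatchingQuery of 112 reported in row 5.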
QA Job Completion
To monitor the status of the QA inferencing (as you would in the Curate software), use the question/batch/search endpoint. The response will show you:
questionCount: The total number of questions in this job. This is the sum of queued and completed.
questionsQueued: The questions yet to be answered (run through QA).
questionsCompleted: Should always be the sum of answered, skipped, and failed.
questionsAnswered: The questions that were successfully submitted to the QA service and received a response (not necessarily an answer).
questionsSkipped: There are two scenarios where a question may be skipped:
- The job batch was cancelled, at which point any questions that have been queued for processing are removed and skipped.
- The question has already been asked. A question counts as already asked when its typeOfSearch, queryString, batchGroupingKey, questionKey, and singleDocQuestionDocumentId are all the same as those of a prior question.
questionsFailed: Questions where an error occurred somewhere in the pipeline of either (1) submitting that document for QA or (2) applying QA to that document. For example, a server being down at the original document's storage location can prevent it from being queued for processing.
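The relationships between these counters can be checked in code, which is handy when logging progress. The sketch below takes one batch object and verifies the two sums described above; batch_counts_consistent and fraction_complete are hypothetical helpers, and the sample numbers are made up for illustration.

```python
def batch_counts_consistent(batch):
    """Check the two sums described in the field list."""
    completed_ok = batch["questionsCompleted"] == (
        batch["questionsAnswered"]
        + batch["questionsSkipped"]
        + batch["questionsFailed"]
    )
    total_ok = batch["questionCount"] == (
        batch["questionsQueued"] + batch["questionsCompleted"]
    )
    return completed_ok and total_ok


def fraction_complete(batch):
    """Share of questions that have left the queue (0.0 when the job is empty)."""
    if batch["questionCount"] == 0:
        return 0.0
    return batch["questionsCompleted"] / batch["questionCount"]


# Made-up counters for a job that is partway through.
sample_batch = {
    "questionCount": 112,
    "questionsQueued": 66,
    "questionsCompleted": 46,
    "questionsAnswered": 45,
    "questionsSkipped": 1,
    "questionsFailed": 0,
}

print(batch_counts_consistent(sample_batch))
print(round(fraction_complete(sample_batch), 2))
```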
Below we use the time.sleep function to poll for how many documents remain in the queue, so you can generate a log as the QA job moves through the document set.
import time
from time import sleep

ids = jid['jobId'] # This is the job ID from the response when submitting a Bulk QA job
sec = 5 # Seconds to wait between polls
rounds = 0 # Number of polls made so far
total_sec = 0 # Seconds spent waiting between polls
while True:
    rounds = rounds + 1
    print(f"Iteration {rounds}. Timestamp: ", time.ctime())
    response = requests.post(f"{envUrl}/layar/question/batch/search",
        data = json.dumps({
            "rows":"500",
            "start":"0",
            "savedListIds":["AYXweEvEEs8gbQkuEVT1"], # Note the set ID
            "jobIds": [ids]}),
        headers = {
            'accept':'application/json',
            'content-Type':'application/json',
            'authorization':f"Bearer {token}",
            'X-Vyasa-Client': 'layar'
        }
    )
    batch = response.json()[0]
    print(f"Number of Docs Remaining: {batch['questionsQueued']}")
    if batch["questionsQueued"] == 0:
        break
    sleep(sec)
    total_sec = total_sec + sec
print("Job Complete. Timestamp: ", time.ctime(), f"Total Number of Iterations: {rounds}. Total Number of Seconds: {total_sec}")
Print lines should work their way down through the number of docs remaining in the queue:
Iteration 1. Timestamp: Thu Jan 26 19:01:42 2023
Number of Docs Remaining: 66
Iteration 2. Timestamp: Thu Jan 26 19:01:47 2023
Number of Docs Remaining: 32
Iteration 3. Timestamp: Thu Jan 26 19:01:52 2023
Number of Docs Remaining: 0
Job Complete. Timestamp: Thu Jan 26 19:01:53 2023 Total Number of Iterations: 3. Total Number of Seconds: 10
Short Jobs Will Likely Not Return Logs
If you're making these calls in the tutorial on a small set of documents (fewer than 100 docs), the bulk QA job may already be complete, in which case this call will not return anything.
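A polling loop with no upper bound can hang if a job stalls, so one option is to wrap the wait in a function with a timeout. The sketch below is one way to do that under stated assumptions: wait_for_queue_to_drain is a hypothetical helper, fetch_queued stands in for whatever call returns the current questionsQueued value, and the fake counts replace real requests so the example runs on its own.

```python
import time


def wait_for_queue_to_drain(fetch_queued, interval=5, timeout=600, sleep=time.sleep):
    """Poll fetch_queued() until it returns 0 (True) or the timeout is reached (False)."""
    waited = 0
    while True:
        remaining = fetch_queued()
        if remaining == 0:
            return True
        if waited >= timeout:
            return False
        sleep(interval)
        waited += interval


# Stand-in for the real question/batch/search request: three polls, then done.
fake_counts = iter([66, 32, 0])
finished = wait_for_queue_to_drain(lambda: next(fake_counts), sleep=lambda s: None)
print(finished)
```

Passing sleep as a parameter keeps the helper easy to test; in real use you would leave it at the default time.sleep.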
Classification Jobs
Find the Classification Module ID
Find the module ID for the classification models in your instance (the module type is layar_text_classification). Based on the response above, the "Layar Text Classification" module has the module ID AYKnBVJu7_KUc2Z62iTx.
response = requests.get(f"{envUrl}/layar/module",
data = json.dumps({
"q":{
"moduleType":["layar_text_classification"]}, # Classification Module Type (optional)
"rows":1000,
"start":0}),
headers = {
'accept':'application/json',
'content-Type':'application/json',
'authorization':f"Bearer {token}"
}
)
data = response.json()
modules = pd.DataFrame.from_dict(data)
modules = modules.drop(['description', 'jobConfig', 'enabled', 'canEdit', 'inputTypes'], axis=1)
modules
{'moduleType': 'layar_text_classification',
'description': 'Vectorize strings and classifying by object type.',
'enabled': True,
'canEdit': False,
'jobConfig': {'dateIndexed': '2022-08-16T14:18:39.401+0000',
'jobAttributes': {},
'executor': 'VYASA_EXECUTOR_QUEUE'},
'inputTypes': ['DOCUMENT'],
'id': 'AYKnBVJu7_KUc2Z62iTx',
'name': 'Layar Text Classification',
'dateIndexed': '2022-08-16T14:18:40.087+0000',
'datePublished': '2022-08-16T14:18:40.087+0000'}
The example dataframe will be the same as the one produced when using this call for the layar_question_answering module.
| | name | moduleType | id | dateIndexed | datePublished |
|---:|:----------------------------------------------------------|:--------------------------|:---------------------|:-----------------------------|:-----------------------------|
| 0 | ADME | nan | AWK6cp_aSspz4MTBePPu | 2018-04-12T15:21:36.863+0000 | 2018-04-12T15:21:36.863+0000 |
| 1 | BACE Classification | nan | AX9183UfOBLJbFhFZLqj | 2022-03-10T22:29:23.870+0000 | 2022-03-10T22:29:23.870+0000 |
| 2 | BACE Regression | nan | AWK6cp_-Sspz4MTBePPw | 2018-04-12T15:21:36.863+0000 | 2018-04-12T15:21:36.863+0000 |
| 3 | CHEMBL multitask-regression | nan | AWK6cqAPSspz4MTBePPx | 2018-04-12T15:21:36.863+0000 | 2018-04-12T15:21:36.863+0000 |
| 4 | ChemFormer Molecule Generation | megamolbart_generate | AYJsSMGyQ_VWiIwFEMzW | 2022-08-05T04:34:43.741+0000 | 2022-08-05T04:34:43.741+0000 |
| 5 | ChemProp Molecular Property Prediction | chem_prop | AX9183uyOBLJbFhFZLqm | 2022-03-10T22:29:25.552+0000 | 2022-03-10T22:29:25.552+0000 |
| 6 | ChemVector | chem_vector | AWK6cp_ISspz4MTBePPt | 2018-04-12T15:21:36.863+0000 | 2018-04-12T15:21:36.863+0000 |
| 7 | ChemVector 2.0 | tdrug_molecule_generation | AYCVMaGgEjSwCRsYhyyf | 2022-05-05T17:08:19.486+0000 | 2022-05-05T17:08:19.486+0000 |
| 8 | ClinTox | nan | AWK6cqAZSspz4MTBePPy | 2018-04-12T15:21:36.863+0000 | 2018-04-12T15:21:36.863+0000 |
| 9 | CustomScript | custom_script | AWYbxyAJwMCmTxcnnM5k | 2018-09-27T16:05:26.611+0000 | 2018-09-27T16:05:26.611+0000 |
| 10 | ESOL (delaney dataset) | nan | AWTwZ8L5kSKN6xm9zAj2 | 2018-07-31T12:54:49.214+0000 | 2018-07-31T12:54:49.214+0000 |
| 11 | Free Solvation (FreeSolv) regression module | nan | AWK6cqDRSspz4MTBePP_ | 2018-04-12T15:21:36.865+0000 | 2018-04-12T15:21:36.865+0000 |
| 12 | HIV | nan | AWK6cqA7Sspz4MTBePP0 | 2018-04-12T15:21:36.863+0000 | 2018-04-12T15:21:36.863+0000 |
| 13 | HOPV | nan | AWK6cqBMSspz4MTBePP1 | 2018-04-12T15:21:36.863+0000 | 2018-04-12T15:21:36.863+0000 |
| 14 | ImageVec | image_vec | AWNz5-Py6rdBZlV6D1-7 | 2018-05-18T15:39:27.067+0000 | 2018-05-18T15:39:27.067+0000 |
| 15 | ImageVec 2.0 | image_vec_v2 | AWY6fgfjfEokcWOJyYLA | 2018-10-03T15:13:49.590+0000 | 2018-10-03T15:13:49.590+0000 |
| 16 | Kaggle | nan | AWK6cqBWSspz4MTBePP2 | 2018-04-12T15:21:36.863+0000 | 2018-04-12T15:21:36.863+0000 |
| 17 | Layar Natural Language Question Answering | layar_question_answering | AX9183vTOBLJbFhFZLqn | 2022-03-10T22:29:25.585+0000 | 2022-03-10T22:29:25.585+0000 |
| 18 | Layar Text Classification | layar_text_classification | AYKnBVJu7_KUc2Z62iTx | 2022-08-16T14:18:40.087+0000 | 2022-08-16T14:18:40.087+0000 |
| 19 | Layar Text Clustering | TextClustering | AYHXS4c-h7kwf0S5cmUr | 2022-07-07T06:14:20.203+0000 | 2022-07-07T06:14:20.203+0000 |
| 20 | Maximum Unbiased Validation | nan | AWK6cqBiSspz4MTBePP3 | 2018-04-12T15:21:36.863+0000 | 2018-04-12T15:21:36.863+0000 |
| 21 | MedVec | med_vec | AWYbj5OhwMCmTxcnnM4v | 2018-09-27T15:04:46.153+0000 | 2018-09-27T15:04:46.153+0000 |
| 22 | Molecular Property Prediction | tdrug_prop_prediction | AYCVMaDhEjSwCRsYhyye | 2022-05-05T17:08:19.279+0000 | 2022-05-05T17:08:19.279+0000 |
| 23 | Molecule Autoencoder | nan | AWTwZsrlkSKN6xm9zAjZ | 2018-07-31T12:53:45.754+0000 | 2018-07-31T12:53:45.754+0000 |
| 24 | Named Entity Recognition | named_entity_recognition | AXToSk-Kskkh33qDz7vC | 2020-10-02T07:50:18.248+0000 | 2020-10-02T07:50:18.248+0000 |
| 25 | National Cancer Institute (NCI) multitask-regression | nan | AWK6cqB1Sspz4MTBePP4 | 2018-04-12T15:21:36.863+0000 | 2018-04-12T15:21:36.863+0000 |
| 26 | PDBbind Pockets Classification (bbbp) | nan | AWK6cqCPSspz4MTBePP6 | 2018-04-12T15:21:36.864+0000 | 2018-04-12T15:21:36.864+0000 |
| 27 | PDBbind Regression | nan | AWK6cqCcSspz4MTBePP7 | 2018-04-12T15:21:36.865+0000 | 2018-04-12T15:21:36.865+0000 |
| 28 | PubChem BioAssay (PCBA) multitask-classification | nan | AWK6cqCCSspz4MTBePP5 | 2018-04-12T15:21:36.863+0000 | 2018-04-12T15:21:36.863+0000 |
| 29 | QM7 & QM7b multitask-regression | nan | AWK6cqCnSspz4MTBePP8 | 2018-04-12T15:21:36.865+0000 | 2018-04-12T15:21:36.865+0000 |
| 30 | QM8 | nan | AX9183nzOBLJbFhFZLqk | 2022-03-10T22:29:25.106+0000 | 2022-03-10T22:29:25.106+0000 |
| 31 | QM9 | nan | AWK6cqDGSspz4MTBePP- | 2018-04-12T15:21:36.865+0000 | 2018-04-12T15:21:36.865+0000 |
| 32 | Relation Extraction | RelationExtractor | AYC_AzxuW-fZE5hXgsWd | 2022-05-13T20:01:41.974+0000 | 2022-05-13T20:01:41.974+0000 |
| 33 | The Side Effect Resource (SIDER) multitask-classification | nan | AWK6cqDcSspz4MTBePQA | 2018-04-12T15:21:36.865+0000 | 2018-04-12T15:21:36.865+0000 |
| 34 | Tox21 Graph Convolutional Neural Network | tox_21 | AX9183MoOBLJbFhFZLqi | 2022-03-10T22:29:23.341+0000 | 2022-03-10T22:29:23.341+0000 |
| 35 | ToxCast | nan | AX9183r0OBLJbFhFZLql | 2022-03-10T22:29:25.363+0000 | 2022-03-10T22:29:25.363+0000 |
Find the Model ID
Get all published models under this Layar Text Classification module (moduleId: AYKnBVJu7_KUc2Z62iTx). Look for the id in the response objects. This is each model's model ID, and it is what we use when we submit a classification job for Curate.
response = requests.get(f"{envUrl}/layar/module/AYKnBVJu7_KUc2Z62iTx/models",
data = json.dumps({
"moduleId":"AYKnBVJu7_KUc2Z62iTx"}),
headers = {
'accept':'application/json',
'content-Type':'application/json',
'authorization':f"Bearer {token}",
'X-Vyasa-Data-Fabric' : dataFabric
}
)
data = response.json()
models = pd.DataFrame.from_dict(data)
models
[{'moduleId': 'AYTErANYd9YtLmFge_NC',
'displayName': 'Cancer - Binary Text Classification',
'url': 's3://cortex-resource-data/modules/layar_text_classification/cancer_classification_model.bin',
'diagnostics': {},
'state': 'PUBLISHED',
'id': 'AYTErAaYd9YtLmFge_NN',
'name': 'cancer_binary_classification_model',
'dateIndexed': '2022-11-29T18:35:19.063+0000',
'datePublished': '2022-11-29T18:35:19.063+0000'},
{'moduleId': 'AYTErANYd9YtLmFge_NC',
'displayName': 'Heme - Binary Text Classification',
'url': 's3://cortex-resource-data/modules/layar_text_classification/heme_classification_model.bin',
'diagnostics': {},
'state': 'PUBLISHED',
'id': 'AYTErAZgd9YtLmFge_NM',
'name': 'heme_binary_classification_model',
'dateIndexed': '2022-11-29T18:35:19.006+0000',
'datePublished': '2022-11-29T18:35:19.006+0000'}]
Here's an example dataframe of the response:
| | moduleId | displayName | url | diagnostics | state | id | name | dateIndexed | datePublished |
|---:|:---------------------|:------------------------------------|:--------------------------------------------------------------------------------------------|:--------------|:----------|:---------------------|:-----------------------------------|:-----------------------------|:-----------------------------|
| 0 | AYTErANYd9YtLmFge_NC | Cancer - Binary Text Classification | s3://cortex-resource-data/modules/layar_text_classification/cancer_classification_model.bin | {} | PUBLISHED | AYKnBVTy7_KUc2Z62iTz | cancer_binary_classification_model | 2022-11-29T18:35:19.063+0000 | 2022-11-29T18:35:19.063+0000 |
| 1 | AYTErANYd9YtLmFge_NC | Heme - Binary Text Classification | s3://cortex-resource-data/modules/layar_text_classification/heme_classification_model.bin | {} | PUBLISHED | AYTErAZgd9YtLmFge_NM | heme_binary_classification_model | 2022-11-29T18:35:19.006+0000 | 2022-11-29T18:35:19.006+0000 |
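With more than one classification model, selecting by display name is less error-prone than copying IDs by hand. The sketch below assumes the response is a list of model objects like the dataframe rows above; find_models_by_keyword is a hypothetical helper, and the sample rows use the IDs from the example dataframe.

```python
def find_models_by_keyword(models, keyword):
    """Return {displayName: id} for models whose displayName contains keyword."""
    keyword = keyword.lower()
    return {
        m["displayName"]: m["id"]
        for m in models
        if keyword in m.get("displayName", "").lower()
    }


# Rows from the example dataframe above, trimmed to the fields used here.
sample_models = [
    {"displayName": "Cancer - Binary Text Classification",
     "id": "AYKnBVTy7_KUc2Z62iTz"},
    {"displayName": "Heme - Binary Text Classification",
     "id": "AYTErAZgd9YtLmFge_NM"},
]

print(find_models_by_keyword(sample_models, "heme"))
print(find_models_by_keyword(sample_models, "classification"))
```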
Submit Classification Job
Classifying Text Documents
Specify the ID of the Curate Document Set you wish to run this model on, and remember to set the X-Vyasa-Client header to curate. For this example, we'll run the cancer and heme classification models (submit two separate calls, one for each model).
response = requests.post(f"{envUrl}/layar/text/classify",
    data = json.dumps({
        "search":{
            "savedListIds":["AYXweEvEEs8gbQkuEVT1"], # Curate Set ID
            "dataProviders":["master-pubmed.vyasa.com", "master-pmc-oa.vyasa.com","master-clinicaltrials.vyasa.com"]}, # Data sources used within the set
        "modelId":"AYKnBVTy7_KUc2Z62iTz"}), # Classification model of interest (listed as Cancer in the example dataframe above)
    headers = {
        'accept':'application/json',
        'content-Type':'application/json',
        'authorization':f"Bearer {token}",
        'X-Vyasa-Client': 'curate', # Specify the application as 'curate'
        'X-Vyasa-Data-Fabric' : dataFabric
    }
)
job = response.json()
jid = job['id']
response.json()
{'status': 'RUNNING',
'computationMode': 'EXECUTING',
'projectId': 'AYMe8RXI7_KUc2Z69GQg',
'modelId': 'AYKnBVTy7_KUc2Z62iTz',
'computationParameters': {'modelId': 'AYKnBVTy7_KUc2Z62iTz',
'search': {'dataProviders': ['master-pubmed.vyasa.com',
'master-pmc-oa.vyasa.com',
'master-clinicaltrials.vyasa.com'],
'filterOp': 'AND',
'highlight': False,
'highlightPreTag': '<em>',
'highlightPostTag': '</em>',
'logSearch': True,
'randomize': False,
'savedListIds': ['AYXweEvEEs8gbQkuEVT1'],
'start': 0,
'sortOrder': 'asc'}},
'id': '13707bc3-434e-4d4b-9b55-510cbe70fae2',
'dateIndexed': '2023-01-27T00:04:39.821+0000',
'createdByUser': 25036}
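Since each model needs its own call, the request bodies can be generated in a loop. This sketch only builds the payloads and does not send them; build_classify_payload is a hypothetical helper, and the two model IDs are the ones listed in the example models dataframe earlier in this section.

```python
import json


def build_classify_payload(set_id, model_id, data_providers):
    """Build the request body for one /layar/text/classify call."""
    return {
        "search": {
            "savedListIds": [set_id],
            "dataProviders": list(data_providers),
        },
        "modelId": model_id,
    }


providers = [
    "master-pubmed.vyasa.com",
    "master-pmc-oa.vyasa.com",
    "master-clinicaltrials.vyasa.com",
]

# Model IDs as listed in the example dataframe (Cancer, then Heme).
model_ids = ["AYKnBVTy7_KUc2Z62iTz", "AYTErAZgd9YtLmFge_NM"]

classify_payloads = [
    build_classify_payload("AYXweEvEEs8gbQkuEVT1", model_id, providers)
    for model_id in model_ids
]

for payload in classify_payloads:
    # Each body would be sent as data=json.dumps(payload) in a separate POST.
    print(json.dumps(payload)[:70])
```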
Classifying Spreadsheets
The request will look slightly different for spreadsheets.
response = requests.post(f"{envUrl}/layar/text/classify",
    data = json.dumps({
        "search":{ # You can search for any document/spreadsheet you would like to classify.
            "ids":["AweEvFGvs8gbQkuEVT1"]}, # Document Id
        "modelId":"AYKnBVTy7_KUc2Z62iTz"}), # Classification model of interest (listed as Cancer in the example dataframe above)
    headers = {
        'accept':'application/json',
        'content-Type':'application/json',
        'authorization':f"Bearer {token}",
        'X-Vyasa-Client': 'curate', # Specify the application as 'curate'
        'X-Vyasa-Data-Fabric' : dataFabric
    }
)
job = response.json()
jid = job['id']
response.json()
The id generated as part of this response (and now represented by the variable jid) is the project computation ID. In the above example, the job's ID is 13707bc3-434e-4d4b-9b55-510cbe70fae2.
Retrieve Job Status
The id provided in the response above is the projectComputation's id. You can use it in the following request to check the status of this job. You can put this call on a scheduler to run as often as you like.
response = requests.get(f"{envUrl}/layar/projectComputation/{jid}", # note the job ID, provided in the response of the previous call
data = json.dumps({"id": jid}),
headers = {
'accept':'application/json',
'content-Type':'application/json',
'authorization':f"Bearer {token}",
'X-Vyasa-Client': 'layar',
    'X-Vyasa-Data-Fabric' : dataFabric
}
)
response.json()
## Uncomment if using dataframe:
#data = response.json()
#pd.set_option('display.max_rows', 100) # Change the setting for pandas df if you want to view more rows than what's hidden in the notebook!
#status = pd.DataFrame.from_dict(data)
#status
{'status': 'RUNNING',
'computationMode': 'EXECUTING',
'projectId': 'AYMe8RXI7_KUc2Z69GQg',
'modelId': 'AYKnBVTy7_KUc2Z62iTz',
'computationParameters': {'modelId': 'AYKnBVTy7_KUc2Z62iTz',
'search': {'dataProviders': ['master-pubmed.vyasa.com'],
'filterOp': 'AND',
'highlight': False,
'highlightPreTag': '<em>',
'highlightPostTag': '</em>',
'logSearch': True,
'randomize': False,
'savedListIds': ['AYXweEvEEs8gbQkuEVT1'],
'start': 0,
'sortOrder': 'asc'}},
'id': '28748ef3-1c3e-417b-99dc-a46443b588cc',
'dateIndexed': '2023-01-26T23:58:30.834+0000',
'datePublished': '2023-01-26T23:58:30.834+0000',
'createdByUser': 25036}
Example dataframe:
| | jobId | batchGroupingKey | questionKeys | savedListIds | questionCount | questionsQueued | questionsCompleted | questionsAnswered | questionsSkipped | questionsFailed | id | dateIndexed | datePublished | dateUpdated | requestRaw |
|---:|:-------------------------------------|:---------------------|:-----------------------------------------------------|:-------------------------|----------------:|------------------:|---------------------:|--------------------:|-------------------:|------------------:|:---------------------|:-----------------------------|:-----------------------------|:-----------------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 1 | 13707bc3-434e-4d4b-9b55-510cbe70fae2 | AYXweEvEEs8gbQkuEVT1 | ['Cancer - Binary Text Classification'] | ['AYXweEvEEs8gbQkuEVT1'] | 94 | 87 | 13 | 13 | 0 | 0 | AYXwhMcUEs8gbQkuEVVo | 2023-01-26T23:58:31.699+0000 | 2023-01-26T23:58:31.699+0000 | 2023-01-26T23:58:34.232+0000 | nan |
| 2 | 5ad1f4d2-e987-4935-9886-329b89a4f137 | AYXweEvEEs8gbQkuEVT1 | ['Heme - Binary Text Classification'] | ['AYXweEvEEs8gbQkuEVT1'] | 94 | 0 | 94 | 94 | 0 | 0 | AYXwhFrSEs8gbQkuEVUG | 2023-01-26T23:58:03.985+0000 | 2023-01-26T23:58:03.985+0000 | 2023-01-26T23:58:06.342+0000 | nan |
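If you check the projectComputation repeatedly, a small helper can decide when to stop. The only status value shown in this tutorial is RUNNING, so the sketch below assumes that any other value means the job is no longer executing; that is an assumption, not documented behavior. Both function names are hypothetical, and "SOME_OTHER_STATUS" is a placeholder rather than a real Layar status.

```python
def is_still_running(computation):
    """True while the projectComputation reports the RUNNING status seen above."""
    return computation.get("status") == "RUNNING"


def poll_until_not_running(fetch_computation, max_checks=10):
    """Call fetch_computation() up to max_checks times.

    Returns the first response that is no longer RUNNING, or None if the
    status never changes within max_checks calls.
    """
    for _ in range(max_checks):
        computation = fetch_computation()
        if not is_still_running(computation):
            return computation
    return None


# Stand-in responses; "SOME_OTHER_STATUS" is a placeholder value.
fake_responses = iter([
    {"id": "13707bc3-434e-4d4b-9b55-510cbe70fae2", "status": "RUNNING"},
    {"id": "13707bc3-434e-4d4b-9b55-510cbe70fae2", "status": "SOME_OTHER_STATUS"},
])

final = poll_until_not_running(lambda: next(fake_responses))
print(final["status"])
```

A real loop would wait between checks (for example with time.sleep) or be driven by a scheduler, as suggested above.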
In the next tutorial, you will learn how to retrieve the predictions generated for your job.