ChemProp Molecular Property Prediction
Learn how to create a project, run a model, and predict bioactivity from input SMILES strings
ChemProp, developed at MIT CSAIL, is a message passing neural network (MPNN) that takes SMILES strings as input and predicts molecular properties. We've provided two of the available ChemProp models directly in Layar, and you can use them in your own analyses. A little bit about each model:
Antibiotics Model
Trained to predict the probability that a molecule will inhibit the growth of E. coli. This model was trained on the data from A Deep Learning Approach to Antibiotic Discovery (processed data is here). The model is an ensemble of 20 chemprop models augmented with RDKit features and with optimized hyperparameters.
SARS Model
Trained to predict the probability that a molecule will inhibit the 3CL protease of SARS-CoV. This model was trained on the data from PubChem assay AID1706 (processed data is here). The model is an ensemble of 5 chemprop models augmented with RDKit features.
For this tutorial, we will be using the SARS model.
Tutorial for Python SDK
This tutorial uses the Layar Python SDK. If using the RESTful API, please adjust the script accordingly.
Part 1. Set Up
Import Dependencies
To start, let's import our dependencies.
# add dependencies
from __future__ import print_function
import layar_api
from layar_api.rest import ApiException
from pprint import pprint
import pandas as pd
from io import StringIO
Configure Authentication
Next, we'll want to configure our session with our authentication keys. Copy the following commands and swap out only the strings for base_host, client_id, and client_secret. The base_host is the Layar instance you're working within (e.g. 'demo.vyasa.com'), and the client ID and secret are your provided authentication keys.
To learn how to get your authentication keys, please reference this document or use the recipe below.
# set up your authentication credentials
base_host = 'BASE_URL' # your Layar instance (e.g. 'demo.vyasa.com')
client_id = 'AbcDEfghI3' # example developer API key
client_secret = '1ab23c4De6fGh7Ijkl8mNoPq9' #example developer API secret
# configure oauth access token for authorization
configuration = layar_api.Configuration()
configuration.host = f"https://{base_host}"
configuration.access_token = configuration.fetch_access_token(
    client_id, client_secret)
# Make your life easier for the next task: instantiating APIs!
client = layar_api.ApiClient(configuration)
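Hardcoded keys are easy to leak when a script is shared. One option, sketched below, is to read them from environment variables instead. The variable names LAYAR_BASE_HOST, LAYAR_CLIENT_ID, and LAYAR_CLIENT_SECRET and the helper load_layar_credentials are hypothetical and not part of the Layar SDK.

```python
import os

# Hypothetical environment variable names; rename them to fit your setup.
REQUIRED_VARS = ("LAYAR_BASE_HOST", "LAYAR_CLIENT_ID", "LAYAR_CLIENT_SECRET")


def load_layar_credentials(env=None):
    """Return (base_host, client_id, client_secret) read from the environment.

    Raises KeyError naming every variable that is missing or empty.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return tuple(env[name] for name in REQUIRED_VARS)
```

The three returned values can then stand in for the hardcoded base_host, client_id, and client_secret strings above.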
Instantiate Your APIs
Finally, we'll want to instantiate all the API classes we will be calling in the following steps.
# Instantiate APIs
module = layar_api.ModuleApi(client)
model = layar_api.ModelApi(client)
project = layar_api.ProjectApi(client)
computation = layar_api.ProjectComputationApi(client)
results = layar_api.SourceDocumentApi(client)
Part 2. Create a Project
Find the module ID
Using the search_modules endpoint from the ModuleApi, you will find the module of interest and its respective Layar ID.
Reminder: a module is a collection of deployed model iterations (like a collection of "versions" for a model).
## Find the module ID
rows = 56 # int | the number of rows to return (optional)
start = 0 # int | the start offset for the row (optional); 0 starts at the first result
q = 'ChemProp Molecular Property Prediction' # str | the query string to search for (optional)
try:
    api_response = module.search_modules(rows=rows, start=start, q=q)
    pprint(api_response)
except ApiException as e:
    print("Exception when calling ModuleApi->search_modules: %s\n" % e)
You should expect to see a response body similar to this example below, which shows the module's object and all of the metadata about this module:
[{'date_indexed': datetime.datetime(2021, 6, 26, 8, 9, 30, 806000, tzinfo=tzutc()),
  'description': 'Chemprop is a message passing neural network for molecular '
                 'property prediction.',
  'enabled': True,
  'id': 'AXpHXdm3PTGp3VJdtpu5',
  'input_types': ['TABLE', 'SDF'],
  'module_type': 'chem_prop',
  'name': 'ChemProp Molecular Property Prediction'}]
The value of interest is the id, which is the module's Layar ID. In this case, our module's id is AXpHXdm3PTGp3VJdtpu5.
Layar IDs Are Unique to Each Instance
Each Layar instance assigns its own unique Layar IDs, so a module in example.vyasa.com will have a different Layar ID than the same module in demo.vyasa.com.
Get a list of the published models for this module
Next, we want to find all of the models published under this module, using the search_models_by_module_id endpoint in the ModelApi.
Use the module_id you found in the previous step.
## Find models by module ID
module_id = 'AXpHXdm3PTGp3VJdtpu5' # str | the module's Layar ID from the previous step
try:
    api_response = model.search_models_by_module_id(module_id)
    pprint(api_response)
except ApiException as e:
    print("Exception when calling ModelApi->search_models_by_module_id: %s\n" % e)
You should expect to see a response body similar to the example below, which shows all of the models within the module, as well as the metadata for each model. For ChemProp, you should expect to see two models ("SARS" and "Antibiotics") in your list:
[{'date_indexed': datetime.datetime(2021, 6, 26, 8, 9, 30, 905000, tzinfo=tzutc()),
  'description': None,
  'id': 'AXpHXdoaPTGp3VJdtpu7',
  'input_url': None,
  'module_id': 'AXpHXdm3PTGp3VJdtpu5',
  'name': 'SARS',
  'project_computation_id': None,
  'project_id': None,
  'url': 's3://cortex-resource-data/chem_prod/chemprop_model_sars.pt'},
 {'date_indexed': datetime.datetime(2021, 6, 26, 8, 9, 30, 894000, tzinfo=tzutc()),
  'description': None,
  'id': 'AXpHXdoQPTGp3VJdtpu6',
  'input_url': None,
  'module_id': 'AXpHXdm3PTGp3VJdtpu5',
  'name': 'Antibiotics',
  'project_computation_id': None,
  'project_id': None,
  'url': 's3://cortex-resource-data/chem_prod/chemprop_model_antibiotics.pt'}]
We'll use the SARS model for this tutorial. Save that model's Layar ID (AXpHXdoaPTGp3VJdtpu7), as we will use it later on.
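Instead of copying the ID by hand, you could pick it out of the response by name. The sketch below assumes each returned item exposes id and name attributes, in the same way this tutorial reads document.id and api_response.status elsewhere. The helper find_id_by_name is hypothetical, and the SimpleNamespace objects only stand in for the real response so the snippet runs on its own.

```python
from types import SimpleNamespace


def find_id_by_name(items, name):
    """Return the id of the single item whose name matches exactly."""
    matches = [item.id for item in items if item.name == name]
    if len(matches) != 1:
        raise LookupError(
            f"Expected exactly one item named {name!r}, found {len(matches)}"
        )
    return matches[0]


# Stand-in objects mirroring the two models in the sample response above.
sample_models = [
    SimpleNamespace(id="AXpHXdoaPTGp3VJdtpu7", name="SARS"),
    SimpleNamespace(id="AXpHXdoQPTGp3VJdtpu6", name="Antibiotics"),
]
sars_model_id = find_id_by_name(sample_models, "SARS")
print(sars_model_id)  # AXpHXdoaPTGp3VJdtpu7
```

The same helper would work on the module search results, since those items also carry an id and a name.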
Create a project with the module ID
Finally, we're going to create a project using the create_project endpoint within the ProjectApi. Provide your project with a name and description (optional), and then identify which module you will be using via the module_id.
## create a project with the module ID
body = layar_api.Project(
    name = 'ChemProp Example Project',
    description = 'This is a demo project description.',
    module_id = 'AXpHXdm3PTGp3VJdtpu5')
try:
    # Create a new project
    api_response = project.create_project(body)
    pprint(api_response)
except ApiException as e:
    print("Exception when calling ProjectApi->create_project: %s\n" % e)
You should expect to see a response body similar to this example below, which shows the details of the newly created project object:
{'created_by_user': 25007,
 'date_indexed': None,
 'description': 'This is a demo project description.',
 'id': 'AYMeU0-pW79yjX_cxPsZ',
 'module': None,
 'module_id': 'AXpHXdm3PTGp3VJdtpu5',
 'name': 'ChemProp Example Project'}
Remember this id ('AYMeU0-pW79yjX_cxPsZ'), as we'll need it later. This is the project's Layar ID.
Now, you're ready to start a project computation!
Part 3. Start a Project Computation Job
Provide a list of inputs for ChemProp
When using ChemProp, you'll want to provide n molecules as input, written as SMILES strings (learn more about SMILES strings here).
For this tutorial, we've provided you with a sample list of SMILES strings for small molecules (n=14). This data was pulled from ChEMBL:
# Sample input: 14 small molecules from ChEMBL as SMILES strings
smilesStrings = [
"CCCCCCCCCCCCCCCO",
"O=P([O-])([O-])OC1C(O)C(O)C(O)C(O)C1O",
"O=C1c2ccc(O)cc2OCC1c1ccc(O)cc1O",
"COc1ccc(-c2cc(=O)c3c(O)cc([O-])cc3o2)cc1",
"[O-]c1c(Cl)c(Cl)c(Cl)c(Cl)c1Cl",
"C=C(C)C1CCC2(C1)C(C)=CC(O)CC2C",
"COc1cc2c(cc1O)C1Cc3ccc4c(c3CN1CC2)OCO4",
"[NH3+]C(CC1C=CC(O)CC1)C(=O)[O-]",
"C1CC[NH2+]CC1",
"[O-]c1c(O)c(Cl)cc(Cl)c1Cl",
"[O-][n+]1ccccc1",
"C[NH2+]C(C)C(O)c1ccccc1",
"CC12CCC3c4ccc(O)cc4C(O)CC3C1CCC2O",
"Oc1cc(O)cc(CCc2ccccc2)c1"
]
Convert your list of SMILES strings into a single text string, with each SMILES string separated by a new line.
smilesStringsAsText = '\n'.join(smilesStrings)
Alternatively, you can point to a text file with a list of SMILES strings in the same format (each SMILES string on a new line).
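One way to build the same newline-separated text from a local file is sketched below. The helper smiles_text_from_file is hypothetical (not part of the Layar SDK). It skips blank lines and trims surrounding whitespace, but it does not check that each line is a valid SMILES string.

```python
from pathlib import Path


def smiles_text_from_file(path):
    """Read a text file with one SMILES string per line.

    Returns the non-empty lines joined by newlines, the same shape as
    smilesStringsAsText above.
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    smiles = [line.strip() for line in lines if line.strip()]
    if not smiles:
        raise ValueError(f"No SMILES strings found in {path}")
    return "\n".join(smiles)
```

For example, smilesStringsAsText = smiles_text_from_file("smiles.txt"), where smiles.txt is a placeholder file name.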
Kick off a project computation
Using the run_project endpoint in the ProjectApi, start a computation run on a model within your project's module.
We will be using only one of the models within the ChemProp module (model: SARS). Remember the model ID, AXpHXdoaPTGp3VJdtpu7? We're using it now, along with the Project ID from earlier.
Each model has its own computation_parameters, so check with your model to see what to include in this dictionary. For the SARS model in ChemProp, the only required parameter is the input SMILES strings (smiles_raw).
# Run Project Computation Job
body = layar_api.Body6(
    model_id = 'AXpHXdoaPTGp3VJdtpu7',
    computation_parameters = {
        "smiles_raw": smilesStringsAsText, # input SMILES strings
        "number_molecules": 30, # number of molecules to generate
        "task":"interpolate_smiles" # task in model to perform
    }
)
id = 'AYMeU0-pW79yjX_cxPsZ' # Project ID
try:
    api_response = project.run_project(body, id)
    pprint(api_response)
except ApiException as e:
    print("Exception when calling ProjectApi->run_project: %s\n" % e)
You should expect to see a response that details the computation job. It includes the computation_mode, the computation_parameters used, the computation job's Layar ID (id), and the status of the job at the time of submission.
{'computation_mode': 'EXECUTING',
 'computation_parameters': {'number_molecules': 30,
                            'smiles_raw': 'CCCCCCCCCCCCCCCO\n'
                                          'O=P([O-])([O-])OC1C(O)C(O)C(O)C(O)C1O\n'
                                          'O=C1c2ccc(O)cc2OCC1c1ccc(O)cc1O\n'
                                          'COc1ccc(-c2cc(=O)c3c(O)cc([O-])cc3o2)cc1\n'
                                          '[O-]c1c(Cl)c(Cl)c(Cl)c(Cl)c1Cl\n'
                                          'C=C(C)C1CCC2(C1)C(C)=CC(O)CC2C\n'
                                          'COc1cc2c(cc1O)C1Cc3ccc4c(c3CN1CC2)OCO4\n'
                                          '[NH3+]C(CC1C=CC(O)CC1)C(=O)[O-]\n'
                                          'C1CC[NH2+]CC1\n'
                                          '[O-]c1c(O)c(Cl)cc(Cl)c1Cl\n'
                                          '[O-][n+]1ccccc1\n'
                                          'C[NH2+]C(C)C(O)c1ccccc1\n'
                                          'CC12CCC3c4ccc(O)cc4C(O)CC3C1CCC2O\n'
                                          'Oc1cc(O)cc(CCc2ccccc2)c1',
                            'task': 'interpolate_smiles'},
 'created_by_user': 25007,
 'date_indexed': datetime.datetime(2022, 9, 8, 18, 19, 6, 491000, tzinfo=tzutc()),
 'id': '5d66cdff-2a3b-4a05-88b7-868f6c8043e2',
 'model_id': 'AXpHXdoaPTGp3VJdtpu7',
 'name': None,
 'project_id': 'AYMeU0-pW79yjX_cxPsZ',
 'source_document': None,
 'status': 'RUNNING'}
Save the id as your Project Computation Layar ID. We'll use it to check the status of the job in the next step.
Check the status of the project computation
## Get project computation details
id = '5d66cdff-2a3b-4a05-88b7-868f6c8043e2' # Project Computation ID
try:
    api_response = computation.get_project_computation(id)
    pprint(api_response.status)
except ApiException as e:
    print("Exception when calling ProjectComputationApi->get_project_computation: %s\n" % e)
You'll see that computation job's current status. Once the status of the computation is COMPLETE, proceed to the next step!
'COMPLETE'
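Rather than re-running the status check by hand, you could poll until the job finishes. The sketch below is one way to do that. The helper wait_for_status is hypothetical, the RUNNING and COMPLETE values come from the responses shown in this tutorial, and FAILED is an assumed name for a failure status that you should check against your Layar instance.

```python
import time


def wait_for_status(fetch_status, done="COMPLETE", failed=("FAILED",),
                    interval=5.0, timeout=600.0, sleep=time.sleep):
    """Call fetch_status() repeatedly until it returns `done`.

    Returns the number of polls made. Raises RuntimeError on a status
    listed in `failed` (assumed names) and TimeoutError once `timeout`
    seconds have passed without completion.
    """
    deadline = time.monotonic() + timeout
    polls = 0
    while True:
        status = fetch_status()
        polls += 1
        if status == done:
            return polls
        if status in failed:
            raise RuntimeError(f"Computation ended with status {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still {status!r} after {timeout} seconds")
        sleep(interval)
```

With the objects from this tutorial, the call would look like wait_for_status(lambda: computation.get_project_computation(id).status).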
Part 4. View Results
Get the results
To gather your results, you'll need to find the generated results document. You can find it by using the search_documents endpoint in the SourceDocumentApi.
We'll be using the Project Computation ID (5d66cdff-2a3b-4a05-88b7-868f6c8043e2) to search for your results.
## get the results from the computation
body = layar_api.SourceDocumentSearchCommand(
    project_computation_id = '5d66cdff-2a3b-4a05-88b7-868f6c8043e2',
    source_fields=['id'])
x_vyasa_data_providers = base_host # The instance where the computation job was run
try:
    api_response = results.search_documents(body, x_vyasa_data_providers)
    for document in api_response:
        doc = [document.id]
        print(doc)
    #pprint(api_response)
except ApiException as e:
    print("Exception when calling SourceDocumentApi->search_documents: %s\n" % e)
Since we asked to return only the document's id, you should see just the ID below (rather than the full document object, which is a lot!).
['AYMeU8vBW79yjX_cxPse']
Download full results file
Using the download_document endpoint in the SourceDocumentApi, pass in the results document ID from the previous step to download the file.
We parse the downloaded text with StringIO and pandas to print a readable table, but this is optional!
## download full results file
id = 'AYMeU8vBW79yjX_cxPse' # Results document ID from the previous step
try:
    # Download document by ID
    api_response = results.download_document(id)
    data = api_response.rstrip() # Strip trailing whitespace, including \r\n line endings
    df = pd.read_csv(StringIO(data)) # Parse the CSV text into a DataFrame for a readable table
    print(df)
except ApiException as e:
    print("Exception when calling SourceDocumentApi->download_document: %s\n" % e)
Voila! You should see the following table in your terminal!
    smiles                                    activity
0   CCCCCCCCCCCCCCCO                          0.062236
1   O=P([O-])([O-])OC1C(O)C(O)C(O)C(O)C1O     0.172389
2   O=C1c2ccc(O)cc2OCC1c1ccc(O)cc1O           0.412367
3   COc1ccc(-c2cc(=O)c3c(O)cc([O-])cc3o2)cc1  0.408727
4   [O-]c1c(Cl)c(Cl)c(Cl)c(Cl)c1Cl            0.655236
5   C=C(C)C1CCC2(C1)C(C)=CC(O)CC2C            0.100300
6   COc1cc2c(cc1O)C1Cc3ccc4c(c3CN1CC2)OCO4    0.174929
7   [NH3+]C(CC1C=CC(O)CC1)C(=O)[O-]           0.065650
8   C1CC[NH2+]CC1                             0.067222
9   [O-]c1c(O)c(Cl)cc(Cl)c1Cl                 0.535207
10  [O-][n+]1ccccc1                           0.270997
11  C[NH2+]C(C)C(O)c1ccccc1                   0.113226
12  CC12CCC3c4ccc(O)cc4C(O)CC3C1CCC2O         0.105650
13  Oc1cc(O)cc(CCc2ccccc2)c1                  0.180133
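A common next step is to rank the molecules by predicted activity. The sketch below does this with the standard library only. It assumes the downloaded text is comma-separated with a smiles,activity header, which matches the columns printed above. The helper rank_by_activity and the 0.5 cutoff are illustrative choices, not values recommended by ChemProp or Layar.

```python
import csv
from io import StringIO


def rank_by_activity(csv_text, threshold=0.5):
    """Return (smiles, activity) pairs with activity >= threshold, highest first."""
    reader = csv.DictReader(StringIO(csv_text.strip()))
    rows = [(row["smiles"], float(row["activity"])) for row in reader]
    hits = [pair for pair in rows if pair[1] >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# A few rows copied from the table above, written as CSV text.
sample = """smiles,activity
CCCCCCCCCCCCCCCO,0.062236
[O-]c1c(Cl)c(Cl)c(Cl)c(Cl)c1Cl,0.655236
[O-]c1c(O)c(Cl)cc(Cl)c1Cl,0.535207
C1CC[NH2+]CC1,0.067222
"""
for smiles, activity in rank_by_activity(sample):
    print(smiles, activity)
```

On the full table, only the two chlorinated phenolates (rows 4 and 9) clear a 0.5 cutoff.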
Completed!
You've successfully created a project, used an available model, and generated some outcomes for your analysis. Well done!