
ChemProp Molecular Property Prediction

Learn how to create a project, run a model, and predict bioactivity from a list of input SMILES strings

ChemProp, developed at MIT's CSAIL, is a message passing neural network (MPNN) that takes SMILES strings as input and generates predictions for certain molecular properties. We've provided two of the available ChemProp models directly in Layar, and they can be used in your own analyses as well! A little bit about each model:

Antibiotics Model
Trained to predict the probability that a molecule will inhibit the growth of E. coli. This model was trained on the data from A Deep Learning Approach to Antibiotic Discovery (processed data is here). The model is an ensemble of 20 chemprop models augmented with RDKit features and with optimized hyperparameters.

SARS Model
Trained to predict the probability that a molecule will inhibit the 3CL protease of SARS-CoV. This model was trained on the data from PubChem assay AID1706 (processed data is here). The model is an ensemble of 5 chemprop models augmented with RDKit features.

For this tutorial, we will be using the SARS model.

🚧

Tutorial for Python SDK

This tutorial uses the Layar Python SDK. If using the RESTful API, please adjust the script accordingly.

Part 1. Set Up

Import Dependencies

To start, let's import our dependencies.

# add dependencies
from __future__ import print_function
import layar_api
from layar_api.rest import ApiException
from pprint import pprint
import pandas as pd
from io import StringIO

Configure Authentication

Next, we'll want to configure our session with our authentication keys. Copy the following commands and only swap out the strings for base_host, client_id, and client_secret. The base_host is the Layar instance you're working within (e.g. 'demo.vyasa.com'), and the client ID and secret are your provided authentication keys.

To learn how to get your authentication keys, please reference this document or use the recipe below.

# set up your authentication credentials
base_host = 'BASE_URL' # your Layar instance (e.g. 'demo.vyasa.com')
client_id = 'AbcDEfghI3' # example developer API key
client_secret = '1ab23c4De6fGh7Ijkl8mNoPq9' #example developer API secret

# configure oauth access token for authorization
configuration = layar_api.Configuration()
configuration.host = f"https://{base_host}"
configuration.access_token = configuration.fetch_access_token(
    client_id, client_secret)

# Make your life easier for the next task: instantiating APIs!
client = layar_api.ApiClient(configuration)
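
Hard-coding keys in a script makes them easy to leak. As a minimal sketch, assuming you export three environment variables (the names LAYAR_BASE_HOST, LAYAR_CLIENT_ID, and LAYAR_CLIENT_SECRET are hypothetical and not part of the SDK), the three values could be loaded like this instead:

```python
import os

def load_layar_credentials(env=None):
    """Read Layar connection settings from environment variables.

    The variable names are hypothetical; rename them to fit your setup.
    `env` can be any mapping, which makes the function easy to test.
    """
    env = os.environ if env is None else env
    names = ("LAYAR_BASE_HOST", "LAYAR_CLIENT_ID", "LAYAR_CLIENT_SECRET")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return tuple(env[name] for name in names)

# Usage (replaces the three hard-coded assignments above):
# base_host, client_id, client_secret = load_layar_credentials()
```

The rest of the configuration code stays the same; only the source of the three strings changes.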

Instantiate Your APIs

Finally, we'll want to instantiate all the API classes we will be calling in the following steps.

# Instantiate APIs
module = layar_api.ModuleApi(client)
model = layar_api.ModelApi(client)
project = layar_api.ProjectApi(client)
computation = layar_api.ProjectComputationApi(client)
results = layar_api.SourceDocumentApi(client)

Part 2. Create a Project

Find the module ID

Using the search_modules endpoint from the ModuleApi, you will find the module of interest and its respective Layar ID.

Reminder: a module is a collection of deployed model iterations (like a collection of "versions" for a model).

## Find the module ID
rows = 10 # int | the number of rows to return (optional)
start = 0 # int | the start offset for the row (optional); 0 begins at the first result
q = 'ChemProp Molecular Property Prediction' # str | the query string to search for (optional)

try:
    api_response = module.search_modules(rows=rows, start=start, q=q)
    pprint(api_response)
except ApiException as e:
    print("Exception when calling ModuleApi->search_modules: %s\n" % e)

You should expect to see a response body similar to this example below, which shows the module's object and all of the metadata about this module:

[{'date_indexed': datetime.datetime(2021, 6, 26, 8, 9, 30, 806000, tzinfo=tzutc()),
 'description': 'Chemprop is a message passing neural network for molecular '
                'property prediction.',
 'enabled': True,
 'id': 'AXpHXdm3PTGp3VJdtpu5',
 'input_types': ['TABLE', 'SDF'],
 'module_type': 'chem_prop',
 'name': 'ChemProp Molecular Property Prediction'}]

The value of interest is the id, which is the module's Layar ID. In this case, our module's id is AXpHXdm3PTGp3VJdtpu5.

📘

Layar IDs Are Unique to Each Instance

Each Layar instance assigns its own unique Layar IDs, so a module in example.vyasa.com will have a different Layar ID than the same module in demo.vyasa.com.
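
The rows and start parameters of search_modules suggest offset-based paging. The helper below is a rough sketch of how every page could be collected; it assumes (this is not confirmed by the SDK documentation) that the endpoint returns fewer than rows items on the last page. The name fetch_all and the fetch_page callback are hypothetical.

```python
def fetch_all(fetch_page, rows=50):
    """Collect every result from an offset-paged search.

    fetch_page(start, rows) must return a list. A page shorter than
    `rows` is treated as the last one (an assumption about the endpoint).
    """
    results = []
    start = 0
    while True:
        page = fetch_page(start, rows) or []
        results.extend(page)
        if len(page) < rows:
            return results
        start += rows

# Usage with the ModuleApi instance from Part 1:
# modules = fetch_all(
#     lambda start, rows: module.search_modules(rows=rows, start=start, q=q))
```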

Get a list of the published models for this module

Next, we want to find all of the models published under this module, using the search_models_by_module_id endpoint in the ModelApi.

Use the module_id you found in the previous step.

## Find models by module ID
module_id = 'AXpHXdm3PTGp3VJdtpu5' # str | the module's Layar ID from the previous step

try:
    api_response = model.search_models_by_module_id(module_id)
    pprint(api_response)
except ApiException as e:
    print("Exception when calling ModelApi->search_models_by_module_id: %s\n" % e)

You should expect to see a response body similar to the example below, which shows all of the models within the module, as well as the metadata for each model. For ChemProp, you should expect to see two models ("SARS" and "Antibiotics") in your list:

[{'date_indexed': datetime.datetime(2021, 6, 26, 8, 9, 30, 905000, tzinfo=tzutc()),
 'description': None,
 'id': 'AXpHXdoaPTGp3VJdtpu7',
 'input_url': None,
 'module_id': 'AXpHXdm3PTGp3VJdtpu5',
 'name': 'SARS',
 'project_computation_id': None,
 'project_id': None,
 'url': 's3://cortex-resource-data/chem_prod/chemprop_model_sars.pt'},
 {'date_indexed': datetime.datetime(2021, 6, 26, 8, 9, 30, 894000, tzinfo=tzutc()),
 'description': None,
 'id': 'AXpHXdoQPTGp3VJdtpu6',
 'input_url': None,
 'module_id': 'AXpHXdm3PTGp3VJdtpu5',
 'name': 'Antibiotics',
 'project_computation_id': None,
 'project_id': None,
 'url': 's3://cortex-resource-data/chem_prod/chemprop_model_antibiotics.pt'}]

We'll use the SARS model for this tutorial. Save that model's Layar ID (AXpHXdoaPTGp3VJdtpu7), as we will use it later on.
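
Rather than copying IDs by hand from the printed output, you could look them up by name. The helper below is a small sketch (find_id_by_name is a hypothetical name, not an SDK function). It accepts plain dicts or objects with name and id attributes, since the exact response type depends on the SDK.

```python
def find_id_by_name(items, name):
    """Return the id of the first item whose name matches, else None.

    Works with plain dicts and with objects exposing .name / .id.
    """
    for item in items:
        if isinstance(item, dict):
            item_name, item_id = item.get("name"), item.get("id")
        else:
            item_name = getattr(item, "name", None)
            item_id = getattr(item, "id", None)
        if item_name == name:
            return item_id
    return None

# Trimmed-down copy of the model list shown above
models = [
    {"id": "AXpHXdoaPTGp3VJdtpu7", "name": "SARS"},
    {"id": "AXpHXdoQPTGp3VJdtpu6", "name": "Antibiotics"},
]
sars_model_id = find_id_by_name(models, "SARS")
print(sars_model_id)  # AXpHXdoaPTGp3VJdtpu7
```

The same helper works for the module list returned by search_modules.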

Create a project with the module ID

Finally, we're going to create a project using the create_project endpoint within the ProjectApi. Provide your project with a name and description (optional), and then identify which module you will be using via the module_id.

## create a project with the module ID
body = layar_api.Project( 
    name = 'ChemProp Example Project',
    description = 'This is a demo project description.',
    module_id = 'AXpHXdm3PTGp3VJdtpu5')

try:
    # Create a new project
    api_response = project.create_project(body)
    pprint(api_response)
except ApiException as e:
    print("Exception when calling ProjectApi->create_project: %s\n" % e)

You should expect to see a response body similar to this example below, which shows the details of the newly created project object:

{'created_by_user': 25007,
 'date_indexed': None,
 'description': 'This is a demo project description.',
 'id': 'AYMeU0-pW79yjX_cxPsZ',
 'module': None,
 'module_id': 'AXpHXdm3PTGp3VJdtpu5',
 'name': 'ChemProp Example Project'}

Remember this id ('AYMeU0-pW79yjX_cxPsZ'), as we'll need it later. This is the Project's Layar ID.

Now, you're ready to start a project computation!

Part 3. Start a Project Computation Job

Provide a list of inputs for ChemProp

When using ChemProp, you'll want to provide n molecules as SMILES strings (learn more about SMILES strings here).

For this tutorial, we've provided you with a sample list of SMILES strings for small molecules (n=14). This data was pulled from ChEMBL:

# Sample input: 14 SMILES strings
smilesStrings = [
    "CCCCCCCCCCCCCCCO",
    "O=P([O-])([O-])OC1C(O)C(O)C(O)C(O)C1O",
    "O=C1c2ccc(O)cc2OCC1c1ccc(O)cc1O",
    "COc1ccc(-c2cc(=O)c3c(O)cc([O-])cc3o2)cc1",
    "[O-]c1c(Cl)c(Cl)c(Cl)c(Cl)c1Cl",
    "C=C(C)C1CCC2(C1)C(C)=CC(O)CC2C",
    "COc1cc2c(cc1O)C1Cc3ccc4c(c3CN1CC2)OCO4",
    "[NH3+]C(CC1C=CC(O)CC1)C(=O)[O-]",
    "C1CC[NH2+]CC1",
    "[O-]c1c(O)c(Cl)cc(Cl)c1Cl",
    "[O-][n+]1ccccc1",
    "C[NH2+]C(C)C(O)c1ccccc1",
    "CC12CCC3c4ccc(O)cc4C(O)CC3C1CCC2O",
    "Oc1cc(O)cc(CCc2ccccc2)c1"
    ]

Convert your list of SMILES strings into a single text string, with each SMILES string separated by a new line.

smilesStringsAsText = '\n'.join(smilesStrings)

Alternatively, you can point to a text file with a list of SMILES strings in the same format (one SMILES string per line).
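
If your SMILES strings live in a text file, something like the following sketch could build the same newline-separated string. The function name load_smiles_text and the file name in the usage line are hypothetical; the only assumption about the file is one SMILES string per line.

```python
from pathlib import Path

def load_smiles_text(path):
    """Read a file with one SMILES string per line.

    Returns the strings joined by newlines, with blank lines and
    surrounding whitespace dropped.
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    smiles = [line.strip() for line in lines if line.strip()]
    return "\n".join(smiles)

# Usage:
# smilesStringsAsText = load_smiles_text("smiles.txt")
```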

Kick off a project computation

Using the run_project endpoint in the ProjectApi, start a computation run on a model within your project's module.

For us, we will be using only one of the models within the ChemProp module (model: SARS). Remember the model ID: AXpHXdoaPTGp3VJdtpu7? We're using it now! We're also using the Project ID from earlier as well.

Each model has its own computation_parameters, so check with your model to see what to include in this dictionary. For the SARS model in ChemProp, the only required parameter is the input SMILES strings (smiles_raw).

# Run Project Computation Job
body = layar_api.Body6(
    model_id = 'AXpHXdoaPTGp3VJdtpu7',
    computation_parameters = {
        "smiles_raw": smilesStringsAsText, # input SMILES strings
        "number_molecules": 30, # number of molecules to generate
        "task":"interpolate_smiles" # task in model to perform
        }
    )
project_id = 'AYMeU0-pW79yjX_cxPsZ' # Project ID

try:
    api_response = project.run_project(body, project_id)
    pprint(api_response)
except ApiException as e:
    print("Exception when calling ProjectApi->run_project: %s\n" % e)

You should expect to see a response that details the computation job. It includes the mode (computation_mode), the parameters used (computation_parameters), the computation job's Layar ID (id), and the status of the job at the time of submission (status).

{'computation_mode': 'EXECUTING',
 'computation_parameters': {'number_molecules': 30,
                            'smiles_raw': 'CCCCCCCCCCCCCCCO\n'
                                          'O=P([O-])([O-])OC1C(O)C(O)C(O)C(O)C1O\n'
                                          'O=C1c2ccc(O)cc2OCC1c1ccc(O)cc1O\n'
                                          'COc1ccc(-c2cc(=O)c3c(O)cc([O-])cc3o2)cc1\n'
                                          '[O-]c1c(Cl)c(Cl)c(Cl)c(Cl)c1Cl\n'
                                          'C=C(C)C1CCC2(C1)C(C)=CC(O)CC2C\n'
                                          'COc1cc2c(cc1O)C1Cc3ccc4c(c3CN1CC2)OCO4\n'
                                          '[NH3+]C(CC1C=CC(O)CC1)C(=O)[O-]\n'
                                          'C1CC[NH2+]CC1\n'
                                          '[O-]c1c(O)c(Cl)cc(Cl)c1Cl\n'
                                          '[O-][n+]1ccccc1\n'
                                          'C[NH2+]C(C)C(O)c1ccccc1\n'
                                          'CC12CCC3c4ccc(O)cc4C(O)CC3C1CCC2O\n'
                                          'Oc1cc(O)cc(CCc2ccccc2)c1',
                            'task': 'interpolate_smiles'},
 'created_by_user': 25007,
 'date_indexed': datetime.datetime(2022, 9, 8, 18, 19, 6, 491000, tzinfo=tzutc()),
 'id': '5d66cdff-2a3b-4a05-88b7-868f6c8043e2',
 'model_id': 'AXpHXdoaPTGp3VJdtpu7',
 'name': None,
 'project_id': 'AYMeU0-pW79yjX_cxPsZ',
 'source_document': None,
 'status': 'RUNNING'}

Save the id as your Project Computation Layar ID. We'll use it to check the status of the job in the next step.

Check the status of the project computation

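Using the get_project_computation endpoint in the ProjectComputationApi, look up the job by its Project Computation ID and print its status.

## Get project computation details
computation_id = '5d66cdff-2a3b-4a05-88b7-868f6c8043e2' # Project Computation ID

try:
    api_response = computation.get_project_computation(computation_id)
    pprint(api_response.status)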
except ApiException as e:
    print("Exception when calling ProjectComputationApi->get_project_computation: %s\n" % e)

You'll see the computation job's current status. Once the status of the computation is COMPLETE, proceed to the next step!

'COMPLETE'
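
Instead of re-running the status check by hand, you could poll until the job finishes. This is a sketch under assumptions: wait_for_status is a hypothetical helper, the interval and timeout defaults are arbitrary, and only the RUNNING and COMPLETE statuses appear in this tutorial, so any failure status your instance reports should be added to done by you.

```python
import time

def wait_for_status(fetch_status, done=("COMPLETE",), interval=10.0,
                    timeout=600.0, sleep=time.sleep):
    """Poll fetch_status() until it returns a value listed in `done`.

    Raises TimeoutError once roughly `timeout` seconds have been waited.
    `sleep` is injectable so the loop can be tested without real delays.
    """
    waited = 0.0
    while True:
        status = fetch_status()
        if status in done:
            return status
        if waited >= timeout:
            raise TimeoutError(f"Job still '{status}' after {waited:.0f} s")
        sleep(interval)
        waited += interval

# Usage with the ProjectComputationApi instance from Part 1:
# wait_for_status(
#     lambda: computation.get_project_computation(
#         '5d66cdff-2a3b-4a05-88b7-868f6c8043e2').status)
```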

Part 4. View Results

Get the results

To gather your results, you'll need to find the generated results document. You can find it by using the search_documents endpoint in the SourceDocumentApi.

We'll be using the Project Computation ID (5d66cdff-2a3b-4a05-88b7-868f6c8043e2) to search for your results.

## get the results from the computation
body = layar_api.SourceDocumentSearchCommand(
    project_computation_id = '5d66cdff-2a3b-4a05-88b7-868f6c8043e2',
    source_fields=['id']) # only return each document's id
x_vyasa_data_providers = base_host # The instance where the computation job was run

try:
    api_response = results.search_documents(body, x_vyasa_data_providers)
    # collect the ID of every matching document
    doc = [document.id for document in api_response]
    print(doc)
    #pprint(api_response)
except ApiException as e:
    print("Exception when calling SourceDocumentApi->search_documents: %s\n" % e)

Since we asked to return only the document's id, you should see just the ID below (rather than the full document object, which is a lot!).

['AYMeU8vBW79yjX_cxPse']

Download full results file

Using the download_document endpoint in the SourceDocumentApi, pass in the document ID from the previous step and download the file.

We load the downloaded CSV text into a pandas DataFrame via StringIO so that it prints as a table, but this is optional!

## download full results file
document_id = 'AYMeU8vBW79yjX_cxPse' # Results document ID from the previous step
try:
    # Download document by ID
    api_response = results.download_document(document_id)
    data = api_response.rstrip() # strip trailing whitespace (e.g. \r\n) from the file
    df = pd.read_csv(StringIO(data)) # parse the CSV text into a DataFrame
    print(df)
except ApiException as e:
    print("Exception when calling SourceDocumentApi->download_document: %s\n" % e)

Voila! You should see the following table in your terminal!

                                      smiles  activity
0                           CCCCCCCCCCCCCCCO  0.062236
1      O=P([O-])([O-])OC1C(O)C(O)C(O)C(O)C1O  0.172389
2            O=C1c2ccc(O)cc2OCC1c1ccc(O)cc1O  0.412367
3   COc1ccc(-c2cc(=O)c3c(O)cc([O-])cc3o2)cc1  0.408727
4             [O-]c1c(Cl)c(Cl)c(Cl)c(Cl)c1Cl  0.655236
5             C=C(C)C1CCC2(C1)C(C)=CC(O)CC2C  0.100300
6     COc1cc2c(cc1O)C1Cc3ccc4c(c3CN1CC2)OCO4  0.174929
7            [NH3+]C(CC1C=CC(O)CC1)C(=O)[O-]  0.065650
8                              C1CC[NH2+]CC1  0.067222
9                  [O-]c1c(O)c(Cl)cc(Cl)c1Cl  0.535207
10                           [O-][n+]1ccccc1  0.270997
11                   C[NH2+]C(C)C(O)c1ccccc1  0.113226
12         CC12CCC3c4ccc(O)cc4C(O)CC3C1CCC2O  0.105650
13                  Oc1cc(O)cc(CCc2ccccc2)c1  0.180133
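
A common next step is to keep only the molecules with the highest predicted probability. The sketch below works on the raw CSV text using only the standard library. It assumes the file has the smiles and activity columns shown above; the 0.5 cutoff is an arbitrary illustration, not a recommended threshold, and rank_by_activity is a hypothetical helper name.

```python
import csv
from io import StringIO

def rank_by_activity(csv_text, threshold=0.5):
    """Return (smiles, activity) pairs at or above threshold, highest first."""
    rows = csv.DictReader(StringIO(csv_text))
    scored = [(row["smiles"], float(row["activity"])) for row in rows]
    hits = [pair for pair in scored if pair[1] >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Three rows copied from the table above
sample = (
    "smiles,activity\n"
    "[O-]c1c(Cl)c(Cl)c(Cl)c(Cl)c1Cl,0.655236\n"
    "C1CC[NH2+]CC1,0.067222\n"
    "[O-]c1c(O)c(Cl)cc(Cl)c1Cl,0.535207\n"
)
for smiles, activity in rank_by_activity(sample):
    print(smiles, activity)
```

If you already have the DataFrame from the previous step, df[df["activity"] >= 0.5].sort_values("activity", ascending=False) gives the same selection.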

👍

Completed!

You've successfully created a project, used an available model, and generated some outcomes for your analysis. Well done!