
Assigning LLMs to GPUs

Introduction

The Layar config file allows you to assign specific models to specific GPUs, so users can run multiple models side by side and pick the right one for each use case.

📘

Plug and Play From HuggingFace

As of version 1.18, you can use models directly from HuggingFace. Currently, Certara.AI can only use Instruct models.

If you plan on using a model directly from HuggingFace, ensure you pull it with snapshot_download. An example Python script is below.

from huggingface_hub import snapshot_download

# Download the full model snapshot; replace the placeholders with your values
snapshot_download("<MODEL-CREATOR>/<MODEL-NAME>", token="<AUTH_TOKEN>")
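
For example, to pull the Mistral model used later in this guide (assuming your token has access to it):

snapshot_download("mistralai/Mistral-7B-Instruct-v0.1", token="<AUTH_TOKEN>")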

Accessing layar.config

  1. SSH into your environment.
  2. Run sudo -i
  3. Run vim /data/layar/layar.config
  4. Press i to start editing

⚠️ Be careful not to change anything other than what is outlined in this guide.

Assign GPUs to Model Server

TGI_NUM_GPU dictates how many GPUs will be used to service the model server. The number of GPUs assigned to the model server determines the throughput of the model.
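
For example, with the two-GPU layout used later in this guide, the entry in layar.config might look like the line below; the right value depends on how many GPUs your hardware actually exposes:

TGI_NUM_GPU=2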

📘

Model GPU Requirements

Before proceeding, please review Model Requirements.

All_Model_Info

Dictates which models run on which GPUs. The value is a string containing a JSON array of dictionaries.

All_Model_Info = '
                  [{"model_name":"mistralai/Mistral-7B-Instruct-v0.1",
                  "model_server_base":"certara-tgi",
                  "model_server_port":"8001",
                  "model_args":"--max-model-len 8192 --gpu-memory-utilization 0.45 --trust-remote-code --enforce-eager --kv-cache-dtype fp8",
                  "gpu_ids":[0]
                  },
                  {"model_name":"ibm-granite/granite-3b-code-instruct",
                  "model_server_base":"certara-tgi",
                  "model_server_port":"8001",
                  "model_args":"--max-model-len 2048 --gpu-memory-utilization 0.45 --trust-remote-code --enforce-eager --kv-cache-dtype fp8",
                  "gpu_ids":[0]
                  },
                  {"model_name":"casperhansen/llama-3-70b-instruct-awq",
                  "model_server_base":"certara-tgi",
                  "model_server_port":"9001",
                  "model_args":"--max-model-len 8192 --quantization marlin --gpu-memory-utilization 0.97 --trust-remote-code --enforce-eager --kv-cache-dtype fp8",
                  "gpu_ids":[1]
                  }]
               '

The example above assumes you have 2 GPUs available to assign to models. Let's go over each field in the dictionary.

model_name

A string value that contains the full name of the model as it appears on HuggingFace. For example, the full name of the default Mixtral model is casperhansen/mixtral-instruct-awq.

model_server_base

A string value that contains the name of the Certara model server. This should always be certara-tgi.

model_server_port

The port number the model will reside at. If you want to run two models on the same GPU, each model must use the same port. For example, Mistral-7B-Instruct-v0.1 and granite-3b-code-instruct share the same GPU IDs, so they require the same port number. If a model is going to be placed on its own GPU, you will need to give it a different port number. In the example above, llama-3-70b-instruct-awq is put on port 9001 because port 8001 is already taken by the models on GPU 0.
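Because this rule is easy to get wrong, here is a minimal sketch (not part of Layar) that checks an All_Model_Info value for port conflicts. It assumes you copy the JSON array, without the surrounding single quotes, into the ALL_MODEL_INFO string:

import json
from collections import defaultdict

# Paste the JSON array from All_Model_Info here (without the outer single quotes).
ALL_MODEL_INFO = """
[{"model_name": "mistralai/Mistral-7B-Instruct-v0.1", "model_server_port": "8001", "gpu_ids": [0]},
 {"model_name": "ibm-granite/granite-3b-code-instruct", "model_server_port": "8001", "gpu_ids": [0]},
 {"model_name": "casperhansen/llama-3-70b-instruct-awq", "model_server_port": "9001", "gpu_ids": [1]}]
"""

# Models sharing a GPU must share a port; models on separate GPUs need separate ports.
ports_by_gpu = defaultdict(set)
for model in json.loads(ALL_MODEL_INFO):
    for gpu in model["gpu_ids"]:
        ports_by_gpu[gpu].add(model["model_server_port"])

for gpu, ports in sorted(ports_by_gpu.items()):
    if len(ports) > 1:
        print(f"GPU {gpu}: conflicting ports {sorted(ports)}")
    else:
        print(f"GPU {gpu}: OK (port {next(iter(ports))})")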

model_args

A string value that contains the model arguments to be used for that specific model.

📘

Important Considerations

If you are planning on using a GPU to service multiple models, you must include the gpu-memory-utilization argument for each of them. Each value must be less than or equal to 1, and the combined utilization of the models sharing a GPU must also stay at or below 1. In the example above, the Mistral and Granite models are each given 45 percent of the GPU's memory, for a combined 90 percent.

In addition, models sharing a GPU must use the same model_server_port, as described above.
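
As a companion to the port check above, this hypothetical sketch sums the --gpu-memory-utilization values per GPU to confirm nothing is over-committed; it assumes the same pasted ALL_MODEL_INFO string:

import json
import re
from collections import defaultdict

# Paste the JSON array from All_Model_Info here (without the outer single quotes).
ALL_MODEL_INFO = """
[{"model_args": "--max-model-len 8192 --gpu-memory-utilization 0.45", "gpu_ids": [0]},
 {"model_args": "--max-model-len 2048 --gpu-memory-utilization 0.45", "gpu_ids": [0]},
 {"model_args": "--max-model-len 8192 --gpu-memory-utilization 0.97", "gpu_ids": [1]}]
"""

usage = defaultdict(float)
for model in json.loads(ALL_MODEL_INFO):
    # A model without an explicit value is assumed to claim the whole GPU.
    match = re.search(r"--gpu-memory-utilization\s+([\d.]+)", model["model_args"])
    fraction = float(match.group(1)) if match else 1.0
    for gpu in model["gpu_ids"]:
        usage[gpu] += fraction

for gpu, total in sorted(usage.items()):
    status = "OK" if total <= 1 else "over-committed"
    print(f"GPU {gpu}: {total:.2f} utilization ({status})")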

gpu_ids

A list of integers. Each value correlates to a specific GPU (typically the device indices reported by nvidia-smi). When the application stack starts, Layar will claim whichever GPUs it is able to utilize, up to the number of IDs you provide in the configuration.