Assigning LLMs to GPUs
Introduction
The Layar config file allows you to assign specific models to specific GPUs, so users can run multiple models depending on the use case.
Plug and Play From HuggingFace
As of version 1.18, you can use models directly from HuggingFace. Currently, Certara.AI can only use Instruct models.
If you plan on using a model directly from HuggingFace, ensure you are pulling the model using snapshot_download. An example Python script to do so is below.
from huggingface_hub import snapshot_download

# Download the full model repository. Replace the placeholders with the
# model's repo ID and your HuggingFace access token.
snapshot_download("<MODEL-CREATOR>/<MODEL-NAME>", use_auth_token="<AUTH_TOKEN>")
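By default, snapshot_download saves the repository into the local Hugging Face cache (~/.cache/huggingface/hub on Linux) and returns the path to the downloaded snapshot.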
Accessing layar.config
- SSH into your environment.
- Run sudo -i to switch to the root user.
- Run vim /data/layar/layar.config to open the config file.
- Press i to start editing.
- When you are finished, press Esc, then type :wq and press Enter to save and exit.
⚠️ Be careful not to change anything other than what is outlined in this guide.
Assign GPUs to Model Server
TGI_NUM_GPU dictates how many GPUs will be used to service the model server. The number of GPUs assigned to the model server determines the throughput of the models.
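For illustration, a minimal sketch of the setting in layar.config, assuming you want to dedicate two GPUs to the model server (the value 2 is illustrative):

TGI_NUM_GPU = 2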
Model GPU Requirements
Before proceeding, please review Model Requirements.
All_Model_Info
All_Model_Info dictates which models run on which GPUs. The value is a string containing a JSON array of dictionaries.
All_Model_Info = '
[{"model_name":"mistralai/Mistral-7B-Instruct-v0.1",
"model_server_base":"certara-tgi",
"model_server_port":"8001",
"model_args":"--max-model-len 8192 --gpu-memory-utilization 0.45 --trust-remote-code --enforce-eager --kv-cache-dtype fp8",
"gpu_ids":[0]
},
{"model_name":"ibm-granite/granite-3b-code-instruct",
"model_server_base":"certara-tgi",
"model_server_port":"8001",
"model_args":"--max-model-len 2048 --gpu-memory-utilization 0.45 --trust-remote-code --enforce-eager --kv-cache-dtype fp8",
"gpu_ids":[0]
},
{"model_name":"casperhansen/llama-3-70b-instruct-awq",
"model_server_base":"certara-tgi",
"model_server_port":"9001",
"model_args":"--max-model-len 8192 --quantization marlin --gpu-memory-utilization 0.97 --trust-remote-code --enforce-eager --kv-cache-dtype fp8",
"gpu_ids":[1]
}]
'
The above assumes you have 2 GPUs available to assign to models. Let's go over each individual field in the dictionary.
model_name
A string value that contains the full name of the model. For example, the default Mixtral model's full name is casperhansen/mixtral-instruct-awq.
model_server_base
A string value that contains the name of the Certara model server. This should always be certara-tgi.
model_server_port
The port number the model will reside at. If you want to run two models on the same GPU, the port must be the same for each model. For example, Mistral-7B-Instruct-v0.1 and granite-3b-code-instruct share the same GPU IDs, so they require the same port number. If a model is going to be placed on its own GPU, you will need to use a different port number. In the example above, llama-3-70b-instruct-awq is placed on port 9001 because the 8000 range is already taken.
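Condensed from the example above, the port and GPU pairings look like this:

Mistral-7B-Instruct-v0.1     port 8001, gpu_ids [0]
granite-3b-code-instruct     port 8001, gpu_ids [0]  (shares GPU 0, so shares port 8001)
llama-3-70b-instruct-awq     port 9001, gpu_ids [1]  (own GPU, so its own port)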
model_args
A string value that contains the model arguments to be used for that specific model.
Important Considerations
If you are planning on using a GPU to service multiple models, you must include the gpu-memory-utilization argument for each model. Each value must be less than or equal to 1. In the example above, the Mistral and Granite models are each given 45 percent of the GPU's memory. In addition, you will need to make sure the combined gpu-memory-utilization values of all models sharing a GPU do not exceed 1 (here, 0.45 + 0.45 = 0.90).
gpu_ids
A list of integers. Each value correlates to a specific GPU. Layar will take up whichever GPUs it is able to utilize when the application stack starts, up to the number of IDs you provide in the configuration.
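For instance, if you wanted a single model to span both GPUs instead, its entry could list both IDs, as sketched below. This excerpt is hypothetical; whether a given model can be split across GPUs depends on its size, so check the Model Requirements referenced above first.

"gpu_ids":[0, 1]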
