
Assigning Models to GPUs

Introduction

The Layar config file allows you to assign specific models to specific GPUs, so users can run multiple models side by side and choose the right one for each use case.

Accessing layar.config

  1. SSH into your environment.
  2. Run sudo -i to switch to the root user.
  3. Run vim /data/layar/layar.config
  4. Press i to enter insert mode and start editing.

⚠️ Be careful not to change anything other than what is outlined in this guide.

Assign GPUs to Model Server

TGI_NUM_GPU dictates how many GPUs will be used to service the model server. The number of GPUs assigned to the model server determines the throughput of the model.
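For example, on a host with two GPUs dedicated to the model server, the relevant line in layar.config might look like the sketch below (the value 2 is illustrative; set it to match your hardware):

```shell
# /data/layar/layar.config
# Illustrative value -- set to the number of GPUs the model server should use
TGI_NUM_GPU=2
```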

📘

Model GPU Requirements

Before proceeding, please review Model Requirements.

All_Model_Info

Dictates which models are used on which GPUs. The value is a string containing a JSON array of dictionaries.

All_Model_Info = '
                  [{"model_name":"mistralai/Mistral-7B-Instruct-v0.1",
                  "model_server":"http://certara-tgi:9000/v1/completions",
                  "model_args":"--max-model-len 8192 --gpu-memory-utilization 0.45 --trust-remote-code --enforce-eager --kv-cache-dtype fp8",
                  "gpu_ids":[0]
                  },
                  {"model_name":"ibm-granite/granite-3b-code-instruct",
                  "model_server":"http://certara-tgi:8000/v1/completions",
                  "model_args":"--max-model-len 2048 --gpu-memory-utilization 0.45 --trust-remote-code --enforce-eager --kv-cache-dtype fp8",
                  "gpu_ids":[0]
                  },
                  {"model_name":"casperhansen/llama-3-70b-instruct-awq",
                  "model_server":"http://certara-tgi:7000/v1/completions",
                  "model_args":"--max-model-len 8192 --quantization marlin --gpu-memory-utilization 0.97 --trust-remote-code --enforce-eager --kv-cache-dtype fp8",
                  "gpu_ids":[1]
                  }]
               '

The example above assumes you have 2 GPUs available to assign to models. Let's go over each individual field in the dictionary.
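Because the value is a single JSON string, a typo (a stray quote, a missing comma) will break every model entry at once. One low-risk way to check your edit before restarting Layar is to parse a copy of the value in Python. This is a sketch using the first two entries from the example above; it is not part of Layar itself:

```python
import json

# Paste the contents of All_Model_Info (between the outer single quotes) here.
all_model_info = '''
[{"model_name": "mistralai/Mistral-7B-Instruct-v0.1",
  "model_server": "http://certara-tgi:9000/v1/completions",
  "model_args": "--max-model-len 8192 --gpu-memory-utilization 0.45 --trust-remote-code --enforce-eager --kv-cache-dtype fp8",
  "gpu_ids": [0]},
 {"model_name": "ibm-granite/granite-3b-code-instruct",
  "model_server": "http://certara-tgi:8000/v1/completions",
  "model_args": "--max-model-len 2048 --gpu-memory-utilization 0.45 --trust-remote-code --enforce-eager --kv-cache-dtype fp8",
  "gpu_ids": [0]}]
'''

# json.loads raises a ValueError with a line/column if the string is malformed.
models = json.loads(all_model_info)
for m in models:
    # Every entry needs all four fields described below.
    assert {"model_name", "model_server", "model_args", "gpu_ids"} <= m.keys()
print(f"Parsed {len(models)} model entries")
```

If the script prints a parse error instead, fix the reported position in layar.config before restarting any services.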

model_name

A string value that contains the full name of the model. For example, the default Mixtral model's full name is casperhansen/mixtral-instruct-awq.

model_server

A string value that contains the URL of the model server. The port number needs to be different for each model added to the dictionary.
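The unique-port rule is easy to verify mechanically. This sketch (not part of Layar) pulls the port out of each model_server URL from the example above and confirms there are no duplicates:

```python
from urllib.parse import urlparse

# The three model_server URLs from the example All_Model_Info value.
servers = [
    "http://certara-tgi:9000/v1/completions",
    "http://certara-tgi:8000/v1/completions",
    "http://certara-tgi:7000/v1/completions",
]

ports = [urlparse(u).port for u in servers]
# Each model must listen on its own port, so duplicates are a config error.
assert len(ports) == len(set(ports)), "duplicate model_server ports"
print(ports)
```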

model_args

A string value that contains the model arguments to be used for that specific model.

📘

Important Consideration

If you are planning on using a GPU to service multiple models, you must include the gpu-memory-utilization argument. This value must be less than or equal to 1. In the example above, the Mistral and Granite models are each given 45 percent of the GPU's memory.
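Put differently: the gpu-memory-utilization fractions of all models sharing a GPU must sum to at most 1. This sketch checks that constraint for the example config; the entries are hypothetical mirrors of the values above, and the 0.9 fallback reflects vLLM's documented default when the flag is omitted:

```python
import re
from collections import defaultdict

# (gpu_ids, model_args) pairs mirroring the example All_Model_Info entries.
models = [
    ([0], "--max-model-len 8192 --gpu-memory-utilization 0.45"),
    ([0], "--max-model-len 2048 --gpu-memory-utilization 0.45"),
    ([1], "--max-model-len 8192 --gpu-memory-utilization 0.97"),
]

usage = defaultdict(float)
for gpu_ids, args in models:
    match = re.search(r"--gpu-memory-utilization\s+([\d.]+)", args)
    frac = float(match.group(1)) if match else 0.9  # vLLM default is 0.9
    for gid in gpu_ids:
        usage[gid] += frac

for gid, total in usage.items():
    # Models sharing a GPU must not claim more than 100% of its memory.
    assert total <= 1.0, f"GPU {gid} oversubscribed: {total:.2f}"
print(dict(usage))
```

Here GPU 0 ends up at 0.90 (0.45 + 0.45) and GPU 1 at 0.97, so both pass; adding a third model to GPU 0 without lowering the fractions would trip the assertion.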

gpu_ids

A list of integers. Each value correlates to a specific GPU ID, which can be found by running nvidia-smi. Adding more IDs to the list allows the model to run across multiple GPUs, increasing prompt throughput.