Lean GPU Configuration Considerations
LLMs can run on smaller GPUs. This guide covers the arguments needed in the Helm chart to allow a model to run across multiple small GPUs.
Model Arguments
For this guide, we will go over the arguments needed to run Llama 3 70B on four A10 GPUs. The argument string below can be used.
MODEL_ARGS: "--max-model-len 8192 --quantization awq --dtype half --gpu-memory-utilization 0.99 --trust-remote-code --enforce-eager --kv-cache-dtype fp8
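Each flag here trades capacity for memory: AWQ quantization shrinks the weights, the fp8 KV cache and the 8192-token cap shrink the cache, and eager mode avoids CUDA graph memory overhead. As a minimal sketch of where this string could live in a Helm values override (the surrounding "env" key is an assumption, check your chart's values.yaml for the real location):

env:
  # --max-model-len 8192           cap the context length so the KV cache fits
  # --quantization awq             load AWQ-quantized (4-bit) weights
  # --dtype half                   run activations in fp16
  # --gpu-memory-utilization 0.99  let vLLM claim nearly all of each GPU
  # --trust-remote-code            allow custom code shipped with the model repo
  # --enforce-eager                skip CUDA graph capture to save GPU memory
  # --kv-cache-dtype fp8           store the KV cache in fp8 to roughly halve its size
  MODEL_ARGS: "--max-model-len 8192 --quantization awq --dtype half --gpu-memory-utilization 0.99 --trust-remote-code --enforce-eager --kv-cache-dtype fp8"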
Layar Config Changes
In addition to the model arguments, the Layar config variables need to be changed. In this example, we'll assign 2 GPUs to the LLM server workers.
TGI_MODEL: casperhansen/llama-3-70b-instruct-awq
VLLM_GPU_COUNT: 2
TGI_NUM_GPU: 2
TGI_GPU_INSTANCE: "nvidia.com/gpu"
Utilizing MIG
MIG can also be used to give more GPU instances to the LLM server workers. When you set TGI_GPU_INSTANCE, you need to provide the specific MIG instance profile you want to use, e.g. TGI_GPU_INSTANCE: "nvidia.com/mig-3g.20gb".
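For illustration, assuming the cluster's NVIDIA device plugin advertises that MIG profile as a schedulable resource, the rendered request would take this shape (a sketch, not the chart's actual output):

resources:
  limits:
    nvidia.com/mig-3g.20gb: 2   # two 3g.20gb slices instead of two full GPUs

This lets several server workers share one large MIG-capable GPU, with each slice behaving like a smaller dedicated device.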