Lean GPU Configuration Considerations

LLMs can run on smaller GPUs. This guide covers the arguments needed in the Helm chart to allow a model to run across multiple small GPUs.

Model Arguments

This guide covers the arguments needed for Llama 3 to run on 4 A10 GPUs. The following argument string can be used.

MODEL_ARGS: "--max-model-len 8192 --quantization awq --dtype half --gpu-memory-utilization 0.99 --trust-remote-code --enforce-eager --kv-cache-dtype fp8"
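
To make each flag in the argument string above easier to review, here is a minimal Python sketch that assembles the same string from a dictionary. The names MODEL_FLAGS and build_model_args are hypothetical and not part of Layar or the Helm chart. The per-flag comments reflect how these options are generally understood in vLLM, which is an assumption about the serving backend.

```python
import shlex

# Flags copied from the MODEL_ARGS string; None marks a switch with no value.
MODEL_FLAGS = {
    "--max-model-len": "8192",           # maximum context length in tokens
    "--quantization": "awq",             # load AWQ-quantized weights
    "--dtype": "half",                   # run activations in 16-bit floats
    "--gpu-memory-utilization": "0.99",  # fraction of GPU memory the server may claim
    "--trust-remote-code": None,         # allow custom code shipped with the model
    "--enforce-eager": None,             # skip CUDA graph capture to save memory
    "--kv-cache-dtype": "fp8",           # store the KV cache in 8-bit floats
}


def build_model_args(flags):
    """Join flag/value pairs into one space-separated argument string."""
    parts = []
    for name, value in flags.items():
        parts.append(name)
        if value is not None:
            parts.append(value)
    return shlex.join(parts)


model_args = build_model_args(MODEL_FLAGS)
print(f'MODEL_ARGS: "{model_args}"')
```

Keeping the flags in a dictionary makes it harder to drop the closing quote or a value when adjusting a single setting, such as lowering --max-model-len for a smaller GPU.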

Layar Config Changes

In addition to the model arguments, the Layar config variables need to be changed. In this example, we'll assign 2 GPUs to be used as server workers.

TGI_MODEL: casperhansen/llama-3-70b-instruct-awq
VLLM_GPU_COUNT: 2
TGI_NUM_GPU: 2
TGI_GPU_INSTANCE: "nvidia.com/gpu"
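
As a rough sanity check on why the memory-saving arguments matter with this config, the following Python sketch estimates how much memory the weights take on each GPU. It is back-of-the-envelope arithmetic only. The 4-bit weight size, the 24 GB per A10, and the round 70 billion parameter count are assumptions, and the function name per_gpu_budget is hypothetical. Real usage will be higher because of quantization metadata, activations, and runtime overhead.

```python
# Rough per-GPU memory budget; every figure here is an approximation.
PARAMS = 70e9          # assumption: parameter count implied by the "70b" model name
BITS_PER_WEIGHT = 4    # assumption: AWQ checkpoints typically use 4-bit weights
GPU_COUNT = 2          # VLLM_GPU_COUNT / TGI_NUM_GPU from the config above
GPU_MEMORY_GB = 24     # assumption: one NVIDIA A10 has 24 GB of memory
UTILIZATION = 0.99     # --gpu-memory-utilization from MODEL_ARGS


def per_gpu_budget(params, bits, gpu_count, gpu_gb, utilization):
    """Return (weights per GPU in GB, GB left over for KV cache and overhead)."""
    total_weights_gb = params * bits / 8 / 1e9
    weights_per_gpu = total_weights_gb / gpu_count
    usable_gb = gpu_gb * utilization
    return weights_per_gpu, usable_gb - weights_per_gpu


weights_gb, headroom_gb = per_gpu_budget(
    PARAMS, BITS_PER_WEIGHT, GPU_COUNT, GPU_MEMORY_GB, UTILIZATION
)
print(f"weights per GPU: {weights_gb:.2f} GB")
print(f"headroom per GPU: {headroom_gb:.2f} GB")
```

Under these assumptions the weights fill most of each GPU, which is consistent with the high memory utilization setting and the 8-bit KV cache in the argument string.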

Utilizing MIG

MIG (Multi-Instance GPU) can also be used to give more GPU instances to the LLM server workers. When you set TGI_GPU_INSTANCE, you need to provide the specific MIG instance you want to use, for example TGI_GPU_INSTANCE: "nvidia.com/mig-3g.20gb".
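
Since a mistyped resource name is easy to miss, a small check before deploying may help. The Python sketch below is illustrative only: the function parse_gpu_instance is hypothetical, and the pattern accepts just the two forms shown in this guide (a whole GPU, or a MIG profile written as compute slices and memory size). Other profile naming variants would need the pattern extended.

```python
import re

# Matches "nvidia.com/gpu" or "nvidia.com/mig-<slices>g.<memory>gb".
_RESOURCE_RE = re.compile(r"^nvidia\.com/(gpu|mig-(\d+)g\.(\d+)gb)$")


def parse_gpu_instance(name):
    """Parse a TGI_GPU_INSTANCE value, raising ValueError if it is malformed."""
    match = _RESOURCE_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized GPU resource name: {name!r}")
    if match.group(1) == "gpu":
        return {"kind": "gpu"}
    return {
        "kind": "mig",
        "compute_slices": int(match.group(2)),  # the "3g" part of 3g.20gb
        "memory_gb": int(match.group(3)),       # the "20gb" part of 3g.20gb
    }


print(parse_gpu_instance("nvidia.com/gpu"))
print(parse_gpu_instance("nvidia.com/mig-3g.20gb"))
```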