Adjusting vllm_attention_backend
Introduction
Some models require a different vllm_attention_backend to run efficiently. This guide explains how to configure it in the layar.config file.
Editing Layar.config
- SSH into your environment.
- Run sudo -i
- Run vim /data/layar/layar.config
- Press i to start editing.
- The ALL_MODEL_INFO entry needs to include vllm_attention_backend in the JSON. An example for GPT-OSS is below.

ALL_MODEL_INFO = [{"model_name":"openai/gpt-oss-120b", "model_server_base":"certara_model_server", "model_server_port":"9001", "max_model_len":"128000", "gpu_memory_utilization":"0.90", "vllm_attention_backend":"MARLIN"}]

- Press Esc, then type :wq and press Enter to save and exit vim.
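Because the ALL_MODEL_INFO value must be valid JSON, a stray comma or an unquoted value will prevent the configuration from being parsed. As a sanity check before saving, you can paste the JSON portion of the line into a short Python snippet (a sketch using the example entry above; adapt the string to your own configuration):

```python
import json

# JSON array portion of the ALL_MODEL_INFO line from layar.config
all_model_info = (
    '[{"model_name":"openai/gpt-oss-120b",'
    '"model_server_base":"certara_model_server",'
    '"model_server_port":"9001",'
    '"max_model_len":"128000",'
    '"gpu_memory_utilization":"0.90",'
    '"vllm_attention_backend":"MARLIN"}]'
)

# json.loads raises json.JSONDecodeError if the string is malformed
models = json.loads(all_model_info)
for m in models:
    print(m["model_name"], "->", m.get("vllm_attention_backend", "(default)"))
```

If the snippet prints each model name with its backend (here, openai/gpt-oss-120b -> MARLIN), the JSON is well formed and safe to save.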
