Adjusting vllm_attention_backend
Introduction
Some models require a different vllm_attention_backend to run efficiently. This guide explains how to configure it in the layar.config file.
Editing Layar.config
- SSH into your environment.
- Run sudo -i
- Run vim /data/layar/layar.config
- Press i to start editing.
- The ALL_MODEL_INFO entry needs to include vllm_attention_backend in the JSON. An example for GPT-OSS is below.

ALL_MODEL_INFO = [{"model_name":"openai/gpt-oss-120b", "model_server_base":"certara_model_server", "model_server_port":"9001", "max_model_len":"128000", "gpu_memory_utilization":"0.90", "vllm_attention_backend":"MARLIN"}]

- Press Esc, then type :wq and press Enter to save and exit vim.
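Because the ALL_MODEL_INFO value must be valid JSON, a stray comma or an unquoted value will prevent the configuration from being parsed. As a sanity check before saving, you can paste the JSON portion of the line into a short Python snippet (a sketch using the example entry above; adapt the string to your own configuration):

```python
import json

# JSON array portion of the ALL_MODEL_INFO line from layar.config
all_model_info = (
    '[{"model_name":"openai/gpt-oss-120b",'
    '"model_server_base":"certara_model_server",'
    '"model_server_port":"9001",'
    '"max_model_len":"128000",'
    '"gpu_memory_utilization":"0.90",'
    '"vllm_attention_backend":"MARLIN"}]'
)

# json.loads raises json.JSONDecodeError if the string is malformed
models = json.loads(all_model_info)
for m in models:
    print(m["model_name"], "->", m.get("vllm_attention_backend", "(default)"))
```

If the snippet prints each model name with its backend (here, openai/gpt-oss-120b -> MARLIN), the JSON is well formed and safe to save.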
