
Adjusting vllm_attention_backend

Introduction

Some models require a different vllm_attention_backend to run efficiently. This guide covers how to configure that setting in the layar.config file.

Editing layar.config

  1. SSH into your environment.
  2. Run sudo -i to become root.
  3. Run vim /data/layar/layar.config
  4. Press i to enter insert mode.
  5. The ALL_MODEL_INFO entry needs to include vllm_attention_backend in its JSON. An example for GPT-OSS is below.
    ALL_MODEL_INFO = [{"model_name": "openai/gpt-oss-120b", "model_server_base": "certara_model_server", "model_server_port": "9001", "max_model_len": "128000", "gpu_memory_utilization": "0.90", "vllm_attention_backend": "MARLIN"}]
  6. Press Esc, then type :wq and press Enter to save the file and exit vim.
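A malformed ALL_MODEL_INFO value is a common source of startup failures, so it can be worth checking that the JSON parses before restarting anything. The sketch below is illustrative only and not part of Layar; it simply parses the example value from step 5 and confirms each model entry carries the backend override.

```python
import json

# The JSON array assigned to ALL_MODEL_INFO in layar.config (copied from the example above).
raw = (
    '[{"model_name": "openai/gpt-oss-120b", '
    '"model_server_base": "certara_model_server", '
    '"model_server_port": "9001", '
    '"max_model_len": "128000", '
    '"gpu_memory_utilization": "0.90", '
    '"vllm_attention_backend": "MARLIN"}]'
)

models = json.loads(raw)  # raises json.JSONDecodeError if the value is malformed
for entry in models:
    # Every model entry should carry the vllm_attention_backend override.
    assert "vllm_attention_backend" in entry, f"missing backend for {entry.get('model_name')}"
    print(entry["model_name"], "->", entry["vllm_attention_backend"])
```

Pasting your own ALL_MODEL_INFO value into raw gives a quick sanity check that the edit did not break the JSON syntax.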