
Model Benchmarks

Introduction

This guide goes over the models that are compatible with Layar and details how they perform relative to one another.

Quantized VS Unquantized

Layar is compatible with both quantized and unquantized models. Benchmarking material isn't available for every quantized model. An unquantized model will always outperform its quantized version, provided you have the hardware needed to run it.

📘

Model Requirements

You can find the supported models and their hardware requirements here: Model Requirements

Benchmark Metrics

There are several benchmark metrics to understand in order to decide whether a model is a good fit for your use case.

Average

This is the average of all the benchmark metrics below.
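Conceptually, this is just an arithmetic mean. Below is a minimal Python sketch with placeholder values (not real benchmark results) showing how the average relates to the individual metrics.

```python
# Minimal sketch: the "Average" is the arithmetic mean of the individual
# benchmark scores. The values below are placeholders, not real results.
scores = {
    "IFEVAL": 75.0,
    "BBH": 50.0,
    "MATH": 20.0,
    "GPQA": 10.0,
    "MuSR": 15.0,
    "MMLU-Pro": 40.0,
}

average = sum(scores.values()) / len(scores)
print(f"Average: {average:.2f}")  # 35.00 for these placeholder values
```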

Instruction-Following Evaluation (IFEVAL)

Scores the model's ability to follow instructions. Good examples of this would be respecting a specified word limit or populating a response with only the requested information. This is relevant for programmatic usage of Layar, since a response in a specific format may be necessary to avoid mishaps in code execution.
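For example, a pipeline might ask the model to return only a JSON array and break if the model appends extra prose. The sketch below assumes a hypothetical `query_layar()` helper standing in for your actual Layar client call; it only illustrates why strict instruction following matters in code.

```python
import json

def query_layar(prompt: str) -> str:
    """Hypothetical helper standing in for a real Layar client call.
    Here it just returns a canned response so the sketch runs."""
    return '["headache", "nausea", "fatigue"]'

prompt = (
    "List the adverse events mentioned in the document. "
    "Respond with a JSON array of strings and nothing else."
)

raw = query_layar(prompt)
try:
    adverse_events = json.loads(raw)  # breaks if the model adds extra prose
except json.JSONDecodeError:
    # A model with a low IFEVAL score is more likely to end up here.
    adverse_events = []

print(adverse_events)
```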

Big Bench Hard (BBH)

Scores the model's ability to handle tasks that are difficult for an LLM, such as using common sense or world knowledge to adjust its responses. A good example of this would be prompting a model to "Respond like you are a medical writer." A model with a high BBH score would be able to distinguish between placebo and non-placebo adverse events without the user having to specify that these numbers should be separated.

The BBH score also covers the model's mathematical reasoning and language understanding.

Mathematics Aptitude Test of Heuristics (MATH)

Scores the model's ability to do high-school-level math. This is relevant for use cases that rely on the model performing calculations.

Graduate-Level Google-Proof Q&A (GPQA)

Scores the model's ability to understand PhD-level concepts, for example abstract scientific concepts. This is relevant for science-heavy use cases. Some concepts may be beyond the reach of the LLM, which can result in incorrect responses. This matters in RAG use cases where the text may not include all of the answers and a proper response requires understanding of detailed concepts.

Multistep Soft Reasoning (MuSR)

Scores the model's reasoning over long text. This is relevant for RAG use cases that involve very large documents as well as large prompts. Models with a lower MuSR score may need more robust prompting in order to generate proper responses from very long documents or RAG use cases that utilize a lot of text chunks.

Massive Multitask Language Understanding - Professional (MMLU-Pro)

Scores the model's ability to choose the correct answer from a handful of options. This metric overlaps with several of the others, but it is relevant if your use case presents many plausible answers of which only one is correct. RAG can help ensure the model isn't faced with multiple seemingly correct answers.

Model Performance

Let's go over the scores for each available model.

Mistral-7B-Instruct-V0.1

Mixtral-instruct-awq

🚧 The above metrics are for the unquantized version of Mixtral, which is not supported in Layar. The quantized (AWQ) version supported in Layar will have lower scores.

Llama-3-70b-instruct

🚧 There is not a good source of benchmarking for this model. However, it is an older version of the current Llama 3 models. It's highly recommended to go with Llama 3.3 instead.

Meta-Llama-3.1-70B-Instruct

Meta-Llama-3.1-8B-Instruct

Llama-3.3-70B-Instruct

Llama-3.3-8B-Instruct

🚧 There is not a good source of benchmarking for this model. However, it is a lower-parameter version of the 70B model, meaning it will perform worse than the higher-parameter version.

Granite 3/8/20/34B

🚧 The above benchmarking is for the 8B version of the model. As a general rule of thumb, the more parameters a model has, the better its inferencing, which results in better responses. A higher parameter count also requires more VRAM in order to utilize the model to its fullest.

📘

Low Parameter VS High Parameter

The higher-parameter model will perform better than the lower-parameter model. However, this does not mean a lower-parameter model will be unable to handle your use cases. Deciding between the two will depend on the hardware you have available. If you have limited hardware, it is better to use a lower-parameter model, since it will afford you a higher context length and allow you to scale the deployment more easily. A rough way to compare memory needs is sketched below.
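As a back-of-the-envelope comparison, you can estimate the VRAM needed just to hold a model's weights from its parameter count and precision. The sketch below is only a rough estimate under stated assumptions; real deployments also need memory for the KV cache, activations, and framework overhead, so refer to the Model Requirements page for actual requirements.

```python
def estimate_weight_vram_gb(params_billions: float, bytes_per_param: float) -> float:
    """Rough estimate of the VRAM (GB) needed just to load the model weights.
    bytes_per_param: ~2 for fp16/bf16 (unquantized), ~0.5 for 4-bit quantization.
    Real usage is higher: KV cache, activations, and framework overhead add more."""
    return params_billions * 1e9 * bytes_per_param / (1024 ** 3)

# Example: compare an 8B and a 70B model, unquantized vs. 4-bit quantized.
for name, params in [("8B", 8), ("70B", 70)]:
    fp16 = estimate_weight_vram_gb(params, 2.0)
    q4 = estimate_weight_vram_gb(params, 0.5)
    print(f"{name}: ~{fp16:.0f} GB fp16 weights, ~{q4:.0f} GB 4-bit weights")
```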