
Model Benchmarks

Introduction

This guide goes over the models that are compatible with Layar and details how they perform relative to one another.

Quantized VS Unquantized

Layar is compatible with both quantized and unquantized models. Benchmarking material isn't available for every quantized model. An unquantized model will always outperform its quantized version, provided you have the hardware needed to run it.

📘

Model Requirements

You can find the supported models and their hardware requirements here: Model Requirements

Benchmark Metrics

There are several benchmark metrics to understand in order to decide whether a model is a good fit for your use case.

Average

This is the average of all the benchmark metrics below.
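Conceptually, this is just an arithmetic mean. Below is a minimal Python sketch with placeholder values (not real benchmark results) showing how the average relates to the individual metrics.

```python
# Minimal sketch: the "Average" is the arithmetic mean of the individual
# benchmark scores. The values below are placeholders, not real results.
scores = {
    "IFEVAL": 75.0,
    "BBH": 50.0,
    "MATH": 20.0,
    "GPQA": 10.0,
    "MuSR": 15.0,
    "MMLU-Pro": 40.0,
}

average = sum(scores.values()) / len(scores)
print(f"Average: {average:.2f}")  # 35.00 for these placeholder values
```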

Instruction-Following Evaluation (IFEVAL)

Scores the model's ability to follow instructions. Good examples of this would be respecting a specified word limit or populating a response with only the requested information. This is relevant for programmatic usage of Layar, since a response in a specific format may be necessary to avoid mishaps in code execution.
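For example, a pipeline might ask the model to return only a JSON array and break if the model appends extra prose. The sketch below assumes a hypothetical `query_layar()` helper standing in for your actual Layar client call; it only illustrates why strict instruction following matters in code.

```python
import json

def query_layar(prompt: str) -> str:
    """Hypothetical helper standing in for a real Layar client call.
    Here it just returns a canned response so the sketch runs."""
    return '["headache", "nausea", "fatigue"]'

prompt = (
    "List the adverse events mentioned in the document. "
    "Respond with a JSON array of strings and nothing else."
)

raw = query_layar(prompt)
try:
    adverse_events = json.loads(raw)  # breaks if the model adds extra prose
except json.JSONDecodeError:
    # A model with a low IFEVAL score is more likely to end up here.
    adverse_events = []

print(adverse_events)
```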

Big Bench Hard (BBH)

Scores the model's ability to handle tasks that are difficult for an LLM, such as using common sense or world knowledge to adjust its responses. A good example of this would be prompting a model to "Respond like you are a medical writer." A model with a high BBH score would be able to distinguish between placebo and non-placebo adverse events without the user having to specify that these numbers should be separated.

The BBH score also covers the model's mathematical reasoning and language understanding.

Mathematics Aptitude Test of Heuristics (MATH)

Scores the model's ability to do high-school-level math. This is relevant for use cases that rely on the model performing calculations.

Graduate-Level Google-Proof Q&A (GPQA)

Scores the model's ability to understand PhD-level concepts, for example abstract scientific concepts. This is relevant for science-heavy use cases. Some concepts may be beyond the reach of the LLM, which can result in incorrect responses. This matters in RAG use cases where the text may not include all of the answers and a proper response requires understanding of detailed concepts.

Multistep Soft Reasoning (MuSR)

Scores the model's reasoning over long text. This is relevant for RAG use cases that involve very large documents as well as large prompts. Models with a lower MuSR score may need more robust prompting in order to generate proper responses from very long documents or RAG use cases that utilize a lot of text chunks.

Massive Multitask Language Understanding - Professional (MMLU-Pro)

Scores the model's ability to choose the correct answer from a handful of options. This metric overlaps with several of the others, but it is relevant if your use case presents many plausible answers of which only one is correct. RAG can help ensure the model isn't faced with multiple seemingly correct answers.

Model Performance

Let's go over the scores for each available model.

Mistral-7B-Instruct-V0.1

Mixtral-instruct-awq

🚧 The above metrics are for the unquantized version of Mixtral, which is not supported in Layar. The quantized (AWQ) version supported in Layar will have lower scores.

Llama-3-70b-instruct

🚧 There is not a good source of benchmarking for this model. However, it is an older version of the current Llama 3 models. It's highly recommended to go with Llama 3.3 instead.

Meta-Llama-3.1-70B-Instruct

Meta-Llama-3.1-8B-Instruct

Llama-3.3-70B-Instruct

Llama-3.3-8B-Instruct

🚧 There is not a good source of benchmarking for this model. However, it is a lower-parameter version of the 70B model, meaning it will perform worse than the higher-parameter version.

Granite 3/8/20/34B

🚧 The above benchmarking is for the 8B version of the model. As a general rule of thumb, the more parameters a model has, the better its inferencing, which results in better responses. A higher parameter count also requires more VRAM in order to utilize the model to its fullest.

📘

Low Parameter VS High Parameter

The higher-parameter model will perform better than the lower-parameter model. However, this does not mean a lower-parameter model will be unable to handle your use cases. Deciding between the two will depend on the hardware you have available. If you have limited hardware, it is better to use a lower-parameter model, since it will afford you a higher context length and allow you to scale the deployment more easily. A rough way to compare memory needs is sketched below.
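As a back-of-the-envelope comparison, you can estimate the VRAM needed just to hold a model's weights from its parameter count and precision. The sketch below is only a rough estimate under stated assumptions; real deployments also need memory for the KV cache, activations, and framework overhead, so refer to the Model Requirements page for actual requirements.

```python
def estimate_weight_vram_gb(params_billions: float, bytes_per_param: float) -> float:
    """Rough estimate of the VRAM (GB) needed just to load the model weights.
    bytes_per_param: ~2 for fp16/bf16 (unquantized), ~0.5 for 4-bit quantization.
    Real usage is higher: KV cache, activations, and framework overhead add more."""
    return params_billions * 1e9 * bytes_per_param / (1024 ** 3)

# Example: compare an 8B and a 70B model, unquantized vs. 4-bit quantized.
for name, params in [("8B", 8), ("70B", 70)]:
    fp16 = estimate_weight_vram_gb(params, 2.0)
    q4 = estimate_weight_vram_gb(params, 0.5)
    print(f"{name}: ~{fp16:.0f} GB fp16 weights, ~{q4:.0f} GB 4-bit weights")
```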