EXPERT INSIGHTS

LLM Leaderboard

Knowing the best LLMs is key to building the best AI applications, but evaluating them is a daunting task. We've consolidated StackAI expert insights into custom dashboards as a resource for anyone building with AI.

MMLU Benchmarks

Comprehensive testing across 57 subjects including mathematics, history, law, and medicine to evaluate LLM knowledge breadth.

HumanEval+

Extended version of HumanEval that adds many more test cases per problem to check the correctness of generated code more rigorously.

GPQA Evaluation

Graduate-level expert knowledge evaluation designed to test advanced reasoning in specialized domains.

MT-Bench Analysis

Multi-turn benchmarking that evaluates conversation abilities, reasoning, and instruction following across complex dialogues.

SWE Benchmarks

Software engineering tests including code generation, debugging, and algorithm design to measure programming capabilities.

GSM8K Reasoning

Grade school math word problems requiring multi-step reasoning to evaluate logical thinking and problem-solving capabilities.

All the models in one place, sorted by price, context size, and parameters.

| Provider | Model Name | MMLU Score | Parameters | Context Size | Price |
| --- | --- | --- | --- | --- | --- |
| AI21 | Jamba Large 1.7 | | N/A | 256,000 | 3.5 |
| AI21 | Jamba Mini 2 | | N/A | 256,000 | 0.25 |
| Amazon | Nova 2 Lite | 80.9% | N/A | 1,000,000 | 0.85 |
| Amazon | Nova 2 Omni | | N/A | 1,000,000 | 0.85 |
| Amazon | Nova 2 Pro | | N/A | 1,000,000 | 3.438 |
| Amazon | Nova 2 Sonic | | N/A | 1,000,000 | 0.935 |
| Amazon | Nova Premier | | N/A | 1,000,000 | |
| Amazon | Nova Pro | 69.1% | N/A | 300,000 | 1.4 |
| Anthropic | Claude 3.5 Haiku | 63.4% | N/A | 200,000 | 1.6 |
| Anthropic | Claude 3.7 Sonnet | 80.3% | N/A | 200,000 | |
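The sort orders mentioned for this table (price, context size) can be sketched in a few lines of Python. This is only an illustration: `ModelRow`, `sort_by_price`, and `sort_by_context` are hypothetical names, the rows are a subset copied from the listing above, and a missing price is represented as `None` and pushed to the end, which is an assumption rather than the page's documented behavior.

```python
from typing import List, NamedTuple, Optional


class ModelRow(NamedTuple):
    provider: str
    name: str
    context_size: int          # tokens
    price: Optional[float]     # None where the listing shows no price


# Subset of the rows listed above.
ROWS = [
    ModelRow("AI21", "Jamba Large 1.7", 256_000, 3.5),
    ModelRow("AI21", "Jamba Mini 2", 256_000, 0.25),
    ModelRow("Amazon", "Nova 2 Lite", 1_000_000, 0.85),
    ModelRow("Amazon", "Nova Premier", 1_000_000, None),
    ModelRow("Amazon", "Nova Pro", 300_000, 1.4),
    ModelRow("Anthropic", "Claude 3.5 Haiku", 200_000, 1.6),
]


def sort_by_price(rows: List[ModelRow]) -> List[ModelRow]:
    """Cheapest first. Assumption: rows without a price sort last."""
    return sorted(rows, key=lambda r: (r.price is None, r.price or 0.0))


def sort_by_context(rows: List[ModelRow]) -> List[ModelRow]:
    """Largest context window first; ties keep their original order."""
    return sorted(rows, key=lambda r: r.context_size, reverse=True)


if __name__ == "__main__":
    for row in sort_by_price(ROWS):
        print(row.name, row.price)
```

Sorting on a tuple key keeps the "missing value" rule explicit instead of relying on a sentinel number that could be mistaken for a real price.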

Top models

The best LLMs in the world, sorted by price, context size, and parameters.

Top LLM Models by MMLU Score

The top LLMs for general knowledge and reasoning, ranked by MMLU score.

MMLU measures general knowledge and reasoning

| Model | MMLU Score |
| --- | --- |
| Claude 4.7 Opus | 92.8% |
| GPT-5.5 | 92.4% |
| Gemini 3 Pro | 89.8% |
| Gemini 3.1 Flash Lite | 89.2% |
| Claude 4.5 Opus | 88.9% |

[Chart: MMLU score by model, axis from 80% to 100%]

Fastest LLM Models by Throughput

The fastest LLMs ranked by tokens processed per second, measuring raw processing speed and efficiency.

Tokens processed per second - higher is better

| Model | Provider | Throughput (tokens/s) | Inference Speed (ms/token) | Latency |
| --- | --- | --- | --- | --- |
| Mercury 2 | Inception | 870.9 | | 3.67 |
| Granite 4.0 Small | IBM | 448.8 | | 0.55 |
| Nemotron 3 Super 120B | NVIDIA | 390.7 | | 0.7 |
| GPT-OSS 120B | OpenAI | 319.541 | | 0.47 |
| GPT-OSS 20B | OpenAI | 297.704 | | 0.503 |
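The listing pairs throughput (tokens/s) with an inference speed in ms per token and a latency figure. A rough sketch of how these could relate is shown below. It rests on two assumptions that the page does not confirm: that inference speed is simply the reciprocal of throughput, and that latency is a delay in seconds before generation starts. The function names are hypothetical.

```python
def ms_per_token(tokens_per_second: float) -> float:
    """Milliseconds spent per token, assuming it is the reciprocal of throughput."""
    if tokens_per_second <= 0:
        raise ValueError("throughput must be positive")
    return 1000.0 / tokens_per_second


def estimated_response_seconds(
    n_tokens: int, tokens_per_second: float, latency_seconds: float
) -> float:
    """Rough total time for a response of n_tokens.

    Assumption: latency is a fixed start-up delay in seconds, after which
    tokens arrive at a constant rate.
    """
    if tokens_per_second <= 0:
        raise ValueError("throughput must be positive")
    return latency_seconds + n_tokens / tokens_per_second


if __name__ == "__main__":
    # Throughput and latency values as listed above for two models.
    for name, tps, latency in [("Mercury 2", 870.9, 3.67), ("GPT-OSS 120B", 319.541, 0.47)]:
        total = estimated_response_seconds(500, tps, latency)
        print(f"{name}: {ms_per_token(tps):.2f} ms/token, ~{total:.2f} s for 500 tokens")
```

Under these assumptions a model with the highest throughput is not necessarily the quickest for short responses, because the fixed latency term dominates when few tokens are generated.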

[Chart: throughput in tokens per second by model, axis from 0 to 1,000]

Most Cost-Effective LLM Models

The most affordable LLMs ranked by cost per token, helping you optimize your budget without compromising quality.

Price per million tokens - lower is better

[Chart: price per million tokens by model, axis from $0.000 to $0.080. Models shown: Gemma 3n E4B, Llama 3.2 1B Instruct, Command R7B, Granite 4.0 Small, Qwen 3.5 9B]

| Model | Provider | Input Price ($/M) | Output Price ($/M) | Effective Price ($/M) |
| --- | --- | --- | --- | --- |
| Gemma 3n E4B | Google | 0.03 | 0.06 | 0.037 |
| Llama 3.2 1B Instruct | | 0.053 | 0.055 | 0.053 |
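The page does not say how the effective price is derived from the input and output prices. One common convention is a weighted blend, and the sketch below assumes a 3:1 input-to-output weighting. With that assumption the blend comes to 0.0375 for Gemma 3n E4B (listed as 0.037) and 0.0535 for Llama 3.2 1B Instruct (listed as 0.053), which is close but not a confirmation of the actual formula. The function names and the default weights are hypothetical.

```python
def blended_price(
    input_price: float,
    output_price: float,
    input_weight: float = 3.0,   # Assumption: 3 input tokens per output token
    output_weight: float = 1.0,
) -> float:
    """Weighted average price in $ per million tokens."""
    total_weight = input_weight + output_weight
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return (input_weight * input_price + output_weight * output_price) / total_weight


def request_cost(
    input_tokens: int, output_tokens: int, input_price: float, output_price: float
) -> float:
    """Dollar cost of one request, with prices given in $ per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    # Input and output prices as listed above for Gemma 3n E4B.
    print(f"blended: {blended_price(0.03, 0.06):.4f} $/M")
    print(f"10k in, 2k out: ${request_cost(10_000, 2_000, 0.03, 0.06):.6f}")
```

For budgeting a specific workload, `request_cost` with measured token counts is usually more informative than any single blended figure, since the real input-to-output ratio varies by application.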

Longest Context Window

Maximum number of tokens a model can process in a single input

While tokenization varies between models, on average, 1 token ≈ 3.5 characters in English

[Chart: context window in tokens by model]

| Model | Context Window (tokens) |
| --- | --- |
| Grok 4 Fast | 2,000,000 |
| Grok 4 Heavy | 2,000,000 |
| Grok 4.1 | 2,000,000 |
| Grok 4.20 | 2,000,000 |
| GPT 4.1 | 1,280,000 |
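The rule of thumb quoted above (1 token ≈ 3.5 characters of English) can be turned into a quick check of whether a document is likely to fit in a given context window. This is a rough sketch only: real tokenizers vary by model, and `estimate_tokens`, `fits_in_context`, and the `reserved_for_output` parameter are hypothetical names introduced for illustration.

```python
import math

# Average from the text above; actual tokenization differs between models.
CHARS_PER_TOKEN = 3.5


def estimate_tokens(text: str) -> int:
    """Approximate token count from character count, rounded up."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_in_context(text: str, context_window: int, reserved_for_output: int = 0) -> bool:
    """True if the estimated prompt size plus reserved output tokens fits.

    Assumption: the context window is shared between input and output.
    """
    return estimate_tokens(text) + reserved_for_output <= context_window


if __name__ == "__main__":
    document = "The quick brown fox jumps over the lazy dog. " * 1000
    print(estimate_tokens(document), "estimated tokens")
    print(fits_in_context(document, 200_000, reserved_for_output=4_000))
```

Because the estimate is approximate, it is safer to leave headroom rather than fill a window to the last token; an exact count requires the tokenizer of the specific model.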

Token Generation Speed

Observe how different processing speeds affect real-time token generation.

[Interactive demo: a sample passage is streamed at 1200 t/s and at 200 t/s; values reset every 5 seconds to demonstrate different speeds]
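The effect of generation speed on what a reader sees can be approximated without an interactive widget. The sketch below computes how much of a passage is visible after a given time at a given rate. It assumes a constant generation rate and uses whitespace-separated words as a stand-in for tokens, which real tokenizers do not do; `tokens_visible` and `visible_text` are hypothetical names.

```python
def tokens_visible(elapsed_seconds: float, tokens_per_second: float, total_tokens: int) -> int:
    """Number of tokens produced so far, capped at the length of the passage."""
    produced = int(elapsed_seconds * tokens_per_second)
    return max(0, min(total_tokens, produced))


def visible_text(text: str, elapsed_seconds: float, tokens_per_second: float) -> str:
    """Portion of `text` shown after `elapsed_seconds`.

    Assumption: one whitespace-separated word counts as one token.
    """
    words = text.split()
    count = tokens_visible(elapsed_seconds, tokens_per_second, len(words))
    return " ".join(words[:count])


if __name__ == "__main__":
    passage = "The quick brown fox jumps over the lazy dog. " * 20
    # The two speeds shown in the demo above.
    for rate in (1200, 200):
        shown = visible_text(passage, 0.1, rate)
        print(f"{rate} t/s after 0.1 s: {len(shown.split())} words visible")
```

At half a second, a 1200 t/s stream has produced 600 tokens while a 200 t/s stream has produced 100, which is the gap the demo is meant to make visible.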

Compare LLM Models

Compare any two LLM models side by side across different metrics, including MMLU, GPQA, HumanEval, DROP, Context Size, Parameters, Input Price, Output Price, Inference Speed, Throughput, and Latency.

Metric

Provider

MMLU Score

GPQA Score

HumanEval Score

Context Size

Parameters

Input Price

Throughput

Latency

Claude 3.5 Haiku

Anthropic