See How the Top LLMs Stack Up
LLM Leaderboard
Knowing the best LLMs is key to building the best AI applications, but evaluating them is a daunting task. We've consolidated StackAI expert insights on custom dashboards as a resource for anyone building with AI.
MMLU Benchmarks
Comprehensive testing across 57 subjects including mathematics, history, law, and medicine to evaluate LLM knowledge breadth.
HumanEval+
Extended version of HumanEval that adds far more test cases to each programming problem, checking the functional correctness of generated code more rigorously.
GPQA Evaluation
Graduate-level expert knowledge evaluation designed to test advanced reasoning in specialized domains.
MT-Bench Analysis
Multi-turn benchmarking that evaluates conversation abilities, reasoning, and instruction following across complex dialogues.
SWE Benchmarks
Software engineering tests including code generation, debugging, and algorithm design to measure programming capabilities.
GSM8K Reasoning
Grade school math word problems requiring multi-step reasoning to evaluate logical thinking and problem-solving capabilities.
| Provider | Model Name | MMLU Score | Parameters | Context Size | Price |
| --- | --- | --- | --- | --- | --- |
| AI21 | Jamba Large 1.7 | N/A | N/A | 256,000 | 3.5 |
| AI21 | Jamba Mini 2 | N/A | N/A | 256,000 | 0.25 |
| Amazon | Nova 2 Lite | 80.9% | N/A | 1,000,000 | 0.85 |
| Amazon | Nova 2 Omni | N/A | N/A | 1,000,000 | 0.85 |
| Amazon | Nova 2 Pro | N/A | N/A | 1,000,000 | 3.438 |
| Amazon | Nova 2 Sonic | N/A | N/A | 1,000,000 | 0.935 |
| Amazon | Nova Premier | N/A | N/A | 1,000,000 | 5 |
| Amazon | Nova Pro | 69.1% | N/A | 300,000 | 1.4 |
| Anthropic | Claude 3.5 Haiku | 63.4% | N/A | 200,000 | 1.6 |
| Anthropic | Claude 3.7 Sonnet | 80.3% | N/A | 200,000 | 7 |
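As a rough illustration of how a leaderboard like this can be ranked, the sketch below sorts the listed models by MMLU score and skips the ones with no reported score. The row layout and the `rank_by_mmlu` helper are hypothetical names chosen for this example, and `None` stands in for the "N/A" entries.

```python
# Rows copied from the leaderboard: (provider, model, MMLU score in %, context size).
# None stands in for "N/A" (no score reported).
ROWS = [
    ("AI21", "Jamba Large 1.7", None, 256_000),
    ("AI21", "Jamba Mini 2", None, 256_000),
    ("Amazon", "Nova 2 Lite", 80.9, 1_000_000),
    ("Amazon", "Nova 2 Omni", None, 1_000_000),
    ("Amazon", "Nova 2 Pro", None, 1_000_000),
    ("Amazon", "Nova 2 Sonic", None, 1_000_000),
    ("Amazon", "Nova Premier", None, 1_000_000),
    ("Amazon", "Nova Pro", 69.1, 300_000),
    ("Anthropic", "Claude 3.5 Haiku", 63.4, 200_000),
    ("Anthropic", "Claude 3.7 Sonnet", 80.3, 200_000),
]


def rank_by_mmlu(rows):
    """Return model names that report an MMLU score, highest score first."""
    scored = [row for row in rows if row[2] is not None]
    ranked = sorted(scored, key=lambda row: row[2], reverse=True)
    return [row[1] for row in ranked]


if __name__ == "__main__":
    for position, name in enumerate(rank_by_mmlu(ROWS), start=1):
        print(position, name)
```

Models without a reported score are left out rather than treated as zero, so a missing value is never read as a poor result.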
Top LLM Models by MMLU Score
The top LLMs for breadth of knowledge, ranked by MMLU score across 57 subjects including mathematics, history, law, and medicine.
Fastest LLM Models by Throughput
The fastest LLMs ranked by tokens processed per second, measuring raw processing speed and efficiency.
Most Cost-Effective LLM Models
The most affordable LLMs ranked by cost per token, helping you optimize your budget without compromising quality.
Longest Context Window
The LLMs with the largest context windows, ranked by the maximum number of tokens a model can process in a single input.
Observe how different processing speeds affect real-time token generation.
[Interactive demo: the same sample passage streamed at 1200 t/s, 200 t/s, and 40 t/s.]
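To put the three demo speeds in perspective, generation time is roughly the number of output tokens divided by throughput. The sketch below applies that arithmetic to a hypothetical 600-token response. The response length and the `seconds_to_generate` helper are assumptions for illustration, and the estimate ignores latency before the first token.

```python
def seconds_to_generate(num_tokens: int, tokens_per_second: float) -> float:
    """Estimate streaming time as tokens divided by throughput (t/s)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


if __name__ == "__main__":
    RESPONSE_TOKENS = 600  # hypothetical response length
    for speed in (1200, 200, 40):  # the three demo speeds
        elapsed = seconds_to_generate(RESPONSE_TOKENS, speed)
        print(f"{speed} t/s: {elapsed:.1f} s")
```

Under these assumptions the same response takes half a second at 1200 t/s, 3 seconds at 200 t/s, and 15 seconds at 40 t/s.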
Compare any two LLM models side by side across different metrics, including MMLU, GPQA, HumanEval, DROP, Context Size, Parameters, Input Price, Output Price, Inference Speed, Throughput, and Latency.
| Metric | Claude 3.5 Haiku | Claude 3.7 Sonnet |
| --- | --- | --- |
| Provider | Anthropic | Anthropic |
| MMLU Score | 63.4% | 80.3% |
| GPQA Score | 40.8% | 65.6% |
| HumanEval Score | 75.6% | 92.1% |
| Context Size | 200,000 | 200,000 |
| Parameters | N/A | N/A |
| Input Price | 0.8 | 3 |
| Throughput | 49.093 | N/A |
| Latency | 0.689 | N/A |
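A side-by-side comparison like this one comes down to per-metric differences. The sketch below computes them for the two models shown and skips any metric that either model does not report. The dictionary layout and the `metric_deltas` helper are hypothetical, and the values are copied as listed, with no assumption about their units.

```python
# Values copied from the comparison; None stands in for "N/A".
CLAUDE_35_HAIKU = {
    "MMLU": 63.4, "GPQA": 40.8, "HumanEval": 75.6,
    "Input Price": 0.8, "Throughput": 49.093, "Latency": 0.689,
}
CLAUDE_37_SONNET = {
    "MMLU": 80.3, "GPQA": 65.6, "HumanEval": 92.1,
    "Input Price": 3, "Throughput": None, "Latency": None,
}


def metric_deltas(baseline: dict, other: dict) -> dict:
    """Return other minus baseline for every metric both models report."""
    deltas = {}
    for metric, base_value in baseline.items():
        other_value = other.get(metric)
        if base_value is None or other_value is None:
            continue  # skip metrics missing on either side
        deltas[metric] = round(other_value - base_value, 3)
    return deltas


if __name__ == "__main__":
    for metric, delta in metric_deltas(CLAUDE_35_HAIKU, CLAUDE_37_SONNET).items():
        print(f"{metric}: {delta:+}")
```

Throughput and latency drop out of the result because Claude 3.7 Sonnet has no reported values for them.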
