AI Model Rankings
Independent benchmark data for 22 large language models. No affiliation, no marketing.
22 models · 9 providers · 6 benchmarks · 7 open source
| # | Model | Provider | Score | MMLU | HumanEval | MATH | GPQA | Arena ELO | Input $/M | Context |
|---|-------|----------|-------|------|-----------|------|------|-----------|-----------|---------|
| 1 | o3 | OpenAI | 93.7 | 91.6% | 96.4% | 97.8% | 87.7% | 1391 | $10.00 | 200K |
| 2 | Claude Opus 4 | Anthropic | 90.7 | 93.2% | 95.6% | 86.0% | 74.2% | 1395 | $15.00 | 200K |
| 3 | Claude Sonnet 4.6 | Anthropic | 88.3 | 92.1% | 94.8% | 83.7% | 69.1% | 1374 | $3.00 | 200K |
| 4 | DeepSeek R1 | DeepSeek | 88.2 | 90.8% | 92.1% | 97.3% | 71.5% | 1358 | $0.55 | 64K |
| 5 | Claude Sonnet 4.5 | Anthropic | 86.9 | 91.7% | 94.2% | 81.5% | 67.4% | 1362 | $3.00 | 200K |
| 6 | Gemini 2.5 Pro | Google DeepMind | 85.9 | 90.0% | 87.9% | 91.2% | 59.1% | 1380 | $1.25 | 1.0M |
| 7 | Claude Sonnet 4 | Anthropic | 85.4 | 91.0% | 93.5% | 79.2% | 65.8% | 1345 | $3.00 | 200K |
| 8 | GPT-4.1 | OpenAI | 85.2 | 90.2% | 97.1% | 86.5% | 56.8% | 1340 | $2.00 | 1.0M |
| 9 | DeepSeek V3 | DeepSeek | 81.0 | 88.5% | 89.1% | 87.2% | 51.3% | 1302 | $0.27 | 128K |
| 10 | Claude 3.5 Sonnet | Anthropic | 80.1 | 88.7% | 92.0% | 71.1% | 59.4% | 1289 | $3.00 | 200K |
| 11 | GPT-4o | OpenAI | 79.5 | 88.7% | 90.2% | 76.6% | 53.6% | 1285 | $5.00 | 128K |
| 12 | Llama 4 Maverick | Meta | 78.9 | 88.7% | 89.8% | 74.9% | 52.8% | 1285 | $0.19 | 1.0M |
| 13 | Llama 3.1 405B | Meta | 77.6 | 88.6% | 89.0% | 73.8% | 51.1% | 1266 | $3.00 | 128K |
| 14 | Qwen 2.5 72B | Alibaba | 77.4 | 86.0% | 86.6% | 83.1% | 49.0% | 1259 | $0.40 | 128K |
| 15 | Grok-2 | xAI | 77.3 | 87.5% | 88.4% | 76.1% | 56.0% | 1248 | $2.00 | 131K |
| 16 | Gemini 2.0 Flash | Google DeepMind | 75.7 | 85.0% | 87.4% | 73.0% | 51.0% | 1252 | $0.10 | 1.0M |
| 17 | Gemini 1.5 Pro | Google DeepMind | 74.4 | 85.9% | 84.1% | 67.7% | 46.2% | 1266 | $3.50 | 1.0M |
| 18 | Llama 4 Scout | Meta | 74.1 | 87.1% | 86.5% | 67.4% | 47.1% | 1248 | $0.08 | 10.0M |
| 19 | Mistral Large 2 | Mistral AI | 73.9 | 84.0% | 92.0% | 69.3% | 45.0% | 1232 | $3.00 | 128K |
| 20 | GPT-4o mini | OpenAI | 70.0 | 82.0% | 87.2% | 70.2% | 40.2% | 1179 | $0.15 | 128K |
| 21 | Claude 3 Haiku | Anthropic | 63.2 | 75.2% | 75.9% | 60.4% | 33.3% | 1168 | $0.25 | 200K |
| 22 | Command R+ | Cohere | 61.6 | 75.7% | 69.6% | 56.7% | 38.3% | 1155 | $2.50 | 128K |

Score = composite score (see methodology below). Input $/M = USD per million input tokens. Context = maximum context window in tokens.
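The "Compare models" tool described below does this interactively; for working with the rows offline, here is a minimal Python sketch. The `Model` dataclass and `compare` helper are hypothetical illustrations, not this site's schema or API; the two example rows are copied from the table above.

```python
from dataclasses import dataclass, fields

@dataclass
class Model:
    """One leaderboard row. Field names are illustrative, not the site's schema."""
    name: str
    provider: str
    score: float               # composite score (see methodology below)
    mmlu: float                # percent
    humaneval: float           # percent
    math: float                # percent
    gpqa: float                # percent
    arena_elo: int
    input_usd_per_mtok: float  # USD per million input tokens
    context_tokens: int

# Two rows copied verbatim from the table above.
o3 = Model("o3", "OpenAI", 93.7, 91.6, 96.4, 97.8, 87.7, 1391, 10.00, 200_000)
r1 = Model("DeepSeek R1", "DeepSeek", 88.2, 90.8, 92.1, 97.3, 71.5, 1358, 0.55, 64_000)

def compare(a: Model, b: Model) -> None:
    """Print every numeric metric side by side, like the comparison page does."""
    print(f"{'metric':>20}  {a.name:>12} | {b.name:>12}")
    for f in fields(Model):
        va, vb = getattr(a, f.name), getattr(b, f.name)
        if isinstance(va, (int, float)):
            print(f"{f.name:>20}  {va:>12} | {vb:>12}")

compare(o3, r1)
```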
Explore the data

Full leaderboard →

Compare models
Pick any two models and compare every metric side by side: benchmarks, cost, speed, and context window.
Open comparison →

Browse by provider
View all 22 models grouped by provider, with badges and individual capability profiles.
Browse models →

AI news feed
Aggregated updates from top AI research labs and publications, refreshed every 5 minutes.
Read the news →

Scoring methodology
How models are ranked

The composite score is a weighted blend of six benchmarks; the weights sum to 100%.

- MMLU (20% weight): knowledge breadth across 57 university-level subjects.
- HumanEval (code generation, 20% weight): Python function generation from docstrings.
- MATH (reasoning, 15% weight): competition mathematics requiring multi-step solutions.
- GPQA (science, 15% weight): graduate-level expert science questions.
- MT-Bench (dialogue, 15% weight): multi-turn conversation quality, judged by GPT-4.
- Arena ELO (human votes, 15% weight): Elo ratings from head-to-head human preference battles.
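The weights above are published, but the page does not say how Arena ELO (an open-ended rating) or MT-Bench (scored 1-10) are mapped onto the same 0-100 scale as the percentage benchmarks. The sketch below is one plausible reconstruction, assuming min-max scaling of ELO across the listed models; the function names, the scaling choice, and the MT-Bench value are assumptions for illustration only.

```python
# Benchmark weights exactly as listed above; they sum to 1.0.
WEIGHTS = {
    "mmlu": 0.20,
    "humaneval": 0.20,
    "math": 0.15,
    "gpqa": 0.15,
    "mt_bench": 0.15,
    "arena_elo": 0.15,
}

def elo_expected(r_a: float, r_b: float) -> float:
    """Standard Elo formula: probability that model A wins a single
    head-to-head human-preference battle against model B."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def normalize_elo(elo: float, lo: float = 1155.0, hi: float = 1395.0) -> float:
    """Min-max map an Arena ELO onto 0-100 using the table's extremes.
    ASSUMPTION: the page does not document its actual normalization."""
    return 100.0 * (elo - lo) / (hi - lo)

def composite(metrics: dict[str, float]) -> float:
    """Weighted sum of the six benchmarks, all expressed on a 0-100 scale."""
    m = dict(metrics)
    m["arena_elo"] = normalize_elo(m["arena_elo"])
    return sum(WEIGHTS[k] * m[k] for k in WEIGHTS)

# DeepSeek R1's row from the table. MT-Bench is not shown on the page, so a
# made-up 9.0/10 (scaled to 90.0) stands in purely for illustration.
print(round(composite({
    "mmlu": 90.8, "humaneval": 92.1, "math": 97.3,
    "gpqa": 71.5, "mt_bench": 90.0, "arena_elo": 1358.0,
}), 1))  # -> 88.1

# Elo is relative, not absolute: o3 (1391) vs Claude Opus 4 (1395)
# is close to a coin flip under the standard expected-score formula.
print(round(elo_expected(1391.0, 1395.0), 3))  # -> 0.494
```

Under these assumptions the example lands at 88.1, close to DeepSeek R1's published 88.2, but that agreement says nothing about the site's real pipeline; treat the normalization as a placeholder.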