# Rankings

## Model Leaderboard
Models are ranked by overall Score; lower cost and faster response times count as advantages.
| # | Model | Score | Arena ELO | MMLU | HumanEval | MATH | Context | Speed | Input Cost | Tags |
|---|---|---|---|---|---|---|---|---|---|---|
| 🥇 | o3 (OpenAI) | 93.7 | 1391 | 91.6% | 96.4% | 97.8% | 200K | 4200ms | $10.00 | MM, Flagship |
| 🥈 | Claude Opus 4 (Anthropic) | 90.7 | 1395 | 93.2% | 95.6% | 86.0% | 200K | 1100ms | $15.00 | MM, Flagship |
| 🥉 | Claude Sonnet 4.6 (Anthropic) | 88.3 | 1374 | 92.1% | 94.8% | 83.7% | 200K | 790ms | $3.00 | MM, Flagship |
| 4 | DeepSeek R1 (DeepSeek) | 88.2 | 1358 | 90.8% | 92.1% | 97.3% | 64K | 2800ms | $0.55 | OSS, Flagship |
| 5 | Claude Sonnet 4.5 (Anthropic) | 86.9 | 1362 | 91.7% | 94.2% | 81.5% | 200K | 800ms | $3.00 | MM, Flagship |
| 6 | Gemini 2.5 Pro (Google DeepMind) | 85.9 | 1380 | 90.0% | 87.9% | 91.2% | 1.0M | 1050ms | $1.25 | MM, Flagship |
| 7 | Claude Sonnet 4 (Anthropic) | 85.4 | 1345 | 91.0% | 93.5% | 79.2% | 200K | 820ms | $3.00 | MM, Flagship |
| 8 | GPT-4.1 (OpenAI) | 85.2 | 1340 | 90.2% | 97.1% | 86.5% | 1.0M | 880ms | $2.00 | MM, Flagship |
| 9 | DeepSeek V3 (DeepSeek) | 81.0 | 1302 | 88.5% | 89.1% | 87.2% | 128K | 680ms | $0.27 | OSS, Flagship |
| 10 | Claude 3.5 Sonnet (Anthropic) | 80.1 | 1289 | 88.7% | 92.0% | 71.1% | 200K | 780ms | $3.00 | MM, Flagship |
| 11 | GPT-4o (OpenAI) | 79.5 | 1285 | 88.7% | 90.2% | 76.6% | 128K | 850ms | $5.00 | MM, Flagship |
| 12 | Llama 4 Maverick (Meta) | 78.9 | 1285 | 88.7% | 89.8% | 74.9% | 1.0M | 1150ms | $0.19 | OSS, MM, Flagship |
| 13 | Llama 3.1 405B (Meta) | 77.6 | 1266 | 88.6% | 89.0% | 73.8% | 128K | 1200ms | $3.00 | OSS, Flagship |
| 14 | Qwen 2.5 72B (Alibaba) | 77.4 | 1259 | 86.0% | 86.6% | 83.1% | 128K | 750ms | $0.40 | OSS, Flagship |
| 15 | Grok-2 (xAI) | 77.3 | 1248 | 87.5% | 88.4% | 76.1% | 131K | 890ms | $2.00 | MM, Flagship |
| 16 | Gemini 2.0 Flash (Google DeepMind) | 75.7 | 1252 | 85.0% | 87.4% | 73.0% | 1.0M | 520ms | $0.10 | MM, Efficient |
| 17 | Gemini 1.5 Pro (Google DeepMind) | 74.4 | 1266 | 85.9% | 84.1% | 67.7% | 1.0M | 920ms | $3.50 | MM, Flagship |
| 18 | Llama 4 Scout (Meta) | 74.1 | 1248 | 87.1% | 86.5% | 67.4% | 10.0M | 680ms | $0.08 | OSS, MM, Efficient |
| 19 | Mistral Large 2 (Mistral AI) | 73.9 | 1232 | 84.0% | 92.0% | 69.3% | 128K | 650ms | $3.00 | OSS, Flagship |
| 20 | GPT-4o mini (OpenAI) | 70.0 | 1179 | 82.0% | 87.2% | 70.2% | 128K | 420ms | $0.15 | MM, Efficient |
| 21 | Claude 3 Haiku (Anthropic) | 63.2 | 1168 | 75.2% | 75.9% | 60.4% | 200K | 380ms | $0.25 | MM, Efficient |
| 22 | Command R+ (Cohere) | 61.6 | 1155 | 75.7% | 69.6% | 56.7% | 128K | 720ms | $2.50 | Flagship |
Legend: OSS = open source · MM = multimodal · For the Speed and Input Cost columns, lower is better.
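A static table can't be re-sorted by clicking, but the same data is easy to re-rank offline. The sketch below uses a small sample of rows from the table above (field names are my own, chosen to mirror the column headings) and sorts them by any one metric, treating speed and input cost as ascending (lower = better):

```python
# Sample of leaderboard rows from the table above; field names are
# illustrative, chosen to mirror the column headings.
rows = [
    {"model": "o3",            "score": 93.7, "speed_ms": 4200, "input_cost": 10.00},
    {"model": "Claude Opus 4", "score": 90.7, "speed_ms": 1100, "input_cost": 15.00},
    {"model": "DeepSeek V3",   "score": 81.0, "speed_ms":  680, "input_cost":  0.27},
    {"model": "GPT-4o mini",   "score": 70.0, "speed_ms":  420, "input_cost":  0.15},
]

# Metrics where a lower value is better, per the legend.
ASCENDING = {"speed_ms", "input_cost"}

def resort(rows, metric):
    """Return rows re-ranked by one metric, best first."""
    return sorted(rows, key=lambda r: r[metric], reverse=metric not in ASCENDING)

cheapest_first = resort(rows, "input_cost")
print([r["model"] for r in cheapest_first])
# GPT-4o mini comes first: it has the lowest input cost in this sample.
```

Sorting by `"score"` instead reproduces the table's default order for these four rows.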