Model Leaderboard

Click any column heading to re-sort the table. Green marks the best value in each metric; lower cost and faster response are treated as advantages.

| # | Model | Org | Score | Arena ELO | MMLU | HumanEval | MATH | Context | Speed | Input Cost | Tags |
|---|-------|-----|-------|-----------|------|-----------|------|---------|-------|------------|------|
| 🥇 | o3 | OpenAI | 93.7 | 1391 | 91.6% | 96.4% | 97.8% | 200K | 4200ms | $10.00 | MM, Flagship |
| 🥈 | Claude Opus 4 | Anthropic | 90.7 | 1395 | 93.2% | 95.6% | 86.0% | 200K | 1100ms | $15.00 | MM, Flagship |
| 🥉 | Claude Sonnet 4.6 | Anthropic | 88.3 | 1374 | 92.1% | 94.8% | 83.7% | 200K | 790ms | $3.00 | MM, Flagship |
| 4 | DeepSeek R1 | DeepSeek | 88.2 | 1358 | 90.8% | 92.1% | 97.3% | 64K | 2800ms | $0.55 | OSS, Flagship |
| 5 | Claude Sonnet 4.5 | Anthropic | 86.9 | 1362 | 91.7% | 94.2% | 81.5% | 200K | 800ms | $3.00 | MM, Flagship |
| 6 | Gemini 2.5 Pro | Google DeepMind | 85.9 | 1380 | 90.0% | 87.9% | 91.2% | 1.0M | 1050ms | $1.25 | MM, Flagship |
| 7 | Claude Sonnet 4 | Anthropic | 85.4 | 1345 | 91.0% | 93.5% | 79.2% | 200K | 820ms | $3.00 | MM, Flagship |
| 8 | GPT-4.1 | OpenAI | 85.2 | 1340 | 90.2% | 97.1% | 86.5% | 1.0M | 880ms | $2.00 | MM, Flagship |
| 9 | DeepSeek V3 | DeepSeek | 81.0 | 1302 | 88.5% | 89.1% | 87.2% | 128K | 680ms | $0.27 | OSS, Flagship |
| 10 | Claude 3.5 Sonnet | Anthropic | 80.1 | 1289 | 88.7% | 92.0% | 71.1% | 200K | 780ms | $3.00 | MM, Flagship |
| 11 | GPT-4o | OpenAI | 79.5 | 1285 | 88.7% | 90.2% | 76.6% | 128K | 850ms | $5.00 | MM, Flagship |
| 12 | Llama 4 Maverick | Meta | 78.9 | 1285 | 88.7% | 89.8% | 74.9% | 1.0M | 1150ms | $0.19 | OSS, MM, Flagship |
| 13 | Llama 3.1 405B | Meta | 77.6 | 1266 | 88.6% | 89.0% | 73.8% | 128K | 1200ms | $3.00 | OSS, Flagship |
| 14 | Qwen 2.5 72B | Alibaba | 77.4 | 1259 | 86.0% | 86.6% | 83.1% | 128K | 750ms | $0.40 | OSS, Flagship |
| 15 | Grok-2 | xAI | 77.3 | 1248 | 87.5% | 88.4% | 76.1% | 131K | 890ms | $2.00 | MM, Flagship |
| 16 | Gemini 2.0 Flash | Google DeepMind | 75.7 | 1252 | 85.0% | 87.4% | 73.0% | 1.0M | 520ms | $0.10 | MM, Efficient |
| 17 | Gemini 1.5 Pro | Google DeepMind | 74.4 | 1266 | 85.9% | 84.1% | 67.7% | 1.0M | 920ms | $3.50 | MM, Flagship |
| 18 | Llama 4 Scout | Meta | 74.1 | 1248 | 87.1% | 86.5% | 67.4% | 10.0M | 680ms | $0.08 | OSS, MM, Efficient |
| 19 | Mistral Large 2 | Mistral AI | 73.9 | 1232 | 84.0% | 92.0% | 69.3% | 128K | 650ms | $3.00 | OSS, Flagship |
| 20 | GPT-4o mini | OpenAI | 70.0 | 1179 | 82.0% | 87.2% | 70.2% | 128K | 420ms | $0.15 | MM, Efficient |
| 21 | Claude 3 Haiku | Anthropic | 63.2 | 1168 | 75.2% | 75.9% | 60.4% | 200K | 380ms | $0.25 | MM, Efficient |
| 22 | Command R+ | Cohere | 61.6 | 1155 | 75.7% | 69.6% | 56.7% | 128K | 720ms | $2.50 | Flagship |
Charts (not reproduced here): Overall Scores · Arena ELO Ratings · MATH Benchmark · HumanEval (Coding)
Legend: ■ Green = best in that column · OSS = open source · MM = multimodal · Speed and cost columns sort ascending (lower = better).