| # | Model | Organization | Score | MMLU | HumanEval | MATH | GPQA | Arena ELO | Input $/M | Context |
|---|-------|--------------|-------|------|-----------|------|------|-----------|-----------|---------|
| 1 | o3 | OpenAI | 93.7 | 91.6% | 96.4% | 97.8% | 87.7% | 1391 | $10.00 | 200K |
| 2 | Claude Opus 4 | Anthropic | 90.7 | 93.2% | 95.6% | 86.0% | 74.2% | 1395 | $15.00 | 200K |
| 3 | Claude Sonnet 4.6 | Anthropic | 88.3 | 92.1% | 94.8% | 83.7% | 69.1% | 1374 | $3.00 | 200K |
| 4 | DeepSeek R1 | DeepSeek | 88.2 | 90.8% | 92.1% | 97.3% | 71.5% | 1358 | $0.55 | 64K |
| 5 | Claude Sonnet 4.5 | Anthropic | 86.9 | 91.7% | 94.2% | 81.5% | 67.4% | 1362 | $3.00 | 200K |
| 6 | Gemini 2.5 Pro | Google DeepMind | 85.9 | 90.0% | 87.9% | 91.2% | 59.1% | 1380 | $1.25 | 1.0M |
| 7 | Claude Sonnet 4 | Anthropic | 85.4 | 91.0% | 93.5% | 79.2% | 65.8% | 1345 | $3.00 | 200K |
| 8 | GPT-4.1 | OpenAI | 85.2 | 90.2% | 97.1% | 86.5% | 56.8% | 1340 | $2.00 | 1.0M |
| 9 | DeepSeek V3 | DeepSeek | 81.0 | 88.5% | 89.1% | 87.2% | 51.3% | 1302 | $0.27 | 128K |
| 10 | Claude 3.5 Sonnet | Anthropic | 80.1 | 88.7% | 92.0% | 71.1% | 59.4% | 1289 | $3.00 | 200K |
| 11 | GPT-4o | OpenAI | 79.5 | 88.7% | 90.2% | 76.6% | 53.6% | 1285 | $5.00 | 128K |
| 12 | Llama 4 Maverick | Meta | 78.9 | 88.7% | 89.8% | 74.9% | 52.8% | 1285 | $0.19 | 1.0M |
| 13 | Llama 3.1 405B | Meta | 77.6 | 88.6% | 89.0% | 73.8% | 51.1% | 1266 | $3.00 | 128K |
| 14 | Qwen 2.5 72B | Alibaba | 77.4 | 86.0% | 86.6% | 83.1% | 49.0% | 1259 | $0.40 | 128K |
| 15 | Grok-2 | xAI | 77.3 | 87.5% | 88.4% | 76.1% | 56.0% | 1248 | $2.00 | 131K |
| 16 | Gemini 2.0 Flash | Google DeepMind | 75.7 | 85.0% | 87.4% | 73.0% | 51.0% | 1252 | $0.10 | 1.0M |
| 17 | Gemini 1.5 Pro | Google DeepMind | 74.4 | 85.9% | 84.1% | 67.7% | 46.2% | 1266 | $3.50 | 1.0M |
| 18 | Llama 4 Scout | Meta | 74.1 | 87.1% | 86.5% | 67.4% | 47.1% | 1248 | $0.08 | 10.0M |
| 19 | Mistral Large 2 | Mistral AI | 73.9 | 84.0% | 92.0% | 69.3% | 45.0% | 1232 | $3.00 | 128K |
| 20 | GPT-4o mini | OpenAI | 70.0 | 82.0% | 87.2% | 70.2% | 40.2% | 1179 | $0.15 | 128K |
| 21 | Claude 3 Haiku | Anthropic | 63.2 | 75.2% | 75.9% | 60.4% | 33.3% | 1168 | $0.25 | 200K |
| 22 | Command R+ | Cohere | 61.6 | 75.7% | 69.6% | 56.7% | 38.3% | 1155 | $2.50 | 128K |
How models are ranked
Full methodology →
| Metric | What it measures | Category and weight |
|--------|------------------|---------------------|
| MMLU | Knowledge breadth across 88 university-level subjects | Knowledge, 20% |
| HumanEval | Python function generation from docstrings | Code gen, 20% |
| MATH | Competition mathematics, multi-step symbolic proofs | Reasoning, 15% |
| GPQA | Graduate-level expert science questions | Science, 15% |
| MT-Bench | Multi-turn conversation quality rated by GPT-4 | Dialogue, 15% |
| Arena ELO | Elo from head-to-head human preference battles | Human votes, 15% |
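Given these weights (which sum to 100%), the Score column appears to be a weighted average of the six metrics on a common 0-100 scale. A minimal sketch of that computation, assuming Arena ELO is min-max normalized to 0-100 (the exact normalization, and the MT-Bench values omitted from the table above, are assumptions here):

```python
# Hedged sketch of the composite score: benchmark percentages are used
# directly; Arena ELO must first be rescaled to 0-100. The Elo bounds
# and the MT-Bench placeholder below are assumptions, not source data.

WEIGHTS = {
    "MMLU": 0.20,
    "HumanEval": 0.20,
    "MATH": 0.15,
    "GPQA": 0.15,
    "MT-Bench": 0.15,
    "Arena ELO": 0.15,
}

def normalize_elo(elo: float, lo: float = 1000.0, hi: float = 1450.0) -> float:
    """Min-max scale an Elo rating to 0-100 (bounds are assumed, not published)."""
    return 100.0 * (elo - lo) / (hi - lo)

def composite(metrics: dict[str, float]) -> float:
    """Weighted average of metric values already on a 0-100 scale."""
    return sum(WEIGHTS[name] * value for name, value in metrics.items())

# Example with o3's row from the table (MT-Bench is not shown there,
# so a placeholder value is used):
o3 = {
    "MMLU": 91.6,
    "HumanEval": 96.4,
    "MATH": 97.8,
    "GPQA": 87.7,
    "MT-Bench": 90.0,  # placeholder, not from the table
    "Arena ELO": normalize_elo(1391),
}
print(round(composite(o3), 1))
```

Because the Elo bounds and MT-Bench scores are guesses, this sketch will not exactly reproduce the published Score values; it only illustrates how the stated weights combine.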