# Compare Models

A side-by-side comparison of GPT-4o and Claude 3.5 Sonnet across benchmarks, capabilities, pricing, and performance metrics.

| Metric | GPT-4o (OpenAI) | Claude 3.5 Sonnet (Anthropic) |
|---|---|---|
| Category | Flagship Multimodal | Flagship Multimodal |
| Overall Score | 79.5 | 80.1 |
| Arena ELO | 1285 | 1289 |
| MMLU | 88.7% | 88.7% |
| HumanEval | 90.2% | 92.0% |
| MATH | 76.6% | 71.1% |
| GPQA | 53.6% | 59.4% |
| MT-Bench | 9.0/10 | 9.0/10 |
| Context Window | 128K tokens | 200K tokens |
| Avg Response | 850 ms | 780 ms |
| Input Cost / 1M tokens | $5.00 | $3.00 |
| Output Cost / 1M tokens | $15.00 | $15.00 |
| Open Source | No | No |
| Multimodal | Yes | Yes |
**Overall Winner:** Claude 3.5 Sonnet 🏆, leading by 0.6 points overall (80.1 vs 79.5).
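The pricing rows above determine per-request cost directly. A minimal sketch of that arithmetic, assuming the per-1M-token prices from the table; the `request_cost` helper and the model keys are illustrative, not part of any official API:

```python
# USD prices per 1M tokens, taken from the comparison table above.
PRICES = {
    # model: (input $/1M tokens, output $/1M tokens)
    "gpt-4o": (5.00, 15.00),
    "claude-3.5-sonnet": (3.00, 15.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request for the given token counts."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Example: a 10K-token prompt with a 1K-token completion.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 10_000, 1_000):.4f}")
# gpt-4o: $0.0650
# claude-3.5-sonnet: $0.0450
```

At this request shape, Claude 3.5 Sonnet's lower input price makes it roughly 30% cheaper per call; the gap narrows as the completion grows, since both models charge the same $15.00 per 1M output tokens.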