Side by Side
Compare Models
Select two models to compare their benchmarks, capabilities, pricing, and performance metrics.
Capability Radar
GPT-4o (OpenAI)
Tags: Flagship, Multimodal
Overall Score: 79.5
Human Votes (Arena ELO): 1285
General Knowledge (MMLU): 88.7%
Coding (HumanEval): 90.2%
Maths (MATH): 76.6%
Science (GPQA): 53.6%
Conversation (MT-Bench): 9.0/10
Context Window: 128K tokens
Avg Response: 850 ms
Input Cost / 1M tokens: $5.00
Output Cost / 1M tokens: $15.00
Free & Open Source: No (Paid API)
Multimodal: ✓ Yes
Claude 3.5 Sonnet (Anthropic)
Tags: Flagship, Multimodal
Overall Score: 80.1
Human Votes (Arena ELO): 1289
General Knowledge (MMLU): 88.7%
Coding (HumanEval): 92.0%
Maths (MATH): 71.1%
Science (GPQA): 59.4%
Conversation (MT-Bench): 9.0/10
Context Window: 200K tokens
Avg Response: 780 ms
Input Cost / 1M tokens: $3.00
Output Cost / 1M tokens: $15.00
Free & Open Source: No (Paid API)
Multimodal: ✓ Yes
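The input and output prices listed for each model can be turned into a rough per-request cost. The sketch below assumes that "1M" means one million tokens and that billing is strictly linear in token count. The `Pricing` class, the `request_cost` helper, and the token counts in the usage lines are hypothetical and only illustrate the arithmetic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD prices per one million tokens (assumed unit for '1M')."""
    input_per_million: float
    output_per_million: float


# Prices copied from the two model cards on this page.
PRICING = {
    "GPT-4o": Pricing(input_per_million=5.0, output_per_million=15.0),
    "Claude 3.5 Sonnet": Pricing(input_per_million=3.0, output_per_million=15.0),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request, assuming linear per-token billing."""
    p = PRICING[model]
    total = (input_tokens * p.input_per_million
             + output_tokens * p.output_per_million)
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical request: 10,000 input tokens and 2,000 output tokens.
    for name in PRICING:
        print(f"{name}: ${request_cost(name, 10_000, 2_000):.4f}")
```

Because the two output prices are identical, any cost difference under these assumptions comes entirely from the input side, so the gap widens for prompt-heavy workloads.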
Highest Overall Score
Claude 3.5 Sonnet 🏆
Scores 80.1 vs 79.5, a lead of 0.6 points out of 100
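The 0.6-point lead is a plain subtraction of the two Overall Scores, and the same comparison can be repeated for each benchmark on the page. The following is a minimal sketch using the figures listed above. The `SCORES` table and the `compare` helper are hypothetical names, and the page does not say how the Overall Score itself is derived, so it is treated here as a given number.

```python
# Figures copied from the two model cards; percentages are stored as plain numbers.
SCORES = {
    "GPT-4o": {
        "Overall Score": 79.5, "Arena ELO": 1285, "MMLU": 88.7,
        "HumanEval": 90.2, "MATH": 76.6, "GPQA": 53.6, "MT-Bench": 9.0,
    },
    "Claude 3.5 Sonnet": {
        "Overall Score": 80.1, "Arena ELO": 1289, "MMLU": 88.7,
        "HumanEval": 92.0, "MATH": 71.1, "GPQA": 59.4, "MT-Bench": 9.0,
    },
}


def compare(a: str, b: str) -> dict:
    """Return {metric: (leader, margin)}; leader is 'tie' when values match."""
    result = {}
    for metric, a_val in SCORES[a].items():
        b_val = SCORES[b][metric]
        # Round to one decimal to hide floating-point noise in the difference.
        diff = round(a_val - b_val, 1)
        if diff > 0:
            leader = a
        elif diff < 0:
            leader = b
        else:
            leader = "tie"
        result[metric] = (leader, abs(diff))
    return result


if __name__ == "__main__":
    for metric, (leader, margin) in compare("GPT-4o", "Claude 3.5 Sonnet").items():
        print(f"{metric}: {leader} (margin {margin})")
```

On these figures the overall ranking hides a split: Claude 3.5 Sonnet leads on HumanEval, GPQA, and Arena ELO, GPT-4o leads on MATH, and the two tie on MMLU and MT-Bench.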