Side by Side
Compare Models
Select two models to compare their benchmarks, capabilities, pricing, and performance metrics.
[Chart: Capability Radar]
Claude 3.5 Sonnet (Anthropic)
Flagship, Multimodal
Overall Score: 80.1
Human Votes (Arena ELO): 1289
General Knowledge (MMLU): 88.7%
Coding (HumanEval): 92.0%
Maths (MATH): 71.1%
Science (GPQA): 59.4%
Conversation (MT-Bench): 9.0/10
Context Window: 200K tokens
Avg Response: 780 ms
Input Cost / 1M tokens: $3.00
Output Cost / 1M tokens: $15.00
Free & Open Source: Paid API
Multimodal: ✓ Yes
Qwen 2.5 72B (Alibaba)
Flagship, Open source
Overall Score: 77.4
Human Votes (Arena ELO): 1259
General Knowledge (MMLU): 86.0%
Coding (HumanEval): 86.6%
Maths (MATH): 83.1%
Science (GPQA): 49.0%
Conversation (MT-Bench): 8.9/10
Context Window: 128K tokens
Avg Response: 750 ms
Input Cost / 1M tokens: $0.40
Output Cost / 1M tokens: $1.20
Free & Open Source: ✓ Free
Multimodal: ✗ No
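The per-benchmark figures for the two models can also be compared programmatically. The following is a minimal Python sketch, not part of the comparison page itself: the dictionary layout and the `compare` helper are hypothetical names chosen for illustration, the numbers are copied from the two model cards, and every metric is assumed to be "higher is better".

```python
# Benchmark figures copied from the two model cards.
# The data layout and the helper name `compare` are illustrative assumptions.
CLAUDE = {"Arena ELO": 1289, "MMLU": 88.7, "HumanEval": 92.0,
          "MATH": 71.1, "GPQA": 59.4, "MT-Bench": 9.0}
QWEN = {"Arena ELO": 1259, "MMLU": 86.0, "HumanEval": 86.6,
        "MATH": 83.1, "GPQA": 49.0, "MT-Bench": 8.9}


def compare(a, b, name_a, name_b):
    """Return {metric: (leader, margin)}, assuming higher is better."""
    result = {}
    for metric in a:
        # Round to one decimal to avoid floating-point noise such as 2.6999...
        diff = round(a[metric] - b[metric], 1)
        if diff > 0:
            leader = name_a
        elif diff < 0:
            leader = name_b
        else:
            leader = "tie"
        result[metric] = (leader, abs(diff))
    return result


if __name__ == "__main__":
    rows = compare(CLAUDE, QWEN, "Claude 3.5 Sonnet", "Qwen 2.5 72B")
    for metric, (leader, margin) in rows.items():
        print(f"{metric}: {leader} leads by {margin}")
```

On these figures, Qwen 2.5 72B leads only on MATH (by 12.0 points); Claude 3.5 Sonnet leads on the other five metrics.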
Highest Overall Score
Claude 3.5 Sonnet 🏆
Claude 3.5 Sonnet scores 80.1 vs 77.4 for Qwen 2.5 72B, a lead of 2.7 points out of 100.
💡 Qwen 2.5 72B is free and open source, so it is worth considering if cost matters.
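To make the cost point concrete, here is a small Python sketch that estimates what a workload would cost at the listed prices. It assumes the "/ 1M" figures are USD per one million tokens and apply to hosted API usage; the `PRICES` table, the `workload_cost` helper, and the example workload size are all hypothetical and chosen only for illustration.

```python
# Listed prices, assumed to be USD per 1M tokens (copied from the model cards).
PRICES = {
    "Claude 3.5 Sonnet": {"input": 3.0, "output": 15.0},
    "Qwen 2.5 72B": {"input": 0.4, "output": 1.2},
}


def workload_cost(model, input_tokens, output_tokens):
    """Estimated USD cost of a workload at the listed per-1M-token prices."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 10M input tokens and 2M output tokens.
    for model in PRICES:
        cost = workload_cost(model, 10_000_000, 2_000_000)
        print(f"{model}: ${cost:.2f}")
```

For that hypothetical workload, the listed prices work out to roughly $60.00 for Claude 3.5 Sonnet and $6.40 for Qwen 2.5 72B. Self-hosting the open-source model would change the picture again, since the cost then depends on hardware rather than per-token pricing.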