Compare Models

Select two models to compare their benchmarks, capabilities, pricing, and performance metrics.

Claude 3.5 Sonnet
Anthropic
Flagship, Multimodal
Overall Score: 80.1
Human Votes (Arena ELO): 1289
General Knowledge (MMLU): 88.7%
Coding (HumanEval): 92.0%
Maths (MATH): 71.1%
Science (GPQA): 59.4%
Conversation (MT-Bench): 9.0/10
Context Window: 200K tokens
Avg Response: 780 ms
Input Cost / 1M tokens: $3.00
Output Cost / 1M tokens: $15.00
Free & Open Source: ✗ No (Paid API)
Multimodal: ✓ Yes
Qwen 2.5 72B
Alibaba
Flagship, Open Source
Overall Score: 77.4
Human Votes (Arena ELO): 1259
General Knowledge (MMLU): 86.0%
Coding (HumanEval): 86.6%
Maths (MATH): 83.1%
Science (GPQA): 49.0%
Conversation (MT-Bench): 8.9/10
Context Window: 128K tokens
Avg Response: 750 ms
Input Cost / 1M tokens: $0.40
Output Cost / 1M tokens: $1.20
Free & Open Source: ✓ Free
Multimodal: ✗ No
Highest Overall Score: Claude 3.5 Sonnet 🏆
Claude 3.5 Sonnet scores 80.1 vs 77.4 for Qwen 2.5 72B, a lead of 2.7 points out of 100.
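The overall lead hides some variation across individual benchmarks. As a rough sketch, the per-metric differences can be recomputed from the figures listed above. The dictionary names and the `deltas` helper are illustrative and not part of the comparison tool. The page does not say how the overall score is derived, so it is treated here as one more listed value.

```python
# Benchmark values copied from the two model cards above.
# Percentages are stored as plain numbers; MT-Bench is on a 0-10 scale.
claude = {
    "Overall": 80.1, "Arena ELO": 1289, "MMLU": 88.7, "HumanEval": 92.0,
    "MATH": 71.1, "GPQA": 59.4, "MT-Bench": 9.0,
}
qwen = {
    "Overall": 77.4, "Arena ELO": 1259, "MMLU": 86.0, "HumanEval": 86.6,
    "MATH": 83.1, "GPQA": 49.0, "MT-Bench": 8.9,
}

def deltas(a, b):
    """Return a[k] - b[k] for every metric, rounded to one decimal place."""
    return {k: round(a[k] - b[k], 1) for k in a}

diff = deltas(claude, qwen)
for metric, d in diff.items():
    if d > 0:
        leader = "Claude 3.5 Sonnet"
    elif d < 0:
        leader = "Qwen 2.5 72B"
    else:
        leader = "tie"
    print(f"{metric:10s} {d:+7.1f}  {leader}")
```

On these figures Claude 3.5 Sonnet leads on every listed metric except MATH, where Qwen 2.5 72B is ahead by 12 percentage points.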
💡 Qwen 2.5 72B is free and open source, so it is worth considering if cost matters.
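To put the cost difference in concrete terms, here is a minimal sketch that applies the listed per-1M-token prices to a workload. The workload size (10M input tokens and 2M output tokens) is a made-up assumption chosen only for illustration, and the `api_cost` function is hypothetical. Self-hosting the open-source model would carry infrastructure costs that are not modelled here.

```python
# Listed prices in USD per 1M tokens, copied from the model cards above.
PRICES = {
    "Claude 3.5 Sonnet": {"input": 3.0, "output": 15.0},
    "Qwen 2.5 72B": {"input": 0.4, "output": 1.2},
}

def api_cost(model, input_tokens, output_tokens):
    """Estimate API cost in USD from token counts and the listed prices."""
    p = PRICES[model]
    return (input_tokens / 1_000_000) * p["input"] \
        + (output_tokens / 1_000_000) * p["output"]

# Hypothetical workload (an assumption, not data from the page).
INPUT_TOKENS = 10_000_000
OUTPUT_TOKENS = 2_000_000

costs = {m: api_cost(m, INPUT_TOKENS, OUTPUT_TOKENS) for m in PRICES}
for model, usd in costs.items():
    print(f"{model}: ${usd:.2f}")

ratio = costs["Claude 3.5 Sonnet"] / costs["Qwen 2.5 72B"]
print(f"Cost ratio: {ratio:.1f}x")
```

For this assumed workload the estimate comes to about $60 for Claude 3.5 Sonnet and about $6.40 for Qwen 2.5 72B. The ratio shifts with the input/output mix, since the output price gap (12.5x) is larger than the input price gap (7.5x).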