Side by Side

Compare Models

Select two models to compare their benchmarks, capabilities, pricing, and performance metrics.

Claude 3.5 Sonnet (Anthropic)
Flagship, Multimodal

Overall Score: 80.1
Human Votes (Arena ELO): 1289
General Knowledge (MMLU): 88.7%
Coding (HumanEval): 92.0%
Maths (MATH): 71.1%
Science (GPQA): 59.4%
Conversation (MT-Bench): 9.0/10
Context Window: 200K
Avg Response: 780ms
Input Cost / 1M: $3.0
Output Cost / 1M: $15.0
Free & Open Source: Paid API
Multimodal: ✓ Yes

Qwen 2.5 72B (Alibaba)
Flagship, Open source

Overall Score: 77.4
Human Votes (Arena ELO): 1259
General Knowledge (MMLU): 86.0%
Coding (HumanEval): 86.6%
Maths (MATH): 83.1%
Science (GPQA): 49.0%
Conversation (MT-Bench): 8.9/10
Context Window: 128K
Avg Response: 750ms
Input Cost / 1M: $0.4
Output Cost / 1M: $1.2
Free & Open Source: ✓ Free
Multimodal: ✗ No

Highest Overall Score

Claude 3.5 Sonnet 🏆

Scores 80.1 vs 77.4, a lead of 2.7 points out of 100.

💡 Qwen 2.5 72B is free & open source, so it is worth considering if cost matters.
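
To put the score lead and the cost note side by side, here is a minimal Python sketch that recomputes the 2.7-point margin and estimates spend from the listed prices. The ModelCard class, the workload_cost helper, and the workload sizes (10M input tokens, 2M output tokens) are hypothetical and chosen only for illustration; the sketch also assumes the "/ 1M" prices are US dollars per one million tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelCard:
    """Hypothetical container for the figures listed on this page."""
    name: str
    overall_score: float
    input_cost_per_m: float   # assumed: USD per 1M input tokens
    output_cost_per_m: float  # assumed: USD per 1M output tokens


def workload_cost(card: ModelCard, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of a workload at the card's listed prices."""
    total = (input_tokens * card.input_cost_per_m
             + output_tokens * card.output_cost_per_m)
    return total / 1_000_000


# Figures copied from the comparison above.
claude = ModelCard("Claude 3.5 Sonnet", 80.1, 3.0, 15.0)
qwen = ModelCard("Qwen 2.5 72B", 77.4, 0.4, 1.2)

# Hypothetical workload, not taken from the page.
INPUT_TOKENS = 10_000_000
OUTPUT_TOKENS = 2_000_000

score_lead = round(claude.overall_score - qwen.overall_score, 1)
claude_cost = workload_cost(claude, INPUT_TOKENS, OUTPUT_TOKENS)
qwen_cost = workload_cost(qwen, INPUT_TOKENS, OUTPUT_TOKENS)

print(f"Score lead: {score_lead} points")
print(f"{claude.name}: ${claude_cost:.2f}")
print(f"{qwen.name}: ${qwen_cost:.2f}")
```

For this made-up workload the listed prices work out to roughly $60.00 versus $6.40, so the 2.7-point score lead comes with about a ninefold difference in spend. Different input/output ratios will shift that factor.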