Side by Side

Compare Models

Select two models to compare their benchmarks, capabilities, pricing, and performance metrics.

[Figure: Capability Radar]

GPT-4o

OpenAI

Flagship, Multimodal

Overall Score: 79.5
Human Votes (Arena ELO): 1285
General Knowledge (MMLU): 88.7%
Coding (HumanEval): 90.2%
Maths (MATH): 76.6%
Science (GPQA): 53.6%
Conversation (MT-Bench): 9.0/10
Context Window: 128K tokens
Avg Response: 850 ms
Input Cost / 1M tokens: $5.00
Output Cost / 1M tokens: $15.00
Free & Open Source: Paid API
Multimodal: ✓ Yes

DeepSeek V3

DeepSeek

Flagship, Open Source

Overall Score: 81.0
Human Votes (Arena ELO): 1302
General Knowledge (MMLU): 88.5%
Coding (HumanEval): 89.1%
Maths (MATH): 87.2%
Science (GPQA): 51.3%
Conversation (MT-Bench): 8.9/10
Context Window: 128K tokens
Avg Response: 680 ms
Input Cost / 1M tokens: $0.27
Output Cost / 1M tokens: $1.10
Free & Open Source: ✓ Free
Multimodal: ✗ No

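The per-1M-token prices listed on the two cards above can be turned into a rough cost estimate for a given workload. The following is a minimal sketch, not part of the comparison tool: the `PRICES` table, the `estimate_cost` helper, and the example workload of 2M input and 500K output tokens are all assumptions made for illustration, and the prices are simply the figures shown on this page, assumed to be USD per 1M tokens.

```python
# Prices in USD per 1M tokens, copied from the two cards on this page.
# The table layout and helper name are hypothetical, for illustration only.
PRICES = {
    "GPT-4o": {"input": 5.00, "output": 15.00},
    "DeepSeek V3": {"input": 0.27, "output": 1.10},
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated API cost in USD for one workload."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 2M input tokens and 500K output tokens.
    for model in PRICES:
        cost = estimate_cost(model, 2_000_000, 500_000)
        print(f"{model}: ${cost:.2f}")
```

Under these assumptions the same workload comes to about $17.50 on GPT-4o and about $1.09 on DeepSeek V3. Real bills depend on the provider's current pricing, which may differ from the figures shown here.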

Highest Overall Score

DeepSeek V3 🏆

DeepSeek V3 scores 81.0 vs 79.5 for GPT-4o, a lead of 1.5 points out of 100.

💡 DeepSeek V3 is free and open source: it has the higher overall score and no subscription cost.
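The overall lead and the per-benchmark picture can be checked directly from the numbers on the two cards. The sketch below is illustrative only: the `SCORES` table, `overall_lead`, and `tally_wins` are hypothetical names, and the page does not say how the overall score is derived from the individual benchmarks, so the code does not attempt to reproduce it.

```python
# Values copied from the two cards, as (GPT-4o, DeepSeek V3); higher is better.
# Names and structure here are assumptions for illustration.
MODELS = ("GPT-4o", "DeepSeek V3")
SCORES = {
    "Overall Score": (79.5, 81.0),
    "Arena ELO": (1285, 1302),
    "MMLU": (88.7, 88.5),
    "HumanEval": (90.2, 89.1),
    "MATH": (76.6, 87.2),
    "GPQA": (53.6, 51.3),
    "MT-Bench": (9.0, 8.9),
}


def overall_lead() -> float:
    """Absolute gap between the two overall scores, to one decimal place."""
    a, b = SCORES["Overall Score"]
    return round(abs(a - b), 1)


def tally_wins() -> dict:
    """Count how many individual benchmarks each model leads."""
    wins = {model: 0 for model in MODELS}
    for name, (a, b) in SCORES.items():
        if name == "Overall Score" or a == b:
            continue  # skip the aggregate row and any ties
        wins[MODELS[0] if a > b else MODELS[1]] += 1
    return wins


if __name__ == "__main__":
    print("Overall lead:", overall_lead())
    print("Benchmark wins:", tally_wins())
```

On the listed figures, GPT-4o leads on four of the six individual benchmarks (MMLU, HumanEval, GPQA, MT-Bench) and DeepSeek V3 on two (Arena ELO, MATH), with DeepSeek V3's MATH margin being the largest single gap.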