Open-Source Model Leaderboard

| # | Model | Org | Score | MMLU | MATH | HumanEval | GPQA | Arena ELO | Tags |
|---|-------|-----|-------|------|------|-----------|------|-----------|------|
| 1 | DeepSeek R1 | DeepSeek | 88.2 | 90.8% | 97.3% | 92.1% | 71.5% | 1358 | OSS Flagship |
| 2 | Qwen3 72B | Alibaba | 84.9 | 88.0% | 90.0% | 92.5% | 65.4% | 1320 | OSS Flagship |
| 3 | DeepSeek V3.1 671B | DeepSeek | 83.6 | 89.0% | 87.0% | 91.0% | 62.0% | 1310 | OSS Flagship |
| 4 | Qwen3.5 122B | Alibaba | 82.2 | 87.0% | 89.0% | 94.0% | 57.0% | 1280 | OSS Flagship |
| 5 | Mistral Large 3 675B | Mistral AI | 82.1 | 88.0% | 85.0% | 90.0% | 60.0% | 1295 | OSS Flagship |
| 6 | DeepSeek V3 | DeepSeek | 81.0 | 88.5% | 87.2% | 89.1% | 51.3% | 1302 | OSS Flagship |
| 7 | GPT-OSS 120B | OpenAI | 80.9 | 87.0% | 84.0% | 90.0% | 58.0% | 1285 | OSS Flagship |
| 8 | Llama 4 Maverick | Meta | 78.9 | 88.7% | 74.9% | 89.8% | 52.8% | 1285 | OSS MM Flagship |
| 9 | DeepSeek R1 Distill 70B | DeepSeek | 78.5 | 84.0% | 90.0% | 86.0% | 55.0% | 1250 | OSS Flagship |
| 10 | Qwen3.5 35B | Alibaba | 77.9 | 83.0% | 86.0% | 92.0% | 50.0% | 1245 | OSS Flagship |
| 11 | Llama 3.1 405B | Meta | 77.6 | 88.6% | 73.8% | 89.0% | 51.1% | 1266 | OSS Flagship |
| 12 | Devstral 2 123B | Mistral AI | 77.5 | 82.0% | 79.0% | 94.0% | 50.0% | 1255 | OSS Flagship |
| 13 | Qwen 2.5 72B | Alibaba | 77.4 | 86.0% | 83.1% | 86.6% | 49.0% | 1259 | OSS Flagship |
| 14 | Phi-4 | Microsoft | 77.4 | 84.8% | 80.4% | 82.6% | 56.1% | 1280 | OSS Efficient |
| 15 | Llama 3.3 70B | Meta | 77.0 | 86.0% | 77.0% | 88.0% | 50.5% | 1256 | OSS Flagship |
| 16 | Qwen3 14B | Alibaba | 76.3 | 82.0% | 88.0% | 87.0% | 50.0% | 1230 | OSS Efficient |
| 17 | DeepSeek V2.5 236B | DeepSeek | 76.2 | 80.4% | 75.7% | 89.0% | 50.7% | 1268 | OSS Flagship |
| 18 | Gemma 3 27B | Google DeepMind | 75.6 | 87.5% | 72.0% | 77.2% | 52.9% | 1290 | OSS MM Efficient |
| 19 | Qwen3 Coder 30B | Alibaba | 75.2 | 78.0% | 82.0% | 93.0% | 44.0% | 1240 | OSS Flagship |
| 20 | Qwen3 VL 32B | Alibaba | 74.6 | 81.0% | 85.0% | 84.0% | 48.0% | 1225 | OSS MM Flagship |
| 21 | Llama 4 Scout | Meta | 74.1 | 87.1% | 67.4% | 86.5% | 47.1% | 1248 | OSS MM Efficient |
| 22 | Mistral Large 2 | Mistral AI | 73.9 | 84.0% | 69.3% | 92.0% | 45.0% | 1232 | OSS Flagship |
| 23 | Cogito 70B | DeepCogito | 72.9 | 82.0% | 75.0% | 82.0% | 48.0% | 1230 | OSS Flagship |
| 24 | Qwen 2.5 14B | Alibaba | 72.6 | 79.5% | 83.0% | 86.0% | 42.0% | 1210 | OSS Efficient |
| 25 | Llama 3.2 Vision 90B | Meta | 71.6 | 83.0% | 69.0% | 81.0% | 46.0% | 1228 | OSS MM Flagship |
| 26 | Llama 3.1 70B | Meta | 71.2 | 83.6% | 66.4% | 80.5% | 46.7% | 1220 | OSS Flagship |
| 27 | Nemotron 3 Nano 30B | NVIDIA | 69.0 | 78.0% | 72.0% | 80.0% | 40.0% | 1210 | OSS Efficient |
| 28 | Qwen 2.5 7B | Alibaba | 68.7 | 74.2% | 80.0% | 84.5% | 36.0% | 1185 | OSS Efficient |
| 29 | Devstral Small 2 24B | Mistral AI | 67.9 | 74.0% | 68.0% | 88.0% | 36.0% | 1195 | OSS Efficient |
| 30 | Gemma 3 12B | Google DeepMind | 67.6 | 78.0% | 72.0% | 76.0% | 38.0% | 1200 | OSS MM Efficient |
| 31 | GLM 4.7 | Zhipu AI | 66.0 | 75.0% | 70.0% | 76.0% | 36.0% | 1200 | OSS Efficient |
| 32 | Qwen3 VL 8B | Alibaba | 64.8 | 72.0% | 72.0% | 78.0% | 34.0% | 1175 | OSS MM Efficient |
| 33 | Ministral 3 14B | Mistral AI | 64.0 | 73.0% | 62.0% | 78.0% | 35.0% | 1185 | OSS Efficient |
| 34 | OLMo 3 32B | AllenAI | 64.0 | 75.0% | 62.0% | 74.0% | 35.0% | 1195 | OSS Efficient |
| 35 | Mixtral 8x7B | Mistral AI | 63.0 | 70.6% | 58.0% | 75.1% | 35.0% | 1191 | OSS Flagship |
| 36 | Phi-3.5 Mini | Microsoft | 61.9 | 69.0% | 69.0% | 78.0% | 30.0% | 1150 | OSS Efficient |
| 37 | Gemma 2 9B | Google DeepMind | 61.7 | 71.3% | 58.0% | 71.0% | 33.0% | 1190 | OSS Efficient |
| 38 | Mathstral 7B | Mistral AI | 61.4 | 64.0% | 86.0% | 60.0% | 38.0% | 1165 | OSS Efficient |
| 39 | Llama 3.2 Vision 11B | Meta | 61.2 | 73.0% | 58.0% | 72.0% | 32.0% | 1175 | OSS MM Efficient |
| 40 | Mistral Nemo 12B | Mistral AI | 60.9 | 68.0% | 55.0% | 75.0% | 33.0% | 1180 | OSS Efficient |
| 41 | Granite Code 34B | IBM | 60.6 | 60.0% | 56.0% | 86.0% | 28.0% | 1180 | OSS Flagship |
| 42 | Llama 3.1 8B | Meta | 60.5 | 73.0% | 51.9% | 72.6% | 32.8% | 1170 | OSS Efficient |
| 43 | Ministral 3 8B | Mistral AI | 58.2 | 67.0% | 52.0% | 74.0% | 30.0% | 1155 | OSS Efficient |
| 44 | Gemma 3 4B | Google DeepMind | 58.1 | 68.0% | 62.0% | 64.0% | 30.0% | 1160 | OSS MM Efficient |
| 45 | CodeGemma 7B | Google DeepMind | 55.4 | 54.0% | 50.0% | 82.0% | 25.0% | 1145 | OSS Efficient |
| 46 | Mistral 7B | Mistral AI | 54.8 | 64.2% | 40.5% | 73.0% | 28.8% | 1141 | OSS Efficient |
| 47 | OLMo 3 7B | AllenAI | 53.4 | 65.0% | 45.0% | 64.0% | 27.0% | 1140 | OSS Efficient |
| 48 | Granite Code 8B | IBM | 50.3 | 51.0% | 40.0% | 75.0% | 20.0% | 1130 | OSS Efficient |
| 49 | Llama 3.2 3B | Meta | 49.8 | 63.4% | 40.0% | 58.0% | 24.0% | 1120 | OSS Efficient |
| 50 | Ministral 3 3B | Mistral AI | 48.5 | 61.0% | 42.0% | 55.0% | 23.0% | 1115 | OSS Efficient |
| 51 | Llama 3.2 1B | Meta | 35.8 | 49.3% | 25.0% | 38.0% | 15.0% | 1070 | OSS Efficient |
| 52 | Gemma 3 1B | Google DeepMind | 35.5 | 44.0% | 32.0% | 40.0% | 18.0% | 1050 | OSS Efficient |

Score is a weighted composite of MMLU, HumanEval, MATH, GPQA, MT-Bench and Arena ELO.
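The exact weighting isn't published on this page, but a composite of this shape is easy to compute. A minimal sketch, with hypothetical weights and normalisation (the real methodology may differ; MT-Bench and Arena ELO must first be mapped onto the same 0–100 scale as the benchmark percentages):

```python
def composite_score(mmlu, human_eval, math, gpqa, mt_bench, arena_elo,
                    weights=(0.25, 0.20, 0.20, 0.15, 0.10, 0.10)):
    """Weighted composite on a 0-100 scale. The weights and normalisation
    here are illustrative guesses, not the leaderboard's actual formula."""
    mt_norm = mt_bench * 10             # MT-Bench is scored 0-10
    elo_norm = (arena_elo - 1000) / 4   # map the ~1000-1400 ELO band onto 0-100
    parts = (mmlu, human_eval, math, gpqa, mt_norm, elo_norm)
    return round(sum(w * p for w, p in zip(weights, parts)), 1)
```

Feeding in DeepSeek R1's published numbers (with an assumed MT-Bench of 9.0) lands in the high 80s under these example weights, the same ballpark as its listed score, but the true weights would need the methodology page.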

GPU Specifications

| GPU | Architecture | Class | VRAM | Memory | Bandwidth | MSRP | Notes |
|-----|--------------|-------|------|--------|-----------|------|-------|
| NVIDIA DGX Spark | GB10 Grace Blackwell | Workstation | 128GB | LPDDR5X Unified | 273 GB/s | $3,999 | Up to 123B models natively; largest single-machine OSS runtime |
| GeForce RTX 5090 | Blackwell GB202 | Enthusiast | 32GB | GDDR7 | 1792 GB/s | $1,999 | Up to 14B natively; 70B class with partial CPU offload |
| GeForce RTX 5080 | Blackwell GB203 | High-End | 16GB | GDDR7 | 960 GB/s | $999 | Up to 14B at Q4; ideal for 7–8B class at full speed |
| GeForce RTX 4090 | Ada Lovelace AD102 | Enthusiast | 24GB | GDDR6X | 1008 GB/s | $1,599 | Up to 14B natively; still the most popular enthusiast LLM GPU |
| GeForce RTX 4080 Super | Ada Lovelace AD103 | High-End | 16GB | GDDR6X | 736 GB/s | $999 | Up to 14B at Q4; solid price-to-performance choice |
| GeForce RTX 3090 | Ampere GA102 | Enthusiast | 24GB | GDDR6X | 936 GB/s | $699 | Up to 14B natively; excellent second-hand value at 24GB |
| GeForce RTX 3080 Ti | Ampere GA102 | High-End | 12GB | GDDR6X | 912 GB/s | $450 | 7–8B models at good speed; tight but functional for 14B at Q4 |

Performance Matrix (tokens / second)

Legend
- Native — fits in VRAM without quantisation
- Quantised — fits with Q4 or lower quantisation
- Offload — partial CPU RAM offloading required
- Server Only — requires multi-GPU server hardware
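These four tiers follow almost mechanically from model size, quantisation width, and VRAM. A rough sketch of the classification for dense models (the byte-per-parameter figures and the offload ceiling are illustrative approximations, not llama.cpp's actual allocator logic):

```python
def runtime_tier(params_b: float, vram_gb: float, overhead_gb: float = 2.0) -> str:
    """Classify how a dense model of `params_b` billion parameters runs
    on a GPU with `vram_gb` of VRAM, mirroring the legend above."""
    q8_gb = params_b * 1.0 + overhead_gb   # ~1 byte per param at Q8
    q4_gb = params_b * 0.6 + overhead_gb   # ~0.6 bytes per param at Q4_K_M
    if q8_gb <= vram_gb:
        return "Native"
    if q4_gb <= vram_gb:
        return "Quantised"
    if q4_gb <= vram_gb * 3:               # rough practical limit for CPU offload
        return "Offload"
    return "Server Only"
```

For example, `runtime_tier(14, 24)` gives "Native" (a 14B model on a 24GB RTX 4090), `runtime_tier(27, 24)` gives "Quantised" (Gemma 3 27B at Q4 on an RTX 3090), and `runtime_tier(70, 12)` gives "Server Only", matching the rows below.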
| Model | Org · Params | DGX Spark 128GB | RTX 5090 32GB | RTX 5080 16GB | RTX 4090 24GB | RTX 4080 Super 16GB | RTX 3090 24GB | RTX 3080 Ti 12GB |
|-------|--------------|-----------------|---------------|---------------|---------------|---------------------|---------------|------------------|
| Gemma 3 1B | Google DeepMind · 1B Dense | 550t/s Q8 | 500t/s Q8 | 420t/s Q8 | 400t/s Q8 | 340t/s Q8 | 320t/s Q8 | 300t/s Q8 |
| Llama 3.2 1B | Meta · 1B Dense | 550t/s Q8 | 500t/s Q8 | 420t/s Q8 | 400t/s Q8 | 340t/s Q8 | 320t/s Q8 | 300t/s Q8 |
| Llama 3.2 3B | Meta · 3B Dense | 350t/s Q8 | 280t/s Q8 | 240t/s Q8 | 230t/s Q8 | 190t/s Q8 | 180t/s Q8 | 165t/s Q8 |
| Ministral 3 3B | Mistral AI · 3B Dense | 350t/s Q8 | 280t/s Q8 | 240t/s Q8 | 230t/s Q8 | 190t/s Q8 | 180t/s Q8 | 165t/s Q8 |
| Phi-3.5 Mini 3.8B | Microsoft · 3.8B Dense | 300t/s Q8 | 240t/s Q8 | 200t/s Q8 | 195t/s Q8 | 160t/s Q8 | 152t/s Q8 | 140t/s Q8 |
| Gemma 3 4B | Google DeepMind · 4B Dense | 300t/s Q8 | 240t/s Q8 | 200t/s Q8 | 195t/s Q8 | 160t/s Q8 | 152t/s Q8 | 140t/s Q8 |
| Mistral 7B | Mistral AI · 7B Dense | 185t/s Q8 | 140t/s Q8 | 112t/s Q8 | 108t/s Q8 | 90t/s Q8 | 85t/s Q8 | 75t/s Q8 |
| Mathstral 7B | Mistral AI · 7B Dense | 185t/s Q8 | 140t/s Q8 | 112t/s Q8 | 108t/s Q8 | 90t/s Q8 | 85t/s Q8 | 75t/s Q8 |
| Qwen 2.5 7B | Alibaba · 7B Dense | 185t/s Q8 | 140t/s Q8 | 112t/s Q8 | 108t/s Q8 | 90t/s Q8 | 85t/s Q8 | 75t/s Q8 |
| OLMo 3 7B | AllenAI · 7B Dense | 185t/s Q8 | 140t/s Q8 | 112t/s Q8 | 108t/s Q8 | 90t/s Q8 | 85t/s Q8 | 75t/s Q8 |
| CodeGemma 7B | Google DeepMind · 7B Dense | 185t/s Q8 | 140t/s Q8 | 112t/s Q8 | 108t/s Q8 | 90t/s Q8 | 85t/s Q8 | 75t/s Q8 |
| Llama 3.1 8B | Meta · 8B Dense | 185t/s Q8 | 135t/s Q8 | 112t/s Q8 | 108t/s Q8 | 90t/s Q8 | 85t/s Q8 | 72t/s Q4_K_M |
| Ministral 3 8B | Mistral AI · 8B Dense | 175t/s Q8 | 130t/s Q8 | 106t/s Q8 | 102t/s Q8 | 85t/s Q8 | 80t/s Q8 | 70t/s Q8 |
| Granite Code 8B | IBM · 8B Dense | 175t/s Q8 | 130t/s Q8 | 106t/s Q8 | 102t/s Q8 | 85t/s Q8 | 80t/s Q8 | 70t/s Q8 |
| Qwen3 VL 8B | Alibaba · 8B Dense | 175t/s Q8 | 130t/s Q8 | 106t/s Q8 | 102t/s Q8 | 85t/s Q8 | 80t/s Q8 | 70t/s Q8 |
| GLM 4.7 | Zhipu AI · 9B Dense | 160t/s Q8 | 118t/s Q8 | 99t/s Q8 | 95t/s Q8 | 79t/s Q8 | 74t/s Q8 | 65t/s Q8 |
| Gemma 2 9B | Google DeepMind · 9B Dense | 160t/s Q8 | 118t/s Q8 | 99t/s Q8 | 95t/s Q8 | 79t/s Q8 | 74t/s Q8 | 65t/s Q8 |
| Llama 3.2 Vision 11B | Meta · 11B Dense | 140t/s Q8 | 105t/s Q8 | 88t/s Q8 | 84t/s Q8 | 70t/s Q8 | 65t/s Q8 | 67t/s Q4_K_M |
| Mistral Nemo 12B | Mistral AI · 12B Dense | 135t/s Q8 | 100t/s Q8 | 84t/s Q8 | 80t/s Q8 | 67t/s Q8 | 62t/s Q8 | 62t/s Q4_K_M |
| Gemma 3 12B | Google DeepMind · 12B Dense | 135t/s Q8 | 100t/s Q8 | 84t/s Q8 | 80t/s Q8 | 67t/s Q8 | 62t/s Q8 | 62t/s Q4_K_M |
| Phi-4 14B | Microsoft · 14B Dense | 130t/s Q8 | 92t/s Q8 | 70t/s Q4_K_M | 74t/s Q8 | 60t/s Q4_K_M | 60t/s Q8 | 42t/s Q4_K_M |
| Qwen 2.5 14B | Alibaba · 14B Dense | 125t/s Q8 | 90t/s Q8 | 83t/s Q4_K_M | 74t/s Q8 | 69t/s Q4_K_M | 60t/s Q8 | 48t/s Q4_K_M |
| Qwen3 14B | Alibaba · 14B Dense | 125t/s Q8 | 90t/s Q8 | 83t/s Q4_K_M | 74t/s Q8 | 69t/s Q4_K_M | 60t/s Q8 | 48t/s Q4_K_M |
| Ministral 3 14B | Mistral AI · 14B Dense | 125t/s Q8 | 90t/s Q8 | 83t/s Q4_K_M | 74t/s Q8 | 69t/s Q4_K_M | 60t/s Q8 | 48t/s Q4_K_M |
| Devstral Small 2 24B | Mistral AI · 24B Dense | 75t/s Q8 | 52t/s Q8 | 44t/s Q4_K_M | 48t/s Q4_K_M | 37t/s Q4_K_M | 44t/s Q4_K_M | 3t/s Q4_K_M |
| Gemma 3 27B | Google DeepMind · 27B Dense | 70t/s Q8 | 55t/s Q4_K_M | 7t/s Q4_K_M | 44t/s Q4_K_M | 5t/s Q4_K_M | 39t/s Q4_K_M | 2t/s Q4_K_M |
| Qwen3 Coder 30B | Alibaba · 30B Dense | 62t/s Q8 | 51t/s Q4_K_M | 5t/s Q4_K_M | 39t/s Q4_K_M | 5t/s Q4_K_M | 34t/s Q4_K_M | 2t/s Q4_K_M |
| Nemotron 3 Nano 30B | NVIDIA · 30B Dense | 62t/s Q8 | 51t/s Q4_K_M | 5t/s Q4_K_M | 39t/s Q4_K_M | 5t/s Q4_K_M | 34t/s Q4_K_M | 2t/s Q4_K_M |
| Qwen3 VL 32B | Alibaba · 32B Dense | 58t/s Q8 | 48t/s Q4_K_M | 5t/s Q4_K_M | 37t/s Q4_K_M | 4t/s Q4_K_M | 32t/s Q4_K_M | 1t/s Q4_K_M |
| OLMo 3 32B | AllenAI · 32B Dense | 58t/s Q8 | 48t/s Q4_K_M | 5t/s Q4_K_M | 37t/s Q4_K_M | 4t/s Q4_K_M | 32t/s Q4_K_M | 1t/s Q4_K_M |
| Granite Code 34B | IBM · 34B Dense | 55t/s Q8 | 46t/s Q4_K_M | 4t/s Q4_K_M | 34t/s Q4_K_M | 3t/s Q4_K_M | 30t/s Q4_K_M | 1t/s Q4_K_M |
| Qwen3.5 35B | Alibaba · 35B Dense | 54t/s Q8 | 44t/s Q4_K_M | 4t/s Q4_K_M | 33t/s Q4_K_M | 3t/s Q4_K_M | 29t/s Q4_K_M | 1t/s Q4_K_M |
| Mixtral 8x7B | Mistral AI · 46B MoE (12B active) | 150t/s Q8 | 110t/s Q8 | 28t/s Q4_K_M | 82t/s Q4_K_M | 24t/s Q4_K_M | 72t/s Q4_K_M | 8t/s Q4_K_M |
| Llama 3.1 70B | Meta · 70B Dense | 55t/s Q8 | 3t/s Q4_K_M | Server Only | 1t/s Q4_K_M | Server Only | 1t/s Q4_K_M | Server Only |
| Llama 3.3 70B | Meta · 70B Dense | 55t/s Q8 | 3t/s Q4_K_M | Server Only | 1t/s Q4_K_M | Server Only | 1t/s Q4_K_M | Server Only |
| DeepSeek R1 Distill 70B | DeepSeek · 70B Dense | 55t/s Q8 | 3t/s Q4_K_M | Server Only | 1t/s Q4_K_M | Server Only | 1t/s Q4_K_M | Server Only |
| Cogito 70B | DeepCogito · 70B Dense | 55t/s Q8 | 3t/s Q4_K_M | Server Only | 1t/s Q4_K_M | Server Only | 1t/s Q4_K_M | Server Only |
| Qwen 2.5 72B | Alibaba · 72B Dense | 55t/s Q4_K_M | 20t/s Q4_K_M | 6t/s Q4_K_M | 12t/s Q4_K_M | 5t/s Q4_K_M | 9t/s Q4_K_M | 2t/s Q4_K_M |
| Qwen3 72B | Alibaba · 72B Dense | 55t/s Q8 | 3t/s Q4_K_M | Server Only | 1t/s Q4_K_M | Server Only | 1t/s Q4_K_M | Server Only |
| Llama 3.2 Vision 90B | Meta · 90B Dense | 38t/s Q8 | 1t/s Q4_K_M | Server Only | 1t/s Q4_K_M | Server Only | 1t/s Q4_K_M | Server Only |
| Llama 4 Scout | Meta · 109B MoE (17B active) | 48t/s Q4_K_M | 22t/s Q4_K_M | 7t/s Q4_K_M | 14t/s Q4_K_M | 6t/s Q4_K_M | 10t/s Q4_K_M | Server Only |
| GPT-OSS 120B | OpenAI · 120B MoE (5.1B active) | 40t/s Q4_K_M | 1t/s Q4_K_M | Server Only | Server Only | Server Only | Server Only | Server Only |
| Qwen3.5 122B | Alibaba · 122B Dense | 40t/s Q4_K_M | 1t/s Q4_K_M | Server Only | Server Only | Server Only | Server Only | Server Only |
| Mistral Large 2 | Mistral AI · 123B Dense | 35t/s Q4_K_M | 5t/s Q4_K_M | 2t/s Q4_K_M | 3t/s Q4_K_M | 1t/s Q4_K_M | 2t/s Q4_K_M | Server Only |
| Devstral 2 123B | Mistral AI · 123B Dense | 40t/s Q4_K_M | 1t/s Q4_K_M | Server Only | Server Only | Server Only | Server Only | Server Only |
| DeepSeek V2.5 236B | DeepSeek · 236B MoE (21B active) | 18t/s IQ2_XS | 3t/s Q4_K_M | Server Only | Server Only | Server Only | Server Only | Server Only |
| Llama 4 Maverick | Meta · 400B MoE (17B active) | 8t/s IQ2_XS | Server Only | Server Only | Server Only | Server Only | Server Only | Server Only |
| Llama 3.1 405B | Meta · 405B Dense | Server Only | Server Only | Server Only | Server Only | Server Only | Server Only | Server Only |
| DeepSeek R1 | DeepSeek · 671B MoE (37B active) | Server Only | Server Only | Server Only | Server Only | Server Only | Server Only | Server Only |
| DeepSeek V3.1 671B | DeepSeek · 671B MoE (37B active) | Server Only | Server Only | Server Only | Server Only | Server Only | Server Only | Server Only |
| Mistral Large 3 675B | Mistral AI · 675B Dense | Server Only | Server Only | Server Only | Server Only | Server Only | Server Only | Server Only |
| DeepSeek V3 | DeepSeek · 685B MoE (37B active) | Server Only | Server Only | Server Only | Server Only | Server Only | Server Only | Server Only |

Figures are approximate generation throughput (output tokens/second) measured with llama.cpp on a system with 64GB DDR5 RAM and NVMe storage. CPU offload speeds are highly dependent on system RAM bandwidth. DGX Spark figures are from NVIDIA's unified-memory architecture which eliminates the VRAM/RAM split. Results may vary by batch size, context length, and driver version.
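These figures track memory bandwidth closely: at batch size 1, every generated token must stream all of the model's (active) weights through the GPU once, so bandwidth divided by model size gives a hard ceiling on decode speed. A back-of-envelope estimator (assumes a purely bandwidth-bound decode, which is a reasonable approximation for single-stream inference):

```python
def decode_tps_ceiling(bandwidth_gb_s: float, params_b: float,
                       bytes_per_param: float = 0.6) -> float:
    """Upper bound on tokens/second for batch-1 decoding:
    bandwidth / model bytes. For MoE models, count active params only."""
    model_gb = params_b * bytes_per_param   # ~0.6 bytes/param at Q4_K_M
    return bandwidth_gb_s / model_gb
```

For example, an RTX 5080 (960 GB/s) on a 14B model at Q4 gives a ceiling of roughly 114 t/s, against ~83 t/s in the matrix above once compute and sampling overheads are paid.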

Which GPU is Right for You?

Budget Pick
RTX 3090 · ~$699 used

24GB GDDR6X makes it surprisingly capable for open-source LLMs. Runs Phi-4 14B natively at Q8 and handles Qwen 2.5 72B with partial CPU offload. The best second-hand value for local AI.

Best Value New
RTX 5080 · $999

GDDR7 delivers roughly 30% more bandwidth than the 4080 Super (960 vs 736 GB/s) at the same price. Runs 14B models at Q4 quantisation and handles 7–8B class models at full speed. A strong all-round choice for 2025.

Enthusiast Pick
RTX 5090 · $1,999

32GB of GDDR7 at 1792 GB/s, the highest memory bandwidth of any consumer GPU. Runs Qwen 2.5 72B at usable speeds with partial offload and handles 14B models faster than any other single card. The definitive consumer AI GPU.

Workstation Pick
NVIDIA DGX Spark · $3,999

128GB of unified memory is the game-changer. Runs Mistral Large 2 (123B) and Llama 4 Scout (109B) at Q4 entirely in memory, with no CPU offloading. The only consumer device that can run 100B+ parameter models locally at reasonable speeds.

Server / Multi-GPU
8× H100 SXM / DGX H100

DeepSeek R1, DeepSeek V3 and Llama 3.1 405B require 320–800GB of GPU memory at Q4. These models only run meaningfully on multi-GPU server configurations with NVLink — or via cloud inference APIs.
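A quick way to sanity-check those figures is to compute the footprint of the weights alone (KV cache, activations, and runtime overhead come on top of this):

```python
def weight_gb(params_b: float, bits_per_param: float) -> float:
    """Storage for model weights alone, in GB:
    billions of params * bits per param / 8 bits per byte."""
    return params_b * bits_per_param / 8

# Llama 3.1 405B at Q4 -> ~203GB; DeepSeek R1 / V3 (671B) at Q4 -> ~336GB,
# rising to ~671GB at Q8. Multi-GPU territory in every case.
```

Even the smallest of these exceeds any single GPU's memory several times over, which is why the matrix marks all three as Server Only.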

Cloud Option
Together AI / Groq / Replicate

For frontier open-source models (DeepSeek R1, Llama 405B) without server hardware, cloud inference providers offer cost-effective access at a fraction of the capital outlay. Worth considering for low-frequency workloads.