# Open-Source AI on Your GPU

Rankings of the best open-source models and real-world inference speed across consumer and workstation NVIDIA hardware.

## Open-Source Model Leaderboard
| # | Model | Org | Score | MMLU | MATH | HumanEval | GPQA | Arena ELO | Tags |
|---|---|---|---|---|---|---|---|---|---|
| 1 | DeepSeek R1 | DeepSeek | 88.2 | 90.8% | 97.3% | 92.1% | 71.5% | 1358 | OSS, Flagship |
| 2 | Qwen3 72B | Alibaba | 84.9 | 88.0% | 90.0% | 92.5% | 65.4% | 1320 | OSS, Flagship |
| 3 | DeepSeek V3.1 671B | DeepSeek | 83.6 | 89.0% | 87.0% | 91.0% | 62.0% | 1310 | OSS, Flagship |
| 4 | Qwen3.5 122B | Alibaba | 82.2 | 87.0% | 89.0% | 94.0% | 57.0% | 1280 | OSS, Flagship |
| 5 | Mistral Large 3 675B | Mistral AI | 82.1 | 88.0% | 85.0% | 90.0% | 60.0% | 1295 | OSS, Flagship |
| 6 | DeepSeek V3 | DeepSeek | 81.0 | 88.5% | 87.2% | 89.1% | 51.3% | 1302 | OSS, Flagship |
| 7 | GPT-OSS 120B | OpenAI | 80.9 | 87.0% | 84.0% | 90.0% | 58.0% | 1285 | OSS, Flagship |
| 8 | Llama 4 Maverick | Meta | 78.9 | 88.7% | 74.9% | 89.8% | 52.8% | 1285 | OSS, MM, Flagship |
| 9 | DeepSeek R1 Distill 70B | DeepSeek | 78.5 | 84.0% | 90.0% | 86.0% | 55.0% | 1250 | OSS, Flagship |
| 10 | Qwen3.5 35B | Alibaba | 77.9 | 83.0% | 86.0% | 92.0% | 50.0% | 1245 | OSS, Flagship |
| 11 | Llama 3.1 405B | Meta | 77.6 | 88.6% | 73.8% | 89.0% | 51.1% | 1266 | OSS, Flagship |
| 12 | Devstral 2 123B | Mistral AI | 77.5 | 82.0% | 79.0% | 94.0% | 50.0% | 1255 | OSS, Flagship |
| 13 | Qwen 2.5 72B | Alibaba | 77.4 | 86.0% | 83.1% | 86.6% | 49.0% | 1259 | OSS, Flagship |
| 14 | Phi-4 | Microsoft | 77.4 | 84.8% | 80.4% | 82.6% | 56.1% | 1280 | OSS, Efficient |
| 15 | Llama 3.3 70B | Meta | 77.0 | 86.0% | 77.0% | 88.0% | 50.5% | 1256 | OSS, Flagship |
| 16 | Qwen3 14B | Alibaba | 76.3 | 82.0% | 88.0% | 87.0% | 50.0% | 1230 | OSS, Efficient |
| 17 | DeepSeek V2.5 236B | DeepSeek | 76.2 | 80.4% | 75.7% | 89.0% | 50.7% | 1268 | OSS, Flagship |
| 18 | Gemma 3 27B | Google DeepMind | 75.6 | 87.5% | 72.0% | 77.2% | 52.9% | 1290 | OSS, MM, Efficient |
| 19 | Qwen3 Coder 30B | Alibaba | 75.2 | 78.0% | 82.0% | 93.0% | 44.0% | 1240 | OSS, Flagship |
| 20 | Qwen3 VL 32B | Alibaba | 74.6 | 81.0% | 85.0% | 84.0% | 48.0% | 1225 | OSS, MM, Flagship |
| 21 | Llama 4 Scout | Meta | 74.1 | 87.1% | 67.4% | 86.5% | 47.1% | 1248 | OSS, MM, Efficient |
| 22 | Mistral Large 2 | Mistral AI | 73.9 | 84.0% | 69.3% | 92.0% | 45.0% | 1232 | OSS, Flagship |
| 23 | Cogito 70B | DeepCogito | 72.9 | 82.0% | 75.0% | 82.0% | 48.0% | 1230 | OSS, Flagship |
| 24 | Qwen 2.5 14B | Alibaba | 72.6 | 79.5% | 83.0% | 86.0% | 42.0% | 1210 | OSS, Efficient |
| 25 | Llama 3.2 Vision 90B | Meta | 71.6 | 83.0% | 69.0% | 81.0% | 46.0% | 1228 | OSS, MM, Flagship |
| 26 | Llama 3.1 70B | Meta | 71.2 | 83.6% | 66.4% | 80.5% | 46.7% | 1220 | OSS, Flagship |
| 27 | Nemotron 3 Nano 30B | NVIDIA | 69.0 | 78.0% | 72.0% | 80.0% | 40.0% | 1210 | OSS, Efficient |
| 28 | Qwen 2.5 7B | Alibaba | 68.7 | 74.2% | 80.0% | 84.5% | 36.0% | 1185 | OSS, Efficient |
| 29 | Devstral Small 2 24B | Mistral AI | 67.9 | 74.0% | 68.0% | 88.0% | 36.0% | 1195 | OSS, Efficient |
| 30 | Gemma 3 12B | Google DeepMind | 67.6 | 78.0% | 72.0% | 76.0% | 38.0% | 1200 | OSS, MM, Efficient |
| 31 | GLM 4.7 | Zhipu AI | 66.0 | 75.0% | 70.0% | 76.0% | 36.0% | 1200 | OSS, Efficient |
| 32 | Qwen3 VL 8B | Alibaba | 64.8 | 72.0% | 72.0% | 78.0% | 34.0% | 1175 | OSS, MM, Efficient |
| 33 | Ministral 3 14B | Mistral AI | 64.0 | 73.0% | 62.0% | 78.0% | 35.0% | 1185 | OSS, Efficient |
| 34 | OLMo 3 32B | AllenAI | 64.0 | 75.0% | 62.0% | 74.0% | 35.0% | 1195 | OSS, Efficient |
| 35 | Mixtral 8x7B | Mistral AI | 63.0 | 70.6% | 58.0% | 75.1% | 35.0% | 1191 | OSS, Flagship |
| 36 | Phi-3.5 Mini | Microsoft | 61.9 | 69.0% | 69.0% | 78.0% | 30.0% | 1150 | OSS, Efficient |
| 37 | Gemma 2 9B | Google DeepMind | 61.7 | 71.3% | 58.0% | 71.0% | 33.0% | 1190 | OSS, Efficient |
| 38 | Mathstral 7B | Mistral AI | 61.4 | 64.0% | 86.0% | 60.0% | 38.0% | 1165 | OSS, Efficient |
| 39 | Llama 3.2 Vision 11B | Meta | 61.2 | 73.0% | 58.0% | 72.0% | 32.0% | 1175 | OSS, MM, Efficient |
| 40 | Mistral Nemo 12B | Mistral AI | 60.9 | 68.0% | 55.0% | 75.0% | 33.0% | 1180 | OSS, Efficient |
| 41 | Granite Code 34B | IBM | 60.6 | 60.0% | 56.0% | 86.0% | 28.0% | 1180 | OSS, Flagship |
| 42 | Llama 3.1 8B | Meta | 60.5 | 73.0% | 51.9% | 72.6% | 32.8% | 1170 | OSS, Efficient |
| 43 | Ministral 3 8B | Mistral AI | 58.2 | 67.0% | 52.0% | 74.0% | 30.0% | 1155 | OSS, Efficient |
| 44 | Gemma 3 4B | Google DeepMind | 58.1 | 68.0% | 62.0% | 64.0% | 30.0% | 1160 | OSS, MM, Efficient |
| 45 | CodeGemma 7B | Google DeepMind | 55.4 | 54.0% | 50.0% | 82.0% | 25.0% | 1145 | OSS, Efficient |
| 46 | Mistral 7B | Mistral AI | 54.8 | 64.2% | 40.5% | 73.0% | 28.8% | 1141 | OSS, Efficient |
| 47 | OLMo 3 7B | AllenAI | 53.4 | 65.0% | 45.0% | 64.0% | 27.0% | 1140 | OSS, Efficient |
| 48 | Granite Code 8B | IBM | 50.3 | 51.0% | 40.0% | 75.0% | 20.0% | 1130 | OSS, Efficient |
| 49 | Llama 3.2 3B | Meta | 49.8 | 63.4% | 40.0% | 58.0% | 24.0% | 1120 | OSS, Efficient |
| 50 | Ministral 3 3B | Mistral AI | 48.5 | 61.0% | 42.0% | 55.0% | 23.0% | 1115 | OSS, Efficient |
| 51 | Llama 3.2 1B | Meta | 35.8 | 49.3% | 25.0% | 38.0% | 15.0% | 1070 | OSS, Efficient |
| 52 | Gemma 3 1B | Google DeepMind | 35.5 | 44.0% | 32.0% | 40.0% | 18.0% | 1050 | OSS, Efficient |
*Score is a weighted composite of MMLU, HumanEval, MATH, GPQA, MT-Bench, and Arena ELO.*
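The composite can be sketched as code. The leaderboard does not publish its weights or normalization, so the values below are illustrative assumptions, not the site's actual methodology:

```python
def composite_score(mmlu, humaneval, math, gpqa, mt_bench, arena_elo,
                    weights=(0.25, 0.20, 0.20, 0.15, 0.10, 0.10)):
    """Weighted composite of six benchmarks on a 0-100 scale.

    The weights and normalization constants are illustrative
    assumptions; the real methodology may differ.
    """
    mt_norm = mt_bench * 10                # MT-Bench 0-10 -> 0-100
    elo_norm = (arena_elo - 1000) / 4.0    # Arena ELO 1000-1400 -> 0-100
    parts = (mmlu, humaneval, math, gpqa, mt_norm, elo_norm)
    return round(sum(w * p for w, p in zip(weights, parts)), 1)

# DeepSeek R1's row, with an assumed MT-Bench score of 9.0:
score = composite_score(90.8, 92.1, 97.3, 71.5, 9.0, 1358)
```

With these assumed weights the example lands near, but not exactly on, the table's 88.2; matching the published scores would require the site's real weights.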
## GPU Specifications

### Performance Matrix (tokens / second)
| Model | DGX Spark 128GB | RTX 5090 32GB | RTX 5080 16GB | RTX 4090 24GB | RTX 4080 Super 16GB | RTX 3090 24GB | RTX 3080 Ti 12GB |
|---|---|---|---|---|---|---|---|
| Gemma 3 1B · Google DeepMind · 1B Dense | 550t/s Q8 | 500t/s Q8 | 420t/s Q8 | 400t/s Q8 | 340t/s Q8 | 320t/s Q8 | 300t/s Q8 |
| Llama 3.2 1B · Meta · 1B Dense | 550t/s Q8 | 500t/s Q8 | 420t/s Q8 | 400t/s Q8 | 340t/s Q8 | 320t/s Q8 | 300t/s Q8 |
| Llama 3.2 3B · Meta · 3B Dense | 350t/s Q8 | 280t/s Q8 | 240t/s Q8 | 230t/s Q8 | 190t/s Q8 | 180t/s Q8 | 165t/s Q8 |
| Ministral 3 3B · Mistral AI · 3B Dense | 350t/s Q8 | 280t/s Q8 | 240t/s Q8 | 230t/s Q8 | 190t/s Q8 | 180t/s Q8 | 165t/s Q8 |
| Phi-3.5 Mini 3.8B · Microsoft · 3.8B Dense | 300t/s Q8 | 240t/s Q8 | 200t/s Q8 | 195t/s Q8 | 160t/s Q8 | 152t/s Q8 | 140t/s Q8 |
| Gemma 3 4B · Google DeepMind · 4B Dense | 300t/s Q8 | 240t/s Q8 | 200t/s Q8 | 195t/s Q8 | 160t/s Q8 | 152t/s Q8 | 140t/s Q8 |
| Mistral 7B · Mistral AI · 7B Dense | 185t/s Q8 | 140t/s Q8 | 112t/s Q8 | 108t/s Q8 | 90t/s Q8 | 85t/s Q8 | 75t/s Q8 |
| Mathstral 7B · Mistral AI · 7B Dense | 185t/s Q8 | 140t/s Q8 | 112t/s Q8 | 108t/s Q8 | 90t/s Q8 | 85t/s Q8 | 75t/s Q8 |
| Qwen 2.5 7B · Alibaba · 7B Dense | 185t/s Q8 | 140t/s Q8 | 112t/s Q8 | 108t/s Q8 | 90t/s Q8 | 85t/s Q8 | 75t/s Q8 |
| OLMo 3 7B · AllenAI · 7B Dense | 185t/s Q8 | 140t/s Q8 | 112t/s Q8 | 108t/s Q8 | 90t/s Q8 | 85t/s Q8 | 75t/s Q8 |
| CodeGemma 7B · Google DeepMind · 7B Dense | 185t/s Q8 | 140t/s Q8 | 112t/s Q8 | 108t/s Q8 | 90t/s Q8 | 85t/s Q8 | 75t/s Q8 |
| Llama 3.1 8B · Meta · 8B Dense | 185t/s Q8 | 135t/s Q8 | 112t/s Q8 | 108t/s Q8 | 90t/s Q8 | 85t/s Q8 | 72t/s Q4_K_M |
| Ministral 3 8B · Mistral AI · 8B Dense | 175t/s Q8 | 130t/s Q8 | 106t/s Q8 | 102t/s Q8 | 85t/s Q8 | 80t/s Q8 | 70t/s Q8 |
| Granite Code 8B · IBM · 8B Dense | 175t/s Q8 | 130t/s Q8 | 106t/s Q8 | 102t/s Q8 | 85t/s Q8 | 80t/s Q8 | 70t/s Q8 |
| Qwen3 VL 8B · Alibaba · 8B Dense | 175t/s Q8 | 130t/s Q8 | 106t/s Q8 | 102t/s Q8 | 85t/s Q8 | 80t/s Q8 | 70t/s Q8 |
| GLM 4.7 · Zhipu AI · 9B Dense | 160t/s Q8 | 118t/s Q8 | 99t/s Q8 | 95t/s Q8 | 79t/s Q8 | 74t/s Q8 | 65t/s Q8 |
| Gemma 2 9B · Google DeepMind · 9B Dense | 160t/s Q8 | 118t/s Q8 | 99t/s Q8 | 95t/s Q8 | 79t/s Q8 | 74t/s Q8 | 65t/s Q8 |
| Llama 3.2 Vision 11B · Meta · 11B Dense | 140t/s Q8 | 105t/s Q8 | 88t/s Q8 | 84t/s Q8 | 70t/s Q8 | 65t/s Q8 | 67t/s Q4_K_M |
| Mistral Nemo 12B · Mistral AI · 12B Dense | 135t/s Q8 | 100t/s Q8 | 84t/s Q8 | 80t/s Q8 | 67t/s Q8 | 62t/s Q8 | 62t/s Q4_K_M |
| Gemma 3 12B · Google DeepMind · 12B Dense | 135t/s Q8 | 100t/s Q8 | 84t/s Q8 | 80t/s Q8 | 67t/s Q8 | 62t/s Q8 | 62t/s Q4_K_M |
| Phi-4 14B · Microsoft · 14B Dense | 130t/s Q8 | 92t/s Q8 | 70t/s Q4_K_M | 74t/s Q8 | 60t/s Q4_K_M | 60t/s Q8 | 42t/s Q4_K_M |
| Qwen 2.5 14B · Alibaba · 14B Dense | 125t/s Q8 | 90t/s Q8 | 83t/s Q4_K_M | 74t/s Q8 | 69t/s Q4_K_M | 60t/s Q8 | 48t/s Q4_K_M |
| Qwen3 14B · Alibaba · 14B Dense | 125t/s Q8 | 90t/s Q8 | 83t/s Q4_K_M | 74t/s Q8 | 69t/s Q4_K_M | 60t/s Q8 | 48t/s Q4_K_M |
| Ministral 3 14B · Mistral AI · 14B Dense | 125t/s Q8 | 90t/s Q8 | 83t/s Q4_K_M | 74t/s Q8 | 69t/s Q4_K_M | 60t/s Q8 | 48t/s Q4_K_M |
| Devstral Small 2 24B · Mistral AI · 24B Dense | 75t/s Q8 | 52t/s Q8 | 44t/s Q4_K_M | 48t/s Q4_K_M | 37t/s Q4_K_M | 44t/s Q4_K_M | 3t/s Q4_K_M |
| Gemma 3 27B · Google DeepMind · 27B Dense | 70t/s Q8 | 55t/s Q4_K_M | 7t/s Q4_K_M | 44t/s Q4_K_M | 5t/s Q4_K_M | 39t/s Q4_K_M | 2t/s Q4_K_M |
| Qwen3 Coder 30B · Alibaba · 30B Dense | 62t/s Q8 | 51t/s Q4_K_M | 5t/s Q4_K_M | 39t/s Q4_K_M | 5t/s Q4_K_M | 34t/s Q4_K_M | 2t/s Q4_K_M |
| Nemotron 3 Nano 30B · NVIDIA · 30B Dense | 62t/s Q8 | 51t/s Q4_K_M | 5t/s Q4_K_M | 39t/s Q4_K_M | 5t/s Q4_K_M | 34t/s Q4_K_M | 2t/s Q4_K_M |
| Qwen3 VL 32B · Alibaba · 32B Dense | 58t/s Q8 | 48t/s Q4_K_M | 5t/s Q4_K_M | 37t/s Q4_K_M | 4t/s Q4_K_M | 32t/s Q4_K_M | 1t/s Q4_K_M |
| OLMo 3 32B · AllenAI · 32B Dense | 58t/s Q8 | 48t/s Q4_K_M | 5t/s Q4_K_M | 37t/s Q4_K_M | 4t/s Q4_K_M | 32t/s Q4_K_M | 1t/s Q4_K_M |
| Granite Code 34B · IBM · 34B Dense | 55t/s Q8 | 46t/s Q4_K_M | 4t/s Q4_K_M | 34t/s Q4_K_M | 3t/s Q4_K_M | 30t/s Q4_K_M | 1t/s Q4_K_M |
| Qwen3.5 35B · Alibaba · 35B Dense | 54t/s Q8 | 44t/s Q4_K_M | 4t/s Q4_K_M | 33t/s Q4_K_M | 3t/s Q4_K_M | 29t/s Q4_K_M | 1t/s Q4_K_M |
| Mixtral 8x7B · Mistral AI · 46B MoE (12B active) | 150t/s Q8 | 110t/s Q8 | 28t/s Q4_K_M | 82t/s Q4_K_M | 24t/s Q4_K_M | 72t/s Q4_K_M | 8t/s Q4_K_M |
| Llama 3.1 70B · Meta · 70B Dense | 55t/s Q8 | 3t/s Q4_K_M | Server Only | 1t/s Q4_K_M | Server Only | 1t/s Q4_K_M | Server Only |
| Llama 3.3 70B · Meta · 70B Dense | 55t/s Q8 | 3t/s Q4_K_M | Server Only | 1t/s Q4_K_M | Server Only | 1t/s Q4_K_M | Server Only |
| DeepSeek R1 Distill 70B · DeepSeek · 70B Dense | 55t/s Q8 | 3t/s Q4_K_M | Server Only | 1t/s Q4_K_M | Server Only | 1t/s Q4_K_M | Server Only |
| Cogito 70B · DeepCogito · 70B Dense | 55t/s Q8 | 3t/s Q4_K_M | Server Only | 1t/s Q4_K_M | Server Only | 1t/s Q4_K_M | Server Only |
| Qwen 2.5 72B · Alibaba · 72B Dense | 55t/s Q4_K_M | 20t/s Q4_K_M | 6t/s Q4_K_M | 12t/s Q4_K_M | 5t/s Q4_K_M | 9t/s Q4_K_M | 2t/s Q4_K_M |
| Qwen3 72B · Alibaba · 72B Dense | 55t/s Q8 | 3t/s Q4_K_M | Server Only | 1t/s Q4_K_M | Server Only | 1t/s Q4_K_M | Server Only |
| Llama 3.2 Vision 90B · Meta · 90B Dense | 38t/s Q8 | 1t/s Q4_K_M | Server Only | 1t/s Q4_K_M | Server Only | 1t/s Q4_K_M | Server Only |
| Llama 4 Scout · Meta · 109B MoE (17B active) | 48t/s Q4_K_M | 22t/s Q4_K_M | 7t/s Q4_K_M | 14t/s Q4_K_M | 6t/s Q4_K_M | 10t/s Q4_K_M | Server Only |
| GPT-OSS 120B · OpenAI · 120B MoE (5.1B active) | 40t/s Q4_K_M | 1t/s Q4_K_M | Server Only | Server Only | Server Only | Server Only | Server Only |
| Qwen3.5 122B · Alibaba · 122B Dense | 40t/s Q4_K_M | 1t/s Q4_K_M | Server Only | Server Only | Server Only | Server Only | Server Only |
| Mistral Large 2 · Mistral AI · 123B Dense | 35t/s Q4_K_M | 5t/s Q4_K_M | 2t/s Q4_K_M | 3t/s Q4_K_M | 1t/s Q4_K_M | 2t/s Q4_K_M | Server Only |
| Devstral 2 123B · Mistral AI · 123B Dense | 40t/s Q4_K_M | 1t/s Q4_K_M | Server Only | Server Only | Server Only | Server Only | Server Only |
| DeepSeek V2.5 236B · DeepSeek · 236B MoE (21B active) | 18t/s IQ2_XS | 3t/s Q4_K_M | Server Only | Server Only | Server Only | Server Only | Server Only |
| Llama 4 Maverick · Meta · 400B MoE (17B active) | 8t/s IQ2_XS | Server Only | Server Only | Server Only | Server Only | Server Only | Server Only |
| Llama 3.1 405B · Meta · 405B Dense | Server Only | Server Only | Server Only | Server Only | Server Only | Server Only | Server Only |
| DeepSeek R1 · DeepSeek · 671B MoE (37B active) | Server Only | Server Only | Server Only | Server Only | Server Only | Server Only | Server Only |
| DeepSeek V3.1 671B · DeepSeek · 671B MoE (37B active) | Server Only | Server Only | Server Only | Server Only | Server Only | Server Only | Server Only |
| Mistral Large 3 675B · Mistral AI · 675B Dense | Server Only | Server Only | Server Only | Server Only | Server Only | Server Only | Server Only |
| DeepSeek V3 · DeepSeek · 685B MoE (37B active) | Server Only | Server Only | Server Only | Server Only | Server Only | Server Only | Server Only |
*Figures are approximate generation throughput (output tokens/second) measured with llama.cpp on a system with 64GB of DDR5 RAM and NVMe storage. CPU-offload speeds depend heavily on system RAM bandwidth. DGX Spark figures reflect NVIDIA's unified-memory architecture, which eliminates the VRAM/RAM split. Results vary with batch size, context length, and driver version.*
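One way to sanity-check figures like these: single-stream decoding is memory-bandwidth-bound, since each generated token streams the full quantized weight set from VRAM once. The sketch below assumes a flat efficiency factor; the bits-per-weight and bandwidth values in the example are nominal, and real results also depend on kernels, KV-cache reads, and quantization format:

```python
def est_decode_tps(params_b, bits_per_weight, bandwidth_gb_s, efficiency=0.7):
    """Rough single-stream decode estimate for a fully GPU-resident model.

    tokens/s ~= effective bandwidth / bytes read per token, where bytes
    per token is the whole quantized weight set. `efficiency` is an
    assumed fudge factor for kernel and KV-cache overhead, not a
    measured constant.
    """
    weight_gb = params_b * bits_per_weight / 8.0
    return efficiency * bandwidth_gb_s / weight_gb

# Llama 3.1 8B at Q8_0 (~8.5 bits/weight) on an RTX 4090 (~1008 GB/s):
tps = est_decode_tps(8, 8.5, 1008)  # same order of magnitude as the table
```

The estimate collapses once weights spill into system RAM: offloaded layers are read at DDR5 bandwidth (tens of GB/s), which is why the table's partially offloaded entries drop to single-digit t/s.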
## Which GPU is Right for You?
**RTX 3090:** 24GB of GDDR6X makes it surprisingly capable for open-source LLMs. It runs Phi-4 14B natively at Q8 and handles Qwen 2.5 72B with partial CPU offload. The best second-hand value for local AI.
**RTX 5080:** GDDR7 delivers a large bandwidth improvement over the 4080 Super at the same price. It runs 14B models natively at Q4_K_M and 7–8B-class models at peak speed. A strong all-round choice for 2025.
**RTX 5090:** 32GB of GDDR7 with the highest bandwidth available to consumers. It runs Qwen 2.5 72B at usable speeds with partial offload and handles 14B models faster than any other single GPU. The definitive consumer AI card.
**DGX Spark:** 128GB of unified memory is the game-changer. It runs Mistral Large 2 (123B) and Llama 4 Scout (109B) natively, with no offloading. The only consumer device that can fully run 100B+ parameter models at reasonable speeds.
**Multi-GPU servers:** DeepSeek R1, DeepSeek V3, and Llama 3.1 405B require 320–800GB of GPU memory at Q4. These models only run meaningfully on multi-GPU server configurations with NVLink, or via cloud inference APIs.
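The memory requirements above follow directly from parameter count and quantization width. A minimal fit check, assuming a flat overhead allowance (in practice KV-cache size grows with context length):

```python
def fits_in_vram(params_b, bits_per_weight, vram_gb, overhead_gb=2.0):
    """True if a quantized model's weights fit in a GPU's memory.

    weight bytes = params * bits / 8. `overhead_gb` is an assumed
    allowance for KV cache, activations, and CUDA context. Q8_0 is
    roughly 8.5 bits/weight and Q4_K_M roughly 4.5 bits/weight.
    """
    weight_gb = params_b * bits_per_weight / 8.0
    return weight_gb + overhead_gb <= vram_gb

fits_in_vram(14, 8.5, 24)    # Phi-4 at Q8_0 on an RTX 3090: True
fits_in_vram(70, 4.5, 32)    # a 70B model at Q4_K_M on an RTX 5090: False
fits_in_vram(123, 4.5, 128)  # Mistral Large 2 at Q4_K_M on DGX Spark: True
```

The same arithmetic reproduces the figure quoted above: a 671B model at ~4.5 bits/weight needs roughly 380GB for weights alone, squarely in server territory.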
**Cloud inference:** For frontier open-source models (DeepSeek R1, Llama 405B) without server hardware, cloud inference providers offer access at a fraction of the capital outlay. Worth considering for low-frequency workloads.