Llama 3.2 Vision 11B

by Meta
Efficient · Free & Open Source · Multimodal · 🏆 Ranked #72 of 85
61.2
Overall Score
out of 100
About

Meta's compact vision-language model supporting image understanding and multimodal conversations. Runs on a single consumer GPU with 12GB VRAM and offers solid visual question answering capabilities.

Key Metrics
Context Window
128K
tokens
Avg Response
580
milliseconds
Input Cost
$0.08
per million tokens
Output Cost
$0.08
per million tokens
Arena ELO
1175
Chatbot Arena rating
MT-Bench
8.1
out of 10
Benchmark Scores
MMLU
73.0%
HumanEval
72.0%
MATH
58.0%
GPQA
32.0%
MT-Bench
8.1/10
Strengths & Limitations
Strengths
✓ Vision-language
✓ Open source
✓ Consumer GPU friendly
✓ Fast
✓ Good for size
Limitations
⚠ Limited complex visual reasoning
⚠ Below larger VL models
⚠ 12GB VRAM required
Ideal Use Cases
Image Q&A
Visual assistants
Document understanding
Multimodal chatbots
Research
Model Details
Provider Meta
Released 2024-09-25
Type Free & Open Source
Multimodal Yes
Tier Efficient
Global rank #72 / 85
Pricing (USD)
Input tokens $0.08/M
Output tokens $0.08/M
Per 1,000 tokens $0.00008 input / $0.00008 output (≈ $0.0001 each when rounded)
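
To make the per-token arithmetic concrete, here is a minimal Python sketch that turns the listed rates into a cost estimate for a single request. The function name `estimate_cost_usd` is hypothetical, and the sketch assumes billing is strictly linear in token count. How image inputs are converted into tokens is not stated on this page, so the caller has to supply token counts.

```python
from decimal import Decimal

# Rates copied from the pricing table above (USD per million tokens).
INPUT_USD_PER_MILLION = Decimal("0.08")
OUTPUT_USD_PER_MILLION = Decimal("0.08")
TOKENS_PER_MILLION = Decimal(1_000_000)


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> Decimal:
    """Hypothetical helper: linear cost estimate from token counts.

    Assumes no minimum charge, no rounding, and no separate image fee.
    """
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    input_cost = Decimal(input_tokens) * INPUT_USD_PER_MILLION / TOKENS_PER_MILLION
    output_cost = Decimal(output_tokens) * OUTPUT_USD_PER_MILLION / TOKENS_PER_MILLION
    return input_cost + output_cost


if __name__ == "__main__":
    # Example: a 2,000-token prompt with a 500-token reply.
    print(estimate_cost_usd(2_000, 500))
```

`Decimal` is used rather than `float` so that very small per-request amounts are not distorted by binary rounding. Under these assumptions, 1,000 input tokens cost $0.00008, which the page rounds to ≈ $0.0001.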
All Benchmarks
MMLU 73.0%
HumanEval 72.0%
MATH 58.0%
GPQA 32.0%
MT-Bench 8.1/10
Arena ELO 1175