What is Evalon?

Evalon is an independent resource for understanding and comparing large language models. We aggregate publicly available benchmark data, pricing information, and technical specifications to help developers, researchers, and curious people make informed decisions about which AI model is right for their use case.

We are not affiliated with any AI laboratory or model provider. All assessments are based on publicly available information and independently computed composite scores.

Models tracked: 22
Providers covered: 9
Benchmarks used: 6
News sources: 5

How we calculate the Overall Score

Our composite Overall Score combines six publicly available benchmarks into a single 0–100 figure. Each benchmark is weighted to reflect its relevance to real-world AI utility. No single benchmark can fully characterise a model's capability, so we use a balanced blend.

Score formula
score = (MMLU × 0.20)
      + (HumanEval × 0.20)
      + (MATH × 0.15)
      + (GPQA × 0.15)
      + (MT‑Bench × 10 × 0.15)
      + (clamp((Arena ELO − 1000) ÷ 4, 0, 100) × 0.15)

MT‑Bench is scored out of 10, so it is multiplied by 10 to normalise to 0–100. Arena ELO is transformed to an approximate 0–100 scale using a linear clamp anchored at ELO 1000 (≈ score 0) and ELO 1400 (≈ score 100).
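The weighting and normalisation described above can be sketched in a few lines of Python. The function name and signature below are illustrative, not part of any published Evalon code; the weights and the ELO clamp anchors are taken directly from the formula.

```python
def overall_score(mmlu, humaneval, math, gpqa, mt_bench, arena_elo):
    """Composite 0-100 score from six benchmark figures.

    mmlu, humaneval, math, gpqa: accuracy percentages on a 0-100 scale.
    mt_bench: judged 1-10, so multiplied by 10 to normalise to 0-100.
    arena_elo: mapped linearly so that ELO 1000 -> 0 and ELO 1400 -> 100,
    then clamped to the [0, 100] range.
    """
    elo_component = min(max((arena_elo - 1000) / 4, 0), 100)
    return (mmlu * 0.20
            + humaneval * 0.20
            + math * 0.15
            + gpqa * 0.15
            + mt_bench * 10 * 0.15
            + elo_component * 0.15)
```

Because the six weights sum to 1.0 and every component is normalised to 0–100 before weighting, a hypothetical model scoring perfectly on every benchmark would receive an Overall Score of exactly 100.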

Benchmark descriptions & sources

All benchmark figures are sourced from official model technical reports, independent evaluation papers, and the Chatbot Arena leaderboard. Where multiple reported values exist, we use the most widely cited result.

📚
MMLU
Massive Multitask Language Understanding
Weight: 20%

Tests knowledge across 57 academic subjects including mathematics, history, law, medicine, and computer science. Scored as percentage of correct multiple-choice answers.

arxiv.org/abs/2009.03300 ↗
💻
HumanEval
OpenAI Code Generation Benchmark
Weight: 20%

Measures code generation ability. Models must write correct Python functions from natural language docstrings. Pass@1 rate reported as a percentage.

arxiv.org/abs/2107.03374 ↗
🔢
MATH
Competition Mathematics Benchmark
Weight: 15%

A dataset of 12,500 competition maths problems spanning algebra, geometry, number theory, and calculus. Requires multi-step symbolic reasoning and derivation.

arxiv.org/abs/2103.03874 ↗
🔬
GPQA
Graduate-Level Google-Proof Q&A
Weight: 15%

Expert-validated questions in biology, chemistry, and physics that require genuine scientific understanding. Human experts without specialist knowledge score ~34%.

arxiv.org/abs/2311.12022 ↗
💬
MT-Bench
Multi-Turn Conversation Quality
Weight: 15%

80 multi-turn questions across writing, roleplay, reasoning, maths, coding, extraction, STEM, and humanities. Scored 1–10 by GPT-4 as an independent judge.

arxiv.org/abs/2306.05685 ↗
🏆
Arena ELO
Chatbot Arena Human Preference Rating
Weight: 15%

ELO ratings derived from hundreds of thousands of blind human preference votes in head-to-head model battles on the LMSYS Chatbot Arena platform. Reflects real-world user preference.

chat.lmsys.org ↗

Where our data comes from

Model data is compiled from a combination of official provider documentation, independent research publications, and community evaluations.

📄
Official technical reports
Primary benchmark results as reported in each provider's model card, blog post, or technical paper at time of release.
🏟️
LMSYS Chatbot Arena
ELO ratings sourced from the public Chatbot Arena leaderboard, reflecting aggregated human preference votes.
📊
Papers With Code
Supplementary benchmark results and state-of-the-art tracking across standardised evaluation datasets.
🔍
Artificial Analysis
Independent third-party model performance and pricing benchmarks including latency and throughput measurements.
💰
Provider API documentation
Pricing, context window sizes, and capability flags sourced directly from provider API documentation and pricing pages.
🤗
Community evaluations
Supplementary data from community-run evaluations on Hugging Face Open LLM Leaderboard and similar platforms.

News feed sources

Our News page aggregates RSS feeds from the following publications. We do not host or reproduce article content — all articles link directly to the original source.

VentureBeat, TechCrunch, The Verge, MIT Technology Review, Ars Technica
⚠ Important disclaimer

Benchmark scores are snapshots in time and may not reflect the current state of a model following updates, fine-tuning, or system prompt changes by providers. Performance can vary significantly depending on task type, prompt formulation, and evaluation methodology.

Pricing data is indicative and may have changed since publication. Always verify current pricing with the provider before making commercial decisions.

Evalon is provided for informational purposes only. We make no warranties regarding the accuracy or completeness of any data presented. All trademarks and model names are the property of their respective owners.