What is Evalon?

Evalon is an independent resource for understanding and comparing large language models. We aggregate publicly available benchmark data, pricing information, and technical specifications to help developers, researchers, and curious people make informed decisions about which AI model is right for their use case.

We are not affiliated with any AI laboratory or model provider. All assessments are based on publicly available information and independently computed composite scores.

Models tracked: 22
Providers covered: 9
Benchmarks used: 6
News sources: 5

How we calculate the Overall Score

Our composite Overall Score combines six publicly available benchmarks into a single 0–100 figure. Each benchmark is weighted to reflect its relevance to real-world AI utility. No single benchmark can fully characterise a model's capability, so we use a balanced blend.

Score formula
score = (MMLU × 0.20)
      + (HumanEval × 0.20)
      + (MATH × 0.15)
      + (GPQA × 0.15)
      + (MT‑Bench × 10 × 0.15)
      + (clamp((Arena ELO − 1000) ÷ 4, 0, 100) × 0.15)

MT‑Bench is scored out of 10, so it is multiplied by 10 to normalise to 0–100. Arena ELO is transformed to an approximate 0–100 scale using a linear clamp anchored at ELO 1000 (≈ score 0) and ELO 1400 (≈ score 100).
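The weighting and normalisation described above can be sketched in a few lines of Python. The function name and signature below are illustrative, not part of any published Evalon code; the weights and the ELO clamp anchors are taken directly from the formula.

```python
def overall_score(mmlu, humaneval, math, gpqa, mt_bench, arena_elo):
    """Composite 0-100 score from six benchmark figures.

    mmlu, humaneval, math, gpqa: accuracy percentages on a 0-100 scale.
    mt_bench: judged 1-10, so multiplied by 10 to normalise to 0-100.
    arena_elo: mapped linearly so that ELO 1000 -> 0 and ELO 1400 -> 100,
    then clamped to the [0, 100] range.
    """
    elo_component = min(max((arena_elo - 1000) / 4, 0), 100)
    return (mmlu * 0.20
            + humaneval * 0.20
            + math * 0.15
            + gpqa * 0.15
            + mt_bench * 10 * 0.15
            + elo_component * 0.15)
```

Because the six weights sum to 1.0 and every component is normalised to 0–100 before weighting, a hypothetical model scoring perfectly on every benchmark would receive an Overall Score of exactly 100.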

Benchmark descriptions & sources

All benchmark figures are sourced from official model technical reports, independent evaluation papers, and the Chatbot Arena leaderboard. Where multiple reported values exist, we use the most widely cited result.

📚
MMLU
Massive Multitask Language Understanding
Weight: 20%

Tests knowledge across 57 academic subjects including mathematics, history, law, medicine, and computer science. Scored as percentage of correct multiple-choice answers.

arxiv.org/abs/2009.03300 ↗
💻
HumanEval
OpenAI Code Generation Benchmark
Weight: 20%

Measures code generation ability. Models must write correct Python functions from natural language docstrings. Pass@1 rate reported as a percentage.

arxiv.org/abs/2107.03374 ↗
🔢
MATH
Competition Mathematics Benchmark
Weight: 15%

A dataset of 12,500 competition maths problems spanning algebra, geometry, number theory, and calculus. Requires multi-step symbolic reasoning and derivation.

arxiv.org/abs/2103.03874 ↗
🔬
GPQA
Graduate-Level Google-Proof Q&A
Weight: 15%

Expert-validated questions in biology, chemistry, and physics that require genuine scientific understanding. Human experts without specialist knowledge score ~34%.

arxiv.org/abs/2311.12022 ↗
💬
MT-Bench
Multi-Turn Conversation Quality
Weight: 15%

80 multi-turn questions across writing, roleplay, reasoning, maths, coding, extraction, STEM, and humanities. Scored 1–10 by GPT-4 as an independent judge.

arxiv.org/abs/2306.05685 ↗
🏆
Arena ELO
Chatbot Arena Human Preference Rating
Weight: 15%

ELO ratings derived from hundreds of thousands of blind human preference votes in head-to-head model battles on the LMSYS Chatbot Arena platform. Reflects real-world user preference.

chat.lmsys.org ↗

Where our data comes from

Model data is compiled from a combination of official provider documentation, independent research publications, and community evaluations.

📄
Official technical reports
Primary benchmark results as reported in each provider's model card, blog post, or technical paper at time of release.
🏟️
LMSYS Chatbot Arena
ELO ratings sourced from the public Chatbot Arena leaderboard, reflecting aggregated human preference votes.
📊
Papers With Code
Supplementary benchmark results and state-of-the-art tracking across standardised evaluation datasets.
🔍
Artificial Analysis
Independent third-party model performance and pricing benchmarks including latency and throughput measurements.
💰
Provider API documentation
Pricing, context window sizes, and capability flags sourced directly from provider API documentation and pricing pages.
🤗
Community evaluations
Supplementary data from community-run evaluations on Hugging Face Open LLM Leaderboard and similar platforms.

News feed sources

Our News page aggregates RSS feeds from the following publications. We do not host or reproduce article content — all articles link directly to the original source.

VentureBeat, TechCrunch, The Verge, MIT Technology Review, Ars Technica
⚠ Important disclaimer

Benchmark scores are snapshots in time and may not reflect the current state of a model following updates, fine-tuning, or system prompt changes by providers. Performance can vary significantly depending on task type, prompt formulation, and evaluation methodology.

Pricing data is indicative and may have changed since publication. Always verify current pricing with the provider before making commercial decisions.

Evalon is provided for informational purposes only. We make no warranties regarding the accuracy or completeness of any data presented. All trademarks and model names are the property of their respective owners.