What is Evalon?
Evalon is an independent resource for understanding and comparing large language models. We aggregate publicly available benchmark data, pricing information, and technical specifications to help developers, researchers, and curious people make informed decisions about which AI model is right for their use case.
We are not affiliated with any AI laboratory or model provider. All assessments are based on publicly available information and independently computed composite scores.
How we calculate the Overall Score
Our composite Overall Score combines six publicly available benchmarks into a single 0–100 figure. Each benchmark is weighted to reflect its relevance to real-world AI utility. No single benchmark can fully characterise a model's capability, so we use a balanced blend.
Overall Score = (MMLU × 0.20)
  + (HumanEval × 0.20)
  + (MATH × 0.15)
  + (GPQA × 0.15)
  + (MT‑Bench × 10 × 0.15)
  + (clamp((Arena ELO − 1000) ÷ 4, 0, 100) × 0.15)
MT‑Bench is scored out of 10, so it is multiplied by 10 to normalise to 0–100. Arena ELO is transformed to an approximate 0–100 scale using a linear clamp anchored at ELO 1000 (≈ score 0) and ELO 1400 (≈ score 100).
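As a concrete illustration of the formula above, here is a minimal Python sketch of the calculation; the function name and the example inputs are our own and do not correspond to any real model's results.

def overall_score(mmlu, humaneval, math, gpqa, mt_bench, arena_elo):
    # mmlu, humaneval, math and gpqa are percentages (0-100);
    # mt_bench is on a 1-10 scale; arena_elo is a Chatbot Arena rating.
    mt_bench_100 = mt_bench * 10                            # normalise MT-Bench to 0-100
    arena_100 = min(max((arena_elo - 1000) / 4, 0), 100)    # ELO 1000 -> 0, ELO 1400 -> 100
    return (mmlu * 0.20 + humaneval * 0.20 + math * 0.15
            + gpqa * 0.15 + mt_bench_100 * 0.15 + arena_100 * 0.15)

# Illustrative, made-up inputs only:
print(round(overall_score(86, 84, 60, 50, 9.0, 1250), 1))   # prints 73.4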
Benchmark descriptions & sources
All benchmark figures are sourced from official model technical reports, independent evaluation papers, and the Chatbot Arena leaderboard. Where multiple reported values exist, we use the most widely cited result.
MMLU
Tests knowledge across 57 academic subjects including mathematics, history, law, medicine, and computer science. Scored as the percentage of correct multiple-choice answers.
Source: arxiv.org/abs/2009.03300

HumanEval
Measures code generation ability. Models must write correct Python functions from natural-language docstrings. Reported as the pass@1 rate, expressed as a percentage.
Source: arxiv.org/abs/2107.03374

MATH
A dataset of 12,500 competition maths problems spanning algebra, geometry, number theory, and precalculus. Requires multi-step symbolic reasoning and derivation.
Source: arxiv.org/abs/2103.03874

GPQA
Expert-validated questions in biology, chemistry, and physics that require genuine scientific understanding. Skilled non-experts score only around 34%, even with unrestricted web access.
Source: arxiv.org/abs/2311.12022

MT‑Bench
80 multi-turn questions across writing, roleplay, reasoning, maths, coding, extraction, STEM, and humanities. Each response is scored 1–10 by GPT-4 acting as an automated judge.
Source: arxiv.org/abs/2306.05685

Chatbot Arena ELO
ELO ratings derived from hundreds of thousands of blind human preference votes in head-to-head model battles on the LMSYS Chatbot Arena platform. Reflects real-world user preference.
Source: chat.lmsys.org

Where our data comes from
Model data is compiled from a combination of official provider documentation, independent research publications, and community evaluations.
News feed sources
Our News page aggregates RSS feeds from the following publications. We do not host or reproduce article content — all articles link directly to the original source.
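As an illustration only (not our production code), feed aggregation along these lines can be done with the open-source feedparser library; the feed URL below is a placeholder rather than one of our actual sources.

import feedparser  # third-party library: pip install feedparser

FEED_URL = "https://example.com/ai-news/rss.xml"  # placeholder, not a real source

feed = feedparser.parse(FEED_URL)
for entry in feed.entries[:5]:
    # We link out to the original article rather than reproducing its content.
    print(entry.title, "->", entry.link)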
Limitations & disclaimers
Benchmark scores are snapshots in time and may not reflect the current state of a model following updates, fine-tuning, or system prompt changes by providers. Performance can vary significantly depending on task type, prompt formulation, and evaluation methodology.
Pricing data is indicative and may have changed since publication. Always verify current pricing with the provider before making commercial decisions.
Evalon is provided for informational purposes only. We make no warranties regarding the accuracy or completeness of any data presented. All trademarks and model names are the property of their respective owners.