Beginner's guide
How AI Language Models Work
A plain-English introduction to large language models — what they are, how they're built, and what all those scores and numbers actually mean.
01
What is an AI language model?
At its core, a large language model (LLM) is a computer program that has learned to understand and generate human language by studying enormous amounts of text.
Think of it like this: imagine reading billions of books, articles, code repositories, and websites — and from all of that reading, you gradually develop an intuition for how language works. You start to understand grammar, facts, reasoning patterns, and how ideas connect. An LLM does exactly this, but at a scale no human could manage.
The result is a model that can answer questions, write code, summarise documents, translate languages, debug software, solve maths problems, and hold a conversation — sometimes better than many humans. It doesn't "know" things the way a database does; instead, it has learned patterns and relationships between words, ideas, and concepts.
💡
The autocomplete analogy. Your phone's autocomplete suggests the next word based on what you've typed. An LLM is essentially a vastly more powerful version of this — it predicts the most likely next token, over and over, building a coherent response one token at a time.
The term "large" in LLM refers to the number of parameters — the billions of numerical values the model adjusts during training to store its learned patterns. Modern flagship models have hundreds of billions of parameters, which is why they require powerful data-centre hardware to run.
02
How are they trained?
Building a modern AI model is a multi-stage process. Each stage teaches the model something different — from basic language understanding, to following instructions, to being safe and helpful.
1
Pre-training — learning from the internet
The model is exposed to a massive dataset — typically trillions of words scraped from websites, books, academic papers, and code repositories. It learns by trying to predict the next word in each piece of text, adjusting its internal parameters every time it gets it wrong. This phase runs on thousands of specialised chips (GPUs or TPUs) for weeks or months, costing tens of millions of pounds. After pre-training, the model understands language deeply but isn't yet particularly useful for conversations — it will just continue text rather than answer questions helpfully.
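The "learn by predicting the next word" objective can be illustrated with something as simple as counting which word follows which in a corpus. Real pre-training adjusts billions of parameters by gradient descent rather than counting, but the goal — predicting the next token — is the same:

```python
from collections import defaultdict, Counter

# A tiny stand-in for "trillions of words scraped from the internet".
corpus = "the cat sat on the mat the cat ran".split()

# "Training": count how often each word follows each other word.
follow_counts = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    follow_counts[current_word][next_word] += 1

def predict_next(word):
    """Predict the most frequent continuation seen during training."""
    counts = follow_counts[word]
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("the"))  # "cat" — seen twice after "the", vs "mat" once
```

This also hints at why a freshly pre-trained model just continues text: it has only ever been rewarded for predicting what comes next, not for answering helpfully.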
2
Supervised fine-tuning — learning to follow instructions
Human trainers write thousands of example conversations: a question and an ideal answer. The model is trained on these examples so it learns how to behave — to answer questions directly, follow instructions, and produce useful responses rather than just continuing text. This is far cheaper than pre-training but critically shapes the model's personality and capabilities.
3
RLHF — learning what humans prefer
Reinforcement Learning from Human Feedback (RLHF) is where the model learns to produce responses that humans actually prefer. Human raters compare pairs of model responses and choose the better one. A separate "reward model" learns to predict human preference scores, and then the main model is trained to maximise this reward. This is why modern models feel conversational and helpful rather than robotic — they've been shaped by millions of human preference judgements.
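The reward model at the heart of RLHF is commonly based on the Bradley-Terry preference model: the difference between two responses' reward scores is turned into a probability that a human prefers one over the other. The reward values below are made-up numbers purely for illustration:

```python
import math

def preference_probability(reward_a, reward_b):
    """P(human prefers response A over B) under the Bradley-Terry model."""
    return 1 / (1 + math.exp(-(reward_a - reward_b)))

# Equal rewards -> a coin flip; a clearly better response -> near certainty.
print(preference_probability(1.0, 1.0))  # 0.5
print(preference_probability(3.0, 0.0))  # ≈ 0.95
```

Training the reward model means nudging these scores so the predicted probabilities match millions of real human votes; the main model is then tuned to produce responses that score highly.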
4
Safety alignment — guardrails and values
Finally, companies apply additional techniques to make models safer and more reliable. Anthropic's Constitutional AI has the model evaluate its own responses against a set of principles. OpenAI uses rule-based reward models. These techniques reduce harmful outputs, improve honesty, and make the model refuse requests it shouldn't fulfil — though no approach is perfect.
⚠️
Training cutoff. Every model's knowledge has a cutoff date — the point at which its training data ends. A model trained on data up to early 2024 will not know about events after that date unless it has access to live tools or search.
03
Key concepts explained
AI models come with a lot of jargon. Here's what the most important terms actually mean in plain English.
Token
The basic unit of text a model processes. A token is roughly three-quarters of a word in English — "Hello!" is two tokens ("Hello" + "!"), and longer words often split into two or more tokens depending on the tokeniser. Models count their input and output in tokens, which determines cost and speed.
~750 words ≈ 1,000 tokens
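The "~750 words ≈ 1,000 tokens" rule of thumb makes quick cost estimates easy. Real tokenisers (such as OpenAI's tiktoken library) give exact counts; this heuristic is only for back-of-envelope arithmetic:

```python
def estimate_tokens(text):
    """Rough token estimate using the '1 token ≈ 0.75 words' rule of thumb."""
    words = len(text.split())
    return round(words / 0.75)  # ~750 words ≈ 1,000 tokens

print(estimate_tokens("the quick brown fox jumps over the lazy dog"))  # 9 words -> 12
```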
Context window
The maximum amount of text a model can "see" at once — its working memory. A 128K context window fits roughly 100,000 words. Larger windows let you feed in entire codebases, long documents, or extended conversations.
128K tokens ≈ a 300-page novel
Parameters
The numerical values inside the model that store everything it has learned. More parameters generally means more capability and knowledge, but also higher cost to run. A "70B model" has 70 billion parameters.
GPT-4 estimated: ~1.8 trillion parameters
Temperature
A setting that controls how random or deterministic the model's responses are. Low temperature (near 0) gives consistent, predictable answers. High temperature (near 1–2) produces more varied and creative outputs — but also more errors.
0 = deterministic · 1 = balanced · 2 = creative
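Under the hood, temperature divides the model's raw scores (logits) before they are turned into probabilities by a softmax. The logits below are invented for illustration — the point is the shape of the output:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw logits into probabilities, sharpened or flattened by temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
print(softmax_with_temperature(logits, 0.2))  # near-certain top choice
print(softmax_with_temperature(logits, 1.0))  # balanced
print(softmax_with_temperature(logits, 2.0))  # flatter — more randomness
```

At low temperature the top token dominates (deterministic answers); at high temperature the distribution flattens, so sampling picks unlikely tokens more often — more creativity, more errors.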
Hallucination
When a model confidently states something that isn't true. Because LLMs generate the most statistically likely text rather than looking things up, they can invent plausible-sounding facts, citations, or people that don't exist.
Always verify AI-generated facts independently
Prompt
The text you send to the model to instruct it. Writing effective prompts (prompt engineering) is a skill — providing context, examples, and clear instructions dramatically improves output quality.
"You are a helpful assistant. Summarise this in 3 bullet points: …"
Inference
Running a trained model to generate a response. Training happens once (expensively). Inference happens every time someone sends a message, which is why providers charge per token.
Input tokens + output tokens = cost per request
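The per-token billing arithmetic is straightforward. The prices below are hypothetical — real prices vary by provider and model, and output tokens typically cost several times more than input tokens:

```python
PRICE_PER_M_INPUT = 2.50    # $ per million input tokens (hypothetical)
PRICE_PER_M_OUTPUT = 10.00  # $ per million output tokens (hypothetical)

def request_cost(input_tokens, output_tokens):
    """Cost in dollars for a single request at the prices above."""
    return (input_tokens * PRICE_PER_M_INPUT
            + output_tokens * PRICE_PER_M_OUTPUT) / 1_000_000

# A 2,000-token prompt with a 500-token reply:
print(f"${request_cost(2_000, 500):.4f}")  # $0.0100
```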
System prompt
A hidden set of instructions sent by the developer before the conversation begins, shaping how the model behaves. It can set a persona, define rules, or provide context the model should always remember throughout the session.
"You are a customer service agent for Acme Corp. Always be polite and concise."
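In practice the system prompt is just the first entry in the list of messages a developer sends with every request. The structure below follows the widely used OpenAI-style chat format; field names vary slightly between providers:

```python
# The system message is invisible to the end user but shapes every reply.
messages = [
    {"role": "system",
     "content": "You are a customer service agent for Acme Corp. "
                "Always be polite and concise."},
    {"role": "user", "content": "Where is my order?"},
]

# A provider SDK would send `messages` to the model; here we just show
# that the system prompt travels with the conversation on every request.
system_prompts = [m["content"] for m in messages if m["role"] == "system"]
print(len(system_prompts))  # 1
```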
Mixture of Experts (MoE)
An architecture where only a fraction of the model's parameters are activated for each token. This allows models to have a very large total parameter count but remain fast and efficient — only the relevant "expert" sub-networks engage for each query.
Llama 4, Mixtral, and Grok use MoE architectures
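The core of MoE is the router: for each token it scores every expert, and only the top-k experts actually run. The scores and expert count below are invented for illustration — real routers are small learned networks inside each MoE layer:

```python
def route_top_k(router_scores, k=2):
    """Return the indices of the k highest-scoring experts for this token."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    return sorted(ranked[:k])

# 8 experts available, but only 2 are activated for this token:
scores = [0.1, 0.7, 0.05, 0.9, 0.2, 0.3, 0.15, 0.4]
print(route_top_k(scores))  # [1, 3] — the other six experts stay idle
```

This is why an MoE model can have a huge total parameter count while the cost of processing each token stays close to that of a much smaller dense model.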
Chain-of-thought (CoT)
A technique where the model is asked (or trained) to think step-by-step before giving a final answer. This dramatically improves performance on complex reasoning and maths tasks — models make far fewer mistakes when they "show their working".
Reasoning models like o3 and DeepSeek R1 use extended CoT
04
Types of models
Not all models are built for the same purpose. Understanding the main categories helps you pick the right tool.
🚀
Flagship models
The most capable models a provider offers. Highest benchmark scores, best reasoning and creative writing, but also the slowest and most expensive. Best for complex, high-stakes tasks where quality matters most.
GPT-4o · Claude Opus 4 · Gemini 2.5 Pro · o3
⚡
Efficient / mini models
Smaller, faster, and far cheaper than flagships. They sacrifice some capability for speed and cost. Ideal for high-volume applications, simple tasks, real-time responses, and anything price-sensitive.
GPT-4o mini · Claude Haiku · Gemini 2.0 Flash · Llama 4 Scout
🧠
Reasoning models
Specifically trained to pause and think before answering — using extended chain-of-thought. They excel at maths, science, and complex logic problems but are much slower and pricier than standard models.
OpenAI o3 · DeepSeek R1 · Claude Sonnet (extended thinking)
👁️
Multimodal models
Can process images (and sometimes audio or video) as well as text. You can send a photo and ask a question about it, have it analyse diagrams, or ask it to describe what's in an image. Most modern flagship models are multimodal.
GPT-4o · Gemini 1.5 Pro · Claude Sonnet 4 · Llama 4
🔓
Open-source models
Models whose weights are publicly released, allowing anyone to download, run, modify, and fine-tune them. Great for privacy-sensitive deployments, research, and cost-free self-hosting — but require your own infrastructure.
Llama 3.1 · DeepSeek V3 · Mistral Large 2 · Qwen 2.5
🔒
Proprietary / closed models
Weights are kept private; you access them via an API only. Usually have stronger safety guardrails, dedicated infrastructure, and professional SLAs. You pay per token but don't need to manage any hardware.
GPT-4o · Claude · Gemini · Grok · Command R+
05
Reading benchmark scores
Benchmarks are standardised tests used to measure model capability. Each one targets a different skill. Here's what we track on Evalon and why each one matters.
| Benchmark | What it tests | Format | Score means |
| --- | --- | --- | --- |
| MMLU | Knowledge breadth — 14,000 multiple-choice questions across 57 subjects from school level to PhD: medicine, law, history, maths, ethics, and more. | Multiple choice | % correct — human expert ≈ 89% |
| HumanEval | Code generation — the model must write a Python function that passes all unit tests, given only a docstring description of what the function should do. | Write code | % tests passed — senior dev ≈ 80% |
| MATH | Competition-level mathematics — problems from AMC, AIME, and similar competitions requiring multi-step symbolic reasoning, proofs, and algebra. | Free-form answer | % correct — competition student ≈ 40–60% |
| GPQA | Graduate-level science — 448 PhD-level questions in biology, chemistry, and physics, written by domain experts and specifically designed to stump non-experts. | Multiple choice | % correct — non-expert humans ≈ 34% |
| MT-Bench | Multi-turn conversation quality — the model answers two-turn conversations across eight categories (writing, reasoning, coding, maths, etc.), then GPT-4 rates the responses. | Conversation + GPT-4 judge | Score 1–10 — 9+ is excellent |
| Arena ELO | Human preference — derived from the LMSYS Chatbot Arena where millions of real users compare two anonymous models side-by-side and vote for the better response. Reflects real-world usefulness. | Human vote | ELO rating — 1200+ is competitive |
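Arena ELO ratings move according to the standard Elo update rule: each vote shifts both models' ratings by an amount that depends on how surprising the result was. Chatbot Arena computes its leaderboard from millions of such votes; the sketch below shows the basic mechanics with illustrative ratings:

```python
def expected_score(rating_a, rating_b):
    """Expected win probability for A against B under the Elo model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a, rating_b, a_won, k=32):
    """Return A's new rating after one head-to-head vote."""
    expected = expected_score(rating_a, rating_b)
    actual = 1.0 if a_won else 0.0
    return rating_a + k * (actual - expected)

# An underdog (1200) beating a favourite (1300) gains more points
# than it would for beating an equal opponent:
print(round(update_elo(1200, 1300, a_won=True), 1))  # 1220.5
```

Upset wins move ratings a lot; expected wins barely move them — which is why a stable Elo gap between two models is a meaningful signal.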
✅
No single benchmark tells the whole story. A model can score highly on MMLU (broad knowledge) but struggle with GPQA (deep expertise). The overall score on Evalon is a weighted composite that balances all six benchmarks, giving a more rounded view of capability.
Benchmark scores are useful but imperfect. Models can be fine-tuned specifically on benchmark-adjacent data, inflating their apparent scores without improving real-world performance. Arena ELO is often the most trustworthy signal because it's based on genuine human preferences rather than academic test sets.
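A weighted composite like the one described above is just a weighted average over normalised benchmark scores. The weights and scores below are invented for illustration — Evalon's actual weighting is not specified in this guide:

```python
# Hypothetical weights (must sum to 1.0); not Evalon's real formula.
WEIGHTS = {
    "MMLU": 0.20, "HumanEval": 0.20, "MATH": 0.15,
    "GPQA": 0.15, "MT-Bench": 0.15, "Arena ELO": 0.15,
}

def composite(normalised_scores):
    """Weighted average of benchmark scores, each normalised to a 0–100 scale."""
    return sum(WEIGHTS[name] * score
               for name, score in normalised_scores.items())

scores = {"MMLU": 88, "HumanEval": 90, "MATH": 75,
          "GPQA": 60, "MT-Bench": 92, "Arena ELO": 85}
print(round(composite(scores), 1))  # 82.4
```

Note that MT-Bench (1–10) and Arena ELO (a rating) would need rescaling to 0–100 before being combined this way — one reason composite scores should be read as rough rankings rather than precise measurements.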
06
Choosing the right model
There's no universally "best" model — only the best model for your use case. Here's a practical framework for deciding.
💻 Writing code
Look for high HumanEval scores. GPT-4.1, Claude Opus 4, and Claude Sonnet 4.6 consistently lead on coding benchmarks. For complex, multi-file projects, prefer a model with a large context window.
∑ Maths & science
Prioritise MATH and GPQA scores. o3, DeepSeek R1, and Gemini 2.5 Pro are the leaders. For step-by-step problem solving, a reasoning model that shows its working is ideal.
✍ Writing & creative tasks
MT-Bench and Arena ELO are the best proxies. Claude models are widely praised for nuanced prose. High temperature settings help with creative work.
💰 High-volume / budget
Filter by cost. GPT-4o mini, Gemini 2.0 Flash, Llama 4 Scout, and Claude Haiku offer strong performance for under $0.30/M input tokens — a fraction of flagship costs.
🔒 Privacy & self-hosting
Open-source models (Llama, DeepSeek, Mistral, Qwen) can be run entirely on your own infrastructure. Your data never leaves your servers. Useful for regulated industries like healthcare and finance.
👁 Images & documents
Filter for multimodal models. All modern GPT-4o, Gemini, and Claude Sonnet/Opus variants can read and reason about images. Gemini 1.5 Pro also supports video.
Ready to explore?
Use our tools to find and compare the right model for your needs.