In 2020, OpenAI fed 300 billion tokens of text into a neural network with 175 billion tunable parameters and spent millions of dollars on compute. Out came GPT-3, a system that could write poetry, debug code, and explain quantum physics. Nobody explicitly programmed any of those abilities. They emerged from a single, deceptively simple training objective: predict the next token. That one idea powers every large language model you've ever used. Claude Opus 4.6, GPT-5.3-Codex, Gemini 3.1 Pro, DeepSeek V4: at their mathematical core, all of them are autocomplete engines of extraordinary sophistication. They don't "know" things. They calculate the statistical plausibility of what comes next, and they do it so well that the output feels like understanding.
This article traces the full pipeline: how raw text becomes numbers, how those numbers flow through the Transformer architecture, and how a probability distribution over vocabulary tokens becomes the sentence you're reading right now. We'll use a single running example throughout: the input sentence "Large language models predict the next token" as it moves through each stage of the system.
From Text to Numbers: Tokenization and Embeddings
Neural networks operate on numbers, not characters. Converting raw text into numerical representations happens in two distinct steps: splitting text into tokens, then mapping each token to a dense vector.
Tokenization Splits Text into Subword Units
Before the model sees "Large language models predict," the text is split into chunks called tokens. A token might be a full word, a word fragment, or a single character. Most production LLMs use Byte-Pair Encoding (BPE), an algorithm that iteratively merges the most frequent character pairs into subword units. For a deep dive into BPE, WordPiece, and SentencePiece, see our article on tokenization.
Here's how tokens map in practice:
- Simple word: "apple" maps to 1 token
- Complex word: "unbelievable" maps to "un" + "believ" + "able" (3 tokens)
- Code: "print(x)" maps to "print" + "(" + "x" + ")" (4 tokens)
Why does this matter? LLMs charge per token, think per token, and have a maximum number of tokens they can hold in memory. Everything downstream depends on this step.
BPE isn't the only game anymore. Meta's Byte Latent Transformer (BLT), published in late 2024, demonstrated that byte-level models can match tokenized models at scale, eliminating the vocabulary bottleneck entirely. Meanwhile, LiteToken (February 2026) removes intermediate merge residues that standard BPE leaves behind, producing cleaner subword boundaries. The field is still actively debating whether subword tokenization is an engineering compromise or a genuine architectural strength.
The following code block shows BPE in action on a small corpus, so you can see how merge operations build a vocabulary from raw characters:
Corpus: `low` ×5, `lowest` ×2, `newer` ×6, `wider` ×3

BPE merge operations (most frequent pair merged first):

| Step | Pair | Count | New Token |
|---|---|---|---|
| 1 | `e` + `r` | 9 | `er` |
| 2 | `er` + `</w>` | 9 | `er</w>` |
| 3 | `l` + `o` | 7 | `lo` |
| 4 | `lo` + `w` | 7 | `low` |
| 5 | `n` + `e` | 6 | `ne` |
| 6 | `ne` + `w` | 6 | `new` |
| 7 | `new` + `er</w>` | 6 | `newer</w>` |
| 8 | `low` + `</w>` | 5 | `low</w>` |
| 9 | `w` + `i` | 3 | `wi` |
| 10 | `wi` + `d` | 3 | `wid` |

Final vocabulary (after 10 merges):

- `newer</w>` (count: 6)
- `low</w>` (count: 5)
- `wid er</w>` (count: 3)
- `low e s t </w>` (count: 2)
Key Insight: Notice how BPE discovered that "er" and "low" are useful building blocks shared across multiple words. The algorithm never saw these as meaningful morphemes; it found them purely through frequency statistics. This is exactly how GPT-4's tokenizer was built, just scaled to billions of text bytes.
Embeddings Map Tokens to Meaning Vectors
Once tokenized, each token is mapped to an embedding vector: a list of numbers (12,288 dimensions in GPT-3; newer frontier models use comparable or larger sizes, mostly unpublished) that represents that token's meaning in a high-dimensional space. These aren't random numbers. They're learned during training so that semantically similar words cluster together. For a thorough treatment of how these vectors power search, retrieval, and similarity, see our guide on text embeddings.
The classic example still holds:
- "Cat" is close to "Dog"
- "Paris" is close to "France"
- "King" minus "Man" plus "Woman" approximates "Queen"
That last one isn't a trick. Vector arithmetic genuinely works because the embedding space encodes semantic relationships as geometric directions. Techniques like PCA can compress these high-dimensional vectors for visualization, revealing how cleanly concepts separate.
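A toy demonstration of that arithmetic, using hand-built 4-dimensional vectors (real embeddings are learned and have thousands of dimensions; the values here are invented purely to make the geometry visible):

```python
import numpy as np

# Hand-crafted toy vectors: dims loosely encode [royalty, male, female, person].
# Invented for illustration -- these are not learned embeddings.
emb = {
    "king":  np.array([0.9, 0.8, 0.1, 0.7]),
    "queen": np.array([0.9, 0.1, 0.8, 0.7]),
    "man":   np.array([0.1, 0.9, 0.1, 0.2]),
    "woman": np.array([0.1, 0.1, 0.9, 0.2]),
}

def cosine(a, b):
    """Similarity of direction, ignoring vector length."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman lands closest to queen
target = emb["king"] - emb["man"] + emb["woman"]
best = max(emb, key=lambda w: cosine(emb[w], target))
print(best)  # queen
```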
Positional Encoding Teaches Word Order
The Transformer processes all tokens simultaneously, not left to right. So how does it know that "Dog bites man" and "Man bites dog" mean different things? Positional encoding adds a unique mathematical signal to each token's embedding that encodes its position in the sequence.
The original Transformer (Vaswani et al., 2017) used fixed sine and cosine functions at different frequencies. Modern LLMs have moved to Rotary Position Embeddings (RoPE), which encode relative positions through rotation matrices and scale to the million-token long context windows we see in 2026. Another approach, ALiBi (Attention with Linear Biases), skips encoding positions in embeddings entirely and instead biases attention scores based on token distance, achieving better extrapolation beyond the training window length.
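The original sinusoidal scheme is simple enough to compute directly. A minimal sketch of the Vaswani et al. formulation (RoPE and ALiBi replace this step in modern models):

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Fixed positional encodings: sine on even dims, cosine on odd dims,
    at geometrically spaced frequencies (Vaswani et al., 2017)."""
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]         # (1, d_model // 2)
    angles = pos / 10000 ** (2 * i / d_model)    # one frequency per dim pair
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# One row per token; this matrix is simply added to the token embeddings.
pe = sinusoidal_positions(seq_len=7, d_model=16)
print(pe.shape)  # (7, 16)
```

Because each frequency pair rotates at a different rate, every position gets a unique signature, and nearby positions get similar ones.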
The Transformer Architecture
Before 2017, language models used Recurrent Neural Networks (RNNs) that processed words sequentially, like reading a book left to right. This was painfully slow and caused a fundamental problem: by the time the model reached the end of a long passage, it had largely forgotten the beginning.
The 2017 paper "Attention Is All You Need" (Vaswani et al.) introduced the Transformer, an architecture that processes the entire sequence in parallel. This single innovation made modern LLMs possible.
*Figure: Transformer architecture showing encoder-decoder structure with attention layers*
A Transformer is built by stacking identical blocks called layers. GPT-3 has 96 layers. Each layer contains two main components: a self-attention mechanism and a feed-forward network, held together by residual connections and layer normalization.
| Component | Role | Parameter Share |
|---|---|---|
| Self-Attention | Mixes information across tokens | ~1/3 of params |
| Feed-Forward Network | Processes each token independently | ~2/3 of params |
| Layer Norm | Stabilizes activations | Negligible |
| Residual Connections | Preserves gradient flow | Zero params |
Self-Attention: How the Model Understands Context
Self-attention is the mathematical core of every Transformer. It answers one question: for each token, how relevant is every other token in the sequence?
Consider our running example: "Large language models predict the next token." When the model processes "predict," it needs to know that the subject is "models" (not "language") and that the object coming up is "token." Self-attention computes relevance scores between all pairs of tokens to build this context.
It works through three learned linear projections: Query (Q), Key (K), and Value (V).
- Query (Q): What is this token looking for? ("predict" asks: "what is my subject?")
- Key (K): What does this token offer? ("models" answers: "I'm a noun that does things")
- Value (V): The actual content to pass forward when there's a match
The attention score formula:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

Where:
- $Q$ is the query matrix (each row is one token's query vector)
- $K$ is the key matrix (each row is one token's key vector)
- $V$ is the value matrix (each row is one token's value vector)
- $d_k$ is the dimensionality of the key vectors
- $QK^\top$ produces a matrix of compatibility scores between all token pairs
- $\sqrt{d_k}$ is a scaling factor that prevents dot products from growing too large
- $\text{softmax}$ converts raw scores into probabilities that sum to 1 per row
In Plain English: Each token broadcasts a query ("what am I looking for?") and a key ("what do I represent?"). The dot product between a query and a key measures compatibility. In our sentence, the query from "predict" and the key from "models" should produce a high score because verbs attend strongly to their subjects. We scale by $\sqrt{d_k}$ because large dot products push softmax into regions where gradients vanish. Finally, we use the resulting attention weights to blend all value vectors into a single contextualized representation for each token.
*Figure: Self-attention mechanism showing query, key, value computation*
Multi-Head Attention runs this process multiple times in parallel (96 heads per layer in GPT-3; head counts for newer frontier models are unpublished). Each head can learn a different type of relationship: one captures subject-verb agreement, another tracks coreference, another picks up on sentiment. The results are concatenated and projected back to the model dimension.
Pro Tip: Think of multi-head attention like a panel of specialist editors reviewing the same paragraph simultaneously. One checks grammar, one checks factual consistency, one checks tone. Their combined feedback produces a richer understanding than any single reviewer could achieve alone.
Here's the complete self-attention computation built from scratch with NumPy:
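A minimal version of that computation looks like this (random stand-in embeddings and projection weights, so the exact numbers will differ from the listing below; the shapes and the row normalization are the point):

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = ["Large", "language", "models", "predict"]
d_model = d_k = 8

X = rng.normal(size=(len(tokens), d_model))        # stand-in embeddings
W_q = rng.normal(scale=0.3, size=(d_model, d_k))   # learned during training
W_k = rng.normal(scale=0.3, size=(d_model, d_k))
W_v = rng.normal(scale=0.3, size=(d_model, d_k))

Q, K, V = X @ W_q, X @ W_k, X @ W_v

scores = Q @ K.T / np.sqrt(d_k)                # (4, 4) compatibility matrix
scores -= scores.max(axis=1, keepdims=True)    # numerical stability
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)  # softmax: each row sums to 1
output = weights @ V                           # contextual embeddings

for tok, row in zip(tokens, weights):
    print(f"{tok:>9} -> {np.round(row, 3)}")
print("Output shape:", output.shape)           # (4, 8)
```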
Tokens: ['Large', 'language', 'models', 'predict']
Attention weights (each row = how one token attends to all others):
Large -> [0.253, 0.263, 0.228, 0.255]
language -> [0.232, 0.249, 0.269, 0.250]
models -> [0.239, 0.219, 0.285, 0.257]
predict -> [0.281, 0.259, 0.216, 0.245]
Output shape: (4, 8)
First token output (contextual embedding): [-0.0857 0.0838 0.0684 -0.0736 0.222 0.1581 -0.187 0.0539]
Key Insight: With random (untrained) weights, attention distributes nearly uniformly across all tokens. After billions of training examples, these weights sharpen dramatically: "predict" would put 60%+ of its attention on "models" and much less on "Large." Training is what teaches the Q, K, V projections to produce meaningful compatibility scores.
Feed-Forward Networks Store Factual Knowledge
After attention mixes information across tokens, each token passes independently through a Feed-Forward Network (FFN): two linear transformations with a nonlinear activation between them. These FFN layers contain roughly two-thirds of the model's total parameters and are where factual knowledge appears to be stored. Recent mechanistic interpretability research suggests individual neurons in these layers activate for specific concepts, like a "Paris is the capital of France" neuron.
Residual Connections and Layer Normalization
Two architectural choices make deep Transformers trainable at all. Residual connections add each layer's input directly to its output (a skip connection), ensuring that if a layer isn't useful for a given input, the original signal passes through unmodified. Layer normalization rescales activations to a stable range, preventing numerical explosion during training.
Stack self-attention, FFN, residuals, and layer norm together, and you get one Transformer block. GPT-3 stacks 96 of these blocks. GPT-4 uses a Mixture-of-Experts variant with an estimated 1.8 trillion total parameters, though only a fraction activate per token.
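Here's how those pieces wire together, as a minimal pre-norm block sketch (ReLU instead of GELU, an identity stand-in for attention, and random FFN weights, to keep the focus on the residual wiring):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token's activation vector to zero mean, unit scale."""
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward: expand, nonlinearity, project back."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def transformer_block(x, attention, ffn_params):
    """Pre-norm block: residual connections around attention and the FFN."""
    x = x + attention(layer_norm(x))   # attention sublayer + skip connection
    x = x + ffn(layer_norm(x), *ffn_params)
    return x

rng = np.random.default_rng(1)
d, d_ff = 8, 32
params = (rng.normal(scale=0.1, size=(d, d_ff)), np.zeros(d_ff),
          rng.normal(scale=0.1, size=(d_ff, d)), np.zeros(d))
x = rng.normal(size=(4, d))                        # 4 tokens, 8 dims
out = transformer_block(x, attention=lambda h: h,  # identity stand-in
                        ffn_params=params)
print(out.shape)  # (4, 8)
```

Stacking N of these blocks, with real multi-head attention in place of the identity stand-in, gives the full decoder.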
Mixture of Experts: Scaling Without Proportional Cost
The straightforward way to make a model smarter is to add more parameters. But inference cost scales linearly with active parameters, so a 1-trillion-parameter dense model would be prohibitively expensive to run.
Mixture of Experts (MoE) solves this by replacing each FFN layer with multiple "expert" sub-networks and a lightweight router that selects which experts process each token. Only 2 to 8 experts activate per token, while the rest stay idle.
| Model | Total Parameters | Active per Token | Architecture |
|---|---|---|---|
| GPT-3 | 175B | 175B (dense) | Standard Transformer |
| GPT-4 | ~1.8T | ~280B | MoE (estimated) |
| DeepSeek V4 | ~1T | ~32B | MoE on Huawei Ascend |
| Qwen3-235B | 235B | 22B | MoE |
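A toy top-k routing layer shows the core idea (random router and expert weights, top-2 routing; real MoE layers add load-balancing losses and expert-capacity limits):

```python
import numpy as np

rng = np.random.default_rng(2)
n_experts, top_k, d = 8, 2, 16

router_W = rng.normal(scale=0.1, size=(d, n_experts))  # learned router
expert_W = [rng.normal(scale=0.1, size=(d, d)) for _ in range(n_experts)]

def moe_layer(x):
    """Send each token through only its top-k experts, gated by router probs."""
    logits = x @ router_W                              # (tokens, n_experts)
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        chosen = np.argsort(probs[t])[-top_k:]         # top-k expert indices
        gates = probs[t, chosen] / probs[t, chosen].sum()
        for g, e in zip(gates, chosen):
            out[t] += g * (x[t] @ expert_W[e])  # 6 of 8 experts stay idle
    return out

x = rng.normal(size=(4, d))
y = moe_layer(x)
print(y.shape)  # (4, 16)
```

Each token runs through only 2 of the 8 experts, so compute per token scales with active parameters, not total parameters.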
DeepSeek V4, dropping in March 2026, pushes MoE further: roughly 1 trillion total parameters with only ~32B active per token, trained entirely on Huawei Ascend chips without any NVIDIA hardware. This demonstrates that MoE isn't just an optimization trick; it's now the dominant architecture for frontier models. Open-source and closed-source LLMs alike are converging on MoE as the standard architecture.
Common Pitfall: When people quote a model's parameter count, ask whether it's total or active. DeepSeek V4's ~1T total sounds enormous, but its ~32B active parameter footprint means inference cost is comparable to a much smaller dense model. Always compare active parameters when estimating serving costs.
Text Generation: From Probabilities to Words
The Transformer can now produce a contextualized representation for every token in the input. But how does it generate new text?
The answer is an autoregressive loop: the model predicts one token, appends it to the input, and repeats. When you see an LLM typing a response word by word, that's literally what's happening. Each token is a separate forward pass through the entire model.
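The loop itself is tiny. In this sketch a hard-coded toy function stands in for the full Transformer forward pass (vocabulary and logit values are invented for illustration):

```python
import numpy as np

vocab = ["Large", "language", "models", "predict", "the", "next", "token", "."]

def toy_forward(context_ids):
    """Stand-in for a full Transformer forward pass: returns one logit per
    vocabulary entry. Hard-coded to prefer the next word of our running
    sentence, purely to show the shape of the loop."""
    logits = np.full(len(vocab), -5.0)
    logits[min(context_ids[-1] + 1, len(vocab) - 1)] = 5.0
    return logits

ids = [0]                                        # start from "Large"
for _ in range(6):                               # each step = one forward pass
    next_id = int(np.argmax(toy_forward(ids)))   # greedy decoding
    ids.append(next_id)                          # append and repeat
print(" ".join(vocab[i] for i in ids))
# Large language models predict the next token
```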
The final layer outputs a vector of logits: raw, unnormalized scores for every token in the vocabulary (typically 50,000 to 100,000 entries). A softmax function converts these into a probability distribution:

$$P(\text{token}_i) = \frac{e^{z_i}}{\sum_{j=1}^{V} e^{z_j}}$$

Where:
- $P(\text{token}_i)$ is the probability assigned to token $i$
- $z_i$ is the raw logit (unnormalized score) for token $i$
- $V$ is the vocabulary size (total number of candidate tokens)
- $e^{x}$ is the exponential function, which amplifies differences between scores
In Plain English: If our model is completing "Large language models predict the next ___" and the logit for "token" is 8.0 while "pizza" scores 2.0, the exponential function blows up that 6-point gap into a massive probability difference. Softmax then normalizes everything so all probabilities sum to 1.0, making "token" overwhelmingly likely.
How we pick from this distribution shapes the model's personality. Greedy decoding always takes the highest-probability token (precise but repetitive). Temperature controls randomness: values below 1.0 sharpen the distribution (more deterministic), values above 1.0 flatten it (more creative). Top-P (nucleus sampling) considers only the smallest set of tokens whose cumulative probability exceeds a threshold, trimming the long tail of unlikely options.
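Those strategies fit in a few lines. A self-contained sketch (production samplers combine these with top-k filtering, repetition penalties, and more; the alternative tokens here are invented):

```python
import numpy as np

def sample(logits, temperature=1.0, top_p=1.0, rng=None):
    """Temperature-scaled softmax followed by nucleus (top-p) filtering."""
    rng = rng if rng is not None else np.random.default_rng(0)
    z = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(z - z.max())                 # stable softmax
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]             # most likely first
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), top_p)) + 1
    keep = order[:cutoff]                       # smallest set covering top_p
    return int(rng.choice(keep, p=probs[keep] / probs[keep].sum()))

# Logits from the running example: "token" vs two invented alternatives.
names = ["token", "word", "pizza"]
idx = sample([8.0, 5.0, 2.0], temperature=0.7, top_p=0.9)
print(names[idx])  # token
```

At temperature 0.7 the nucleus collapses to "token" alone; at temperature 1.5 the flattened distribution keeps "word" in the nucleus too, so outputs start to vary.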
The Training Pipeline: Three Stages
Training an LLM is a multi-stage process, each stage building on the last. The total cost for a frontier model in March 2026 ranges from $30M to over $100M in compute alone.
*Figure: LLM training pipeline from pre-training through RLHF alignment*
Stage 1: Pre-training Teaches Raw Language
The model reads trillions of tokens from web crawls, books, code repositories, and scientific papers with one objective: predict the next token. Given "Large language models predict the next," the correct answer is "token."
The model minimizes cross-entropy loss, the gap between its predicted probability distribution and the actual next token. Over trillions of examples, the weights gradually encode grammar, facts, reasoning patterns, and coding conventions, all as a side effect of next-token prediction.
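Concretely, the loss for a single prediction is just the negative log-probability assigned to the true next token, so confident correct predictions are cheap and guessing is expensive (the distributions below are invented for illustration):

```python
import numpy as np

def cross_entropy(probs, target_idx):
    """Loss for one prediction: -log P(true next token)."""
    return float(-np.log(probs[target_idx]))

# Suppose the true next token is index 2 in a tiny 4-token vocabulary.
confident = np.array([0.02, 0.03, 0.90, 0.05])  # 90% on the right answer
uniform   = np.array([0.25, 0.25, 0.25, 0.25])  # pure guessing
print(round(cross_entropy(confident, 2), 3))    # 0.105
print(round(cross_entropy(uniform, 2), 3))      # 1.386
```

Training nudges the weights so that, averaged over trillions of such predictions, this number goes down.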
| Model | Parameters | Training Tokens | Estimated Cost |
|---|---|---|---|
| GPT-3 (2020) | 175B | 300B | ~$5M |
| Llama 3.1 (2024) | 405B | 15T | ~$30M |
| GPT-4 (2023) | ~1.8T (MoE) | ~13T | ~$100M+ |
| DeepSeek V4 (2026) | ~1T (MoE) | est. 20T+ | Unknown |
Key Insight: After pre-training, the model is a brilliant text completer but a terrible assistant. Ask it "What's the capital of France?" and it might continue with "What's the capital of Germany? What's the capital of Spain?" because it's completing a list of questions, not answering yours. That's what the next two stages fix.
Stage 2: Supervised Fine-Tuning Teaches the Format
Human annotators provide thousands of high-quality (instruction, response) pairs. "When asked a question, give an answer. When asked to summarize, produce a summary." This stage doesn't teach the model new knowledge; it teaches the model to use its existing knowledge in a helpful format.
Stage 3: Alignment Teaches Human Preferences
The original approach, RLHF (Reinforcement Learning from Human Feedback), works in two steps. First, humans rank multiple model responses, and a Reward Model learns to score outputs the way humans would. Then Proximal Policy Optimization (PPO) trains the LLM to maximize that reward score.
But RLHF is expensive and unstable. The field has moved fast:
| Method | Year | Key Advantage |
|---|---|---|
| RLHF (PPO) | 2022 | Original approach, proven at scale |
| DPO | 2023 | Eliminates explicit reward model entirely |
| RLAIF | 2024 | Uses AI feedback; <$0.01 per data point vs $1+ for humans |
| GRPO | 2025 | No reward model, no PPO; used by DeepSeek |
Direct Preference Optimization (DPO) reformulates the RLHF objective so the language model itself acts as the reward model, cutting training complexity roughly in half. DeepSeek's Group Relative Policy Optimization (GRPO) goes further: it needs neither a reward model nor PPO, instead comparing groups of the model's own outputs to determine which response patterns to reinforce.
For models that go beyond pattern matching into explicit multi-step reasoning, see our coverage of reasoning models.
What LLMs Cannot Do
Understanding the limits matters as much as understanding the capabilities. Here's where the next-token prediction framework breaks down.
Hallucination. LLMs optimize for plausibility, not truth. They'll confidently generate a fake citation that looks perfect because "Author (Year). Title. Journal." is a statistically common pattern, regardless of whether that paper exists. This is an architectural limitation, not a bug that will be patched away.
No real-time knowledge. The model's knowledge is frozen at training time. It cannot access the internet, check databases, or learn from your conversation (unless explicitly augmented). This is why RAG exists: to ground LLM responses in current, retrieved evidence.
Pattern matching, not reasoning. When an LLM solves a math problem, it recognizes patterns from millions of similar problems in its training data. Change the numbers slightly outside the training distribution, and performance degrades. Chain-of-thought prompting and context engineering help, but they don't give the model genuine mathematical reasoning.
Inference cost. Every token generated requires a full forward pass through billions of active parameters. Techniques like speculative decoding (using a small draft model to propose tokens that the large model verifies in parallel) achieve up to 3x faster inference. Quantization-Aware Distillation, as used in NVIDIA's Nemotron models, achieves 4x throughput improvements. But serving a frontier LLM at scale still costs serious money.
The Frontier: State Space Models and Beyond
Transformers have a fundamental efficiency problem: self-attention scales as $O(n^2)$ with sequence length $n$, because every token must attend to every other token. For a 1-million-token context window, that's a trillion attention computations.
State Space Models (SSMs), particularly the Mamba architecture (Gu and Dao, 2023), offer an alternative. Mamba achieves $O(n)$ training complexity and linear-time inference, making it dramatically cheaper for long sequences. The catch is that pure SSMs sometimes underperform Transformers on tasks requiring precise recall over long contexts.
The current bet in the field is hybrid architectures. MoE-Mamba combines SSM efficiency with expert routing, attempting to get the best of both worlds. Whether SSMs will eventually replace Transformers or remain a complementary approach is one of the most actively debated questions in the field as of March 2026.
When to Use LLMs (and When Not To)
Not every problem needs a 100-billion-parameter model. Here's a practical decision framework:
Use an LLM when:
- The task requires broad world knowledge or language understanding
- Output quality matters more than latency (content generation, analysis, coding)
- You need flexibility across many task types without training task-specific models
- You can tolerate occasional errors and have a way to verify outputs
Skip the LLM when:
- The task is a simple lookup, classification, or regex match
- Sub-10ms latency is required (LLMs typically need 100ms+ per token)
- You need guaranteed correctness (medical dosing, financial calculations)
- Your data is highly structured and a traditional ML model achieves 99%+ accuracy
- Cost per query matters and volume is in the millions per day
Pro Tip: Many production systems use a cascade: a fast, cheap classifier handles 90% of requests, and only the ambiguous 10% get routed to an LLM. This cuts costs by an order of magnitude while maintaining quality where it counts.
Conclusion
Large language models are not magic, and they are not sentient. They are extraordinarily powerful statistical engines trained on an objective so simple it fits in one line: predict the next token. From that single idea emerges tokenization, embeddings, self-attention, feed-forward layers, and autoregressive generation, an architecture that captures enough of the structure of human language to be genuinely useful.
The practical implication is direct: once you understand that LLMs are probability machines, you stop being surprised by their failure modes. Hallucinations happen because plausible text gets high probability regardless of truth. Context engineering works because it shapes the probability distribution the model conditions on. RAG helps because it injects real evidence into that conditioning context.
The field is moving fast. MoE architectures are now standard. DPO and GRPO are replacing RLHF. State space models are challenging the Transformer's dominance for long sequences. But the core mechanism, next-token prediction through self-attention, remains the beating heart of every frontier model in March 2026. Master that mechanism, and everything else is details. For the next step, explore how text embeddings power semantic search, or jump into reasoning models to see what happens when LLMs learn to think step by step.
Interview Questions
Q: Explain the self-attention mechanism in Transformers. Why is it important?
Self-attention computes a weighted sum of all token representations in a sequence, where the weights are determined by the compatibility between each token's query and every other token's key. It's important because it allows the model to capture long-range dependencies in a single operation, unlike RNNs which degrade over distance. The $\sqrt{d_k}$ scaling prevents gradient instability as embedding dimensions grow.
Q: What is the difference between pre-training and fine-tuning in LLMs?
Pre-training teaches the model general language patterns by predicting the next token over trillions of examples from web text, books, and code. Fine-tuning then adapts this general model to a specific format (instruction-following) or domain (medical, legal) using a much smaller, curated dataset. Pre-training is expensive ($5M to $100M+); fine-tuning typically costs orders of magnitude less.
Q: Why do LLMs hallucinate, and how would you mitigate this in production?
LLMs optimize for text plausibility, not factual accuracy. A statistically likely continuation of "The 2024 Nobel Prize in Physics was awarded to" will be confident regardless of correctness. In production, the standard mitigation is Retrieval-Augmented Generation (RAG), which grounds model responses in retrieved evidence from a verified knowledge base. Additional safeguards include confidence calibration, output verification chains, and human-in-the-loop review for high-stakes outputs.
Q: What is Mixture of Experts (MoE), and why is it dominant in frontier models?
MoE replaces the standard feed-forward layer with multiple expert sub-networks and a router that selects which experts process each token. This allows a model to have trillions of total parameters (for capacity) while only activating a fraction per token (for efficiency). DeepSeek V4, for example, has ~1T total parameters but activates only ~32B per token, achieving frontier performance at a fraction of the inference cost of a dense model of equivalent total size.
Q: A user reports that your LLM-based product gives inconsistent answers to the same question. What's happening?
The model's output is sampled from a probability distribution, so with temperature above zero, different tokens get selected on different runs. To fix inconsistency: lower the temperature toward 0 for deterministic outputs, set a fixed random seed if the API supports it, or use greedy decoding. If consistency is critical (like a customer-facing FAQ), consider caching responses for identical inputs rather than calling the model each time.
Q: Compare DPO and RLHF. When would you choose one over the other?
RLHF trains a separate reward model on human preferences and then uses PPO to optimize the LLM against that reward signal. DPO eliminates the reward model by reformulating the objective so the LLM itself implicitly acts as the reward model, reducing training complexity and instability. Choose DPO when you want simpler training infrastructure and have clean preference pair data. Choose RLHF when you need fine-grained control over the reward signal or when your preference data is noisy and benefits from a dedicated reward model.
Q: Why does tokenization matter for LLM performance, and what are the tradeoffs of different vocabulary sizes?
Tokenization determines the fundamental units the model operates on. A small vocabulary (few merges) means more tokens per sentence, increasing compute cost and potentially exceeding context limits. A large vocabulary (many merges) reduces sequence length but increases the embedding table size and makes rare tokens harder to learn well. Most production models settle on 50K to 100K tokens as a balance. The choice also affects multilingual performance: vocabularies trained on English-heavy data may split non-English text into many small tokens, degrading quality.