In 2020, OpenAI fed 300 billion words into a neural network with 175 billion tunable knobs, spent millions of dollars on compute, and out came GPT-3 — a system that could write poetry, debug code, and explain quantum physics. Nobody explicitly programmed any of those abilities. They emerged from a single, deceptively simple training objective: predict the next word.
That one idea — next-token prediction — is the engine behind every Large Language Model you've used. GPT-5, Claude, Gemini, Llama — all of them are, at their mathematical core, autocomplete engines of extraordinary sophistication. They don't "know" things. They calculate the statistical plausibility of what comes next, and they do it so well that the output feels like understanding.
This article traces the full journey: how raw text becomes numbers, how those numbers flow through the Transformer architecture, and how a probability distribution becomes the sentence you're reading right now.
From text to numbers: tokenization and embeddings
Computers don't read words. They read numbers. So the first job is converting text into a format a neural network can process. This happens in two steps.
Tokenization: chopping text into pieces
Before the model sees "Hello world," the text is split into chunks called tokens. A token can be a word, part of a word, or even a single character. Most modern LLMs use Byte-Pair Encoding (BPE), which finds the most efficient way to represent text by merging frequently occurring character pairs.
- Simple word: "apple" → 1 token
- Complex word: "unbelievable" → "un", "believ", "able" (3 tokens)
- Code: "print(x)" → "print", "(", "x", ")" (4 tokens)
Why does this matter? Because LLM providers charge per token, models think per token, and every model has a maximum number of tokens it can hold in its context window. Everything downstream depends on this step. For a deeper look at how BPE, WordPiece, and SentencePiece work under the hood, see our upcoming article on Tokenization Deep Dive.
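To see subword tokenization in practice, here is a small sketch using the open-source tiktoken library. The exact splits depend on which vocabulary you load, so treat the counts below as illustrative rather than guaranteed.

```python
# Illustrative BPE tokenization using the open-source tiktoken library.
# Exact token splits depend on the vocabulary you load.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a widely used BPE vocabulary

for text in ["apple", "unbelievable", "print(x)"]:
    token_ids = enc.encode(text)
    token_strings = [enc.decode([tid]) for tid in token_ids]
    print(f"{text!r} -> {token_strings} ({len(token_ids)} tokens)")
```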
Embeddings: giving words meaning through geometry
Once tokenized, each token is mapped to a vector — a list of numbers, typically several thousand of them in modern LLMs — that represents that token's meaning in a high-dimensional space. These aren't random numbers. They're learned during training so that words with similar meanings end up close together.
- "Cat" is close to "Dog"
- "Paris" is close to "France"
- "King" - "Man" + "Woman" ≈ "Queen" (vector arithmetic actually works)
Key Insight: This geometric representation is why LLMs can handle synonyms, analogies, and even languages they weren't explicitly taught well — similar concepts cluster in similar regions of the vector space.
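Here is a minimal sketch of that geometry using hand-picked 3-dimensional toy vectors. Real models learn vectors with thousands of dimensions; these numbers are invented purely for illustration.

```python
import numpy as np

# Toy 3-D "embeddings" chosen by hand to illustrate the geometry.
# Real models learn vectors with thousands of dimensions.
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.7, 0.1, 0.1]),
    "woman": np.array([0.7, 0.1, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# "King" - "Man" + "Woman" should land near "Queen" in this toy space.
analogy = vectors["king"] - vectors["man"] + vectors["woman"]
for word, vec in vectors.items():
    print(f"similarity(analogy, {word}) = {cosine_similarity(analogy, vec):.3f}")
```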
To understand how these vectors power search and retrieval, see our upcoming article on Text Embeddings.
Positional encoding: teaching order to a parallel machine
Here's a problem: the Transformer processes all tokens simultaneously, not left-to-right. So how does it know that "Dog bites man" and "Man bites dog" mean different things? The answer is positional encoding — we add a unique mathematical signal to each token's embedding that encodes its position in the sequence.
The original Transformer used sine and cosine functions at different frequencies. Modern LLMs like Llama use Rotary Position Embeddings (RoPE), which encode relative positions more efficiently and enable the million-token context windows we see in 2026.
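For intuition, here is a sketch of the original sine/cosine scheme. Note that RoPE works differently (it rotates the query and key vectors), so this shows only the classic variant.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Classic sine/cosine positional encodings from "Attention Is All You Need"."""
    positions = np.arange(seq_len)[:, np.newaxis]            # (seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]                 # (1, d_model)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                    # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])                    # odd dimensions use cosine
    return pe

# Each position gets a unique pattern that is simply added to its token embedding.
pe = sinusoidal_positional_encoding(seq_len=8, d_model=16)
print(pe.shape)  # (8, 16)
```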
The Transformer: the architecture that changed everything
Before 2017, language models used Recurrent Neural Networks (RNNs), which processed words one at a time like reading a book left-to-right. This was painfully slow and caused a fundamental problem: by the time the model reached the end of a long sentence, it had "forgotten" the beginning.
Google's 2017 paper "Attention Is All You Need" introduced the Transformer — an architecture that processes the entire sequence at once. This single innovation made modern LLMs possible.
A Transformer is built by stacking identical blocks called layers (GPT-3 has 96 of them). Each layer has two main components:
1. Self-attention: the mechanism that understands context
Self-attention is the heart of the Transformer. It answers one question: for each word, how relevant is every other word?
Consider: "The animal didn't cross the street because it was too tired." What does "it" refer to? You instantly know it's "the animal." Self-attention learns to make this same connection by computing relevance scores between all pairs of tokens.
It works through three learned projections — Query, Key, and Value:
- Query (Q): What is this token looking for? ("it" is asking: "what noun do I refer to?")
- Key (K): What does this token represent? ("animal" answers: "I'm a noun subject")
- Value (V): The actual content to pass forward if there's a match
The attention score is computed as:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V
In Plain English:
- QKᵀ: Multiply each Query by every Key to get a "compatibility score." High score = high relevance.
- √d_k: Divide by the square root of the key dimension to keep the numbers stable (large dot products would otherwise saturate the softmax and stall learning).
- Softmax: Convert raw scores into probabilities that sum to 1. The model might assign 85% attention to "animal" and 5% each to the other words.
- · V: Use those probabilities to create a weighted blend of all Values — the final contextualized representation.
Multi-Head Attention runs this process multiple times in parallel (e.g., 96 heads in GPT-3). Each head can learn a different relationship — one captures grammar, another tracks coreference, another detects sentiment. The results are concatenated and projected back.
Pro Tip: Think of multi-head attention like a team of editors reviewing the same sentence simultaneously. One checks grammar, one checks meaning, one checks tone — and their combined feedback produces a richer understanding than any single editor could achieve alone.
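To make the "split, attend, recombine" idea concrete, here is a toy sketch in NumPy. Real models apply learned projection matrices for Q, K, V and the output; this sketch reuses slices of the input directly to stay short.

```python
import numpy as np
from scipy.special import softmax

def multi_head_attention(x, num_heads):
    """Toy multi-head attention: split the embedding into heads, attend, recombine.
    Real models use learned projections for Q, K, V and the output; here we reuse
    slices of x directly to keep the sketch short."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    heads = []
    for h in range(num_heads):
        # Each head works on its own slice of the embedding dimensions.
        Q = K = V = x[:, h * d_head:(h + 1) * d_head]
        weights = softmax(Q @ K.T / np.sqrt(d_head), axis=1)
        heads.append(weights @ V)
    return np.concatenate(heads, axis=1)   # concatenate head outputs back to d_model

x = np.random.randn(3, 8)                  # 3 tokens, embedding dimension 8
print(multi_head_attention(x, num_heads=2).shape)  # (3, 8)
```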
2. Feed-forward network: where knowledge lives
After attention mixes information across tokens, each token passes independently through a Feed-Forward Network (FFN) — two linear transformations with an activation function in between. These FFN layers contain roughly two-thirds of the model's parameters and are where factual knowledge is believed to be stored.
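Here is a minimal sketch of one FFN layer, assuming a GELU activation (common in modern models, though the exact activation and sizes vary from model to model).

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise feed-forward network: expand, apply a nonlinearity, project back.
    Applied to each token's vector independently."""
    hidden = x @ W1 + b1
    # GELU activation (tanh approximation); the exact activation varies by model.
    hidden = 0.5 * hidden * (1 + np.tanh(np.sqrt(2 / np.pi) * (hidden + 0.044715 * hidden**3)))
    return hidden @ W2 + b2

d_model, d_ff = 8, 32                      # the hidden layer is typically ~4x wider
W1 = np.random.randn(d_model, d_ff) * 0.02
b1 = np.zeros(d_ff)
W2 = np.random.randn(d_ff, d_model) * 0.02
b2 = np.zeros(d_model)

x = np.random.randn(3, d_model)            # 3 tokens
print(feed_forward(x, W1, b1, W2, b2).shape)  # (3, 8)
```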
Layer norm and residual connections: keeping things stable
Two tricks make deep Transformers trainable. Residual connections add each layer's input back to its output (a "skip connection"), so if a layer isn't helpful, the signal survives unchanged. Layer normalization keeps the numbers in a stable range, preventing the model from diverging during training.
Stack attention + FFN + residuals + layer norm = one Transformer block. Stack 96 of these blocks = GPT-3.
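Putting the pieces together, here is a self-contained sketch of one pre-norm block (the ordering most modern LLMs use). The attention and FFN here are simplified stand-ins rather than learned sublayers.

```python
import numpy as np
from scipy.special import softmax

def layer_norm(x, eps=1e-5):
    """Normalize each token's vector to zero mean and unit variance."""
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def simple_attention(x):
    """Unprojected self-attention over the token vectors (a stand-in for the real sublayer)."""
    weights = softmax(x @ x.T / np.sqrt(x.shape[-1]), axis=-1)
    return weights @ x

def simple_ffn(x, hidden_mult=4):
    """A fixed random feed-forward layer, standing in for the learned FFN sublayer."""
    rng = np.random.default_rng(0)
    d = x.shape[-1]
    W1 = rng.normal(0, 0.02, (d, d * hidden_mult))
    W2 = rng.normal(0, 0.02, (d * hidden_mult, d))
    return np.maximum(x @ W1, 0) @ W2       # ReLU kept for brevity

def transformer_block(x):
    """One pre-norm Transformer block: a residual connection around each sublayer."""
    x = x + simple_attention(layer_norm(x))  # attention sublayer + skip connection
    x = x + simple_ffn(layer_norm(x))        # feed-forward sublayer + skip connection
    return x

x = np.random.randn(3, 8)                    # 3 tokens, embedding dimension 8
print(transformer_block(x).shape)            # (3, 8)
```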
How text generation works
Now the Transformer can understand a sequence. But how does it generate text?
The answer is an autoregressive loop: the model predicts one token, appends it to the input, and repeats. When you see an LLM "typing" a response word by word, that's literally what's happening — each new token is a separate forward pass through the entire model.
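The loop itself is tiny. Here is a sketch in which next_token_probs is a hypothetical stand-in for the model's forward pass and the vocabulary is a toy one.

```python
import numpy as np

# A toy vocabulary and a stand-in for the model's forward pass.
# In a real LLM, next_token_probs would be a full Transformer forward pass.
vocab = ["The", "capital", "of", "France", "is", "Paris", "."]

def next_token_probs(token_ids):
    """Hypothetical stand-in: returns a probability distribution over the toy vocabulary."""
    rng = np.random.default_rng(len(token_ids))   # deterministic toy behaviour
    logits = rng.normal(size=len(vocab))
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

# The autoregressive loop: predict one token, append it, repeat.
generated = [0]                                   # start with "The"
for _ in range(5):
    probs = next_token_probs(generated)
    next_id = int(np.argmax(probs))               # greedy decoding
    generated.append(next_id)

print(" ".join(vocab[i] for i in generated))
```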
The final layer outputs a vector of logits — raw scores for every token in the vocabulary (typically 50,000-100,000 tokens). A softmax function converts these into probabilities:

softmax(zᵢ) = e^(zᵢ) / Σⱼ e^(zⱼ)
In Plain English: If "Paris" has a logit of 8.0 and "Pizza" has 2.0, the exponential function (e^x) amplifies that difference dramatically — making "Paris" overwhelmingly likely. Softmax ensures all probabilities sum to 1.
How we pick from these probabilities determines the model's personality. Greedy decoding always picks the highest-probability token (precise but repetitive). Temperature controls randomness — low temperature sharpens the distribution (more deterministic), high temperature flattens it (more creative). Top-P sampling considers only the smallest set of tokens whose probabilities sum to a threshold.
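Here is a small sketch showing how temperature reshapes the softmax output, using toy logits for "Paris", "Pizza", and "Rome" (the third word and all values are invented for illustration).

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw logits into probabilities; temperature rescales logits before softmax."""
    scaled = np.array(logits) / temperature
    exp = np.exp(scaled - scaled.max())        # subtract the max for numerical stability
    return exp / exp.sum()

logits = [8.0, 2.0, 1.0]                       # toy logits for "Paris", "Pizza", "Rome"
for t in (0.5, 1.0, 2.0):
    print(f"temperature={t}: {np.round(softmax_with_temperature(logits, t), 3)}")
# Low temperature sharpens the distribution; high temperature flattens it.
```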
For the full breakdown of Temperature, Top-K, Top-P, and Beam Search, see our upcoming article on LLM Sampling.
How LLMs are trained
Training an LLM happens in three stages, each building on the last.
Stage 1: Pre-training — learning language itself
The model reads trillions of tokens from the internet (Common Crawl, Wikipedia, books, code repositories) with one objective: predict the next token. Given "The capital of France is," the model should output "Paris."
The model minimizes cross-entropy loss — the gap between its predicted probability distribution and the actual next token. Over trillions of examples, the model's weights gradually encode grammar, facts, reasoning patterns, and even coding conventions — all as a side effect of next-token prediction.
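As a sketch of what that loss measures: cross-entropy is the negative log of the probability the model assigned to the token that actually came next. The numbers below are toy values.

```python
import numpy as np

# Toy next-token distribution over a 4-word vocabulary after "The capital of France is".
vocab = ["Paris", "London", "Pizza", "the"]
predicted_probs = np.array([0.70, 0.15, 0.05, 0.10])

# Cross-entropy loss = -log(probability assigned to the correct next token).
correct_index = vocab.index("Paris")
loss = -np.log(predicted_probs[correct_index])
print(f"loss = {loss:.3f}")   # ~0.357; a perfect prediction (prob 1.0) would give loss 0
```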
| Model | Parameters | Training Tokens | Estimated Cost |
|---|---|---|---|
| GPT-3 | 175B | 300B | ~$5M |
| Llama 3 | 405B | 15T | ~$30M |
| GPT-4 | ~1.8T (MoE) | ~13T | ~$100M+ |
Key Insight: After pre-training, the model is a brilliant text completer — but a terrible assistant. Ask it "What's the capital of France?" and it might respond "What's the capital of Germany?" because it's completing a list of questions, not answering yours.
Stage 2: Supervised Fine-Tuning (SFT) — learning the format
Humans provide thousands of (instruction, response) pairs to teach the model the format of being helpful: "When asked a question, give an answer. When asked to summarize, produce a summary."
Stage 3: RLHF — learning human preferences
The model generates multiple responses, humans rank them, and a Reward Model learns to score responses the way humans would. Then Proximal Policy Optimization (PPO) trains the LLM to maximize that reward score — making it more helpful, honest, and harmless.
For the full training pipeline deep dive, see our upcoming articles on RLHF and Fine-Tuning LLMs.
Code Walkthrough: Self-Attention from Scratch
Let's build the mathematical core — scaled dot-product attention — from scratch, to see exactly how Q, K, and V matrices interact.
```python
import numpy as np
from scipy.special import softmax

def self_attention(Q, K, V):
    """Scaled Dot-Product Attention (the core of every Transformer)."""
    d_k = Q.shape[1]
    # Step 1: Compatibility scores — how much does each query match each key?
    scores = np.matmul(Q, K.T)
    # Step 2: Scale by sqrt(d_k) so large dot products don't saturate the softmax
    scaled_scores = scores / np.sqrt(d_k)
    # Step 3: Softmax — convert scores to probabilities (each row sums to 1)
    attention_weights = softmax(scaled_scores, axis=1)
    # Step 4: Weighted combination of values
    output = np.matmul(attention_weights, V)
    return output, attention_weights

# Simulate: 3 tokens ("I", "love", "data"), embedding dimension = 4
np.random.seed(42)
Q = np.random.randn(3, 4)  # Queries
K = np.random.randn(3, 4)  # Keys
V = np.random.randn(3, 4)  # Values

output, weights = self_attention(Q, K, V)
print("Attention Weights (who attends to whom):")
print(np.round(weights, 3))
# Each row shows how one token distributes its attention across all tokens
# Row 0 = how "I" attends to ["I", "love", "data"]
```
Output:
```
Attention Weights (who attends to whom):
[[0.393 0.168 0.439]
 [0.231 0.283 0.486]
 [0.225 0.559 0.216]]
```
Reading the Output: Row 0 shows how token "I" distributes its attention: 39.3% to itself, 16.8% to "love", and 43.9% to "data." Token "I" draws most of its context from "data." Row 2 shows "data" puts 55.9% of its attention on "love." This weighted mixing is how every word in the model's output is informed by the full input sequence.
What LLMs cannot do
Understanding the limits is just as important as understanding the capabilities.
Hallucination. LLMs optimize for plausibility, not truth. They'll confidently generate a fake citation that looks perfect because "Author (Year). Title. Journal." is a statistically common pattern — whether or not that paper exists.
No real-time knowledge. The model's knowledge is frozen at training time. It cannot access the internet, check databases, or learn from your conversation. This is why Retrieval-Augmented Generation (RAG) exists — to augment LLMs with current, external knowledge.
Not reasoning — pattern matching. When an LLM solves a math problem, it's not "doing math." It's recognizing patterns from millions of similar problems in its training data. Change the numbers slightly outside the training distribution, and it may fail. This is why reasoning models with explicit chain-of-thought processes represent a significant evolution.
Expensive. Every token generated requires a full forward pass through billions of parameters. GPT-4 costs ~$0.03 per 1K input tokens. At scale, this adds up fast.
Conclusion
Large Language Models are not magic, and they are not sentient. They are extraordinarily powerful pattern-matching engines trained on an objective so simple it fits in one line: predict the next token. From that single idea emerges an architecture — tokenization, embeddings, self-attention, feed-forward layers, autoregressive generation — that captures enough of the structure of human language to be genuinely useful.
The deeper you understand these mechanics, the better you become at using, building on, and reasoning about these systems. You'll know why your prompt matters, why hallucinations happen, and why retrieval-augmented generation helps.
This article is the starting point. From here, the path branches: dive into context engineering to master how to communicate with LLMs, explore text embeddings to understand the vector space that powers semantic search, or jump straight into building AI agents to see what happens when you give an LLM tools and a goal.