GPT-4 can write Shakespearean sonnets, pass the bar exam, and debug complex code — but ask it "how many r's are in strawberry?" and it confidently answers two. The culprit isn't the neural network. It's the step that happens before the neural network ever sees your text: tokenization. The way a model splits text into pieces determines what it can count, what it can spell, why it struggles with arithmetic, and even how much you pay per API call. Tokenization is the invisible foundation that shapes everything a language model can and cannot do.
The preprocessing step nobody talks about
Language models don't read characters. They don't read words. They read tokens — numerical IDs that represent chunks of text. Before any transformer attention head fires, before any embedding lookup happens, raw text must be converted into a sequence of integer IDs from a fixed vocabulary.
This creates a fundamental design tension. If each character is a token (vocabulary of ~256), sequences become extremely long — a single paragraph might be 500+ tokens, and the transformer's attention mechanism scales quadratically with sequence length. If each word is a token, you need a vocabulary of hundreds of thousands of entries, and any word not in the vocabulary becomes an unknown <UNK> token — devastating for names, typos, code, or any language not well-represented in training data.
The solution that powers every major language model in 2026 is subword tokenization: break text into pieces that are larger than characters but smaller than words. Common words like "the" stay intact. Rare words like "tokenization" get split into meaningful pieces like "token" + "ization". The dominant algorithm for learning these splits is called Byte-Pair Encoding.
Byte-pair encoding: the algorithm behind every major LLM
BPE was originally a data compression algorithm (Gage, 1994), adapted for NLP by Sennrich, Haddow, and Birch (2016). The idea is elegant: start with the smallest possible units and iteratively merge the most frequent adjacent pairs.
Here's a minimal BPE training loop in Python:
```python
text = "aaabdaaabac"
tokens = list(text)
print(f"Start: {tokens} ({len(tokens)} tokens)")

for step in range(3):
    # Count all adjacent pairs
    pairs = {}
    for i in range(len(tokens) - 1):
        pair = (tokens[i], tokens[i + 1])
        pairs[pair] = pairs.get(pair, 0) + 1

    # Merge the most frequent pair
    best = max(pairs, key=pairs.get)
    merged = best[0] + best[1]
    new_tokens = []
    i = 0
    while i < len(tokens):
        if i < len(tokens) - 1 and tokens[i] == best[0] and tokens[i + 1] == best[1]:
            new_tokens.append(merged)
            i += 2
        else:
            new_tokens.append(tokens[i])
            i += 1
    tokens = new_tokens
    print(f"Merge {step+1}: {best[0]}+{best[1]} -> {merged:4s} {tokens} ({len(tokens)} tokens)")
```
```
Start: ['a', 'a', 'a', 'b', 'd', 'a', 'a', 'a', 'b', 'a', 'c'] (11 tokens)
Merge 1: a+a -> aa ['aa', 'a', 'b', 'd', 'aa', 'a', 'b', 'a', 'c'] (9 tokens)
Merge 2: aa+a -> aaa ['aaa', 'b', 'd', 'aaa', 'b', 'a', 'c'] (7 tokens)
Merge 3: aaa+b -> aaab ['aaab', 'd', 'aaab', 'a', 'c'] (5 tokens)
```
Three merges compressed 11 tokens down to 5. Real tokenizers perform tens of thousands of merges: GPT-2 learned 50,000 merges; GPT-4o learned approximately 200,000. The merge rules are saved after training and applied deterministically to new text during inference.
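To see learned merges applied at inference time, the sketch below uses OpenAI's open-source tiktoken library (assumed installed with `pip install tiktoken`) to encode and decode text with the cl100k_base vocabulary. Treat it as a quick way to inspect any encoding; the exact splits it prints come from the vocabulary you load, not from this article.

```python
# Sketch: applying a trained BPE tokenizer with tiktoken (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 / GPT-3.5 vocabulary

text = "Tokenization is the invisible foundation of language models."
ids = enc.encode(text)
print(f"{len(text)} characters -> {len(ids)} tokens")

# Show each token id alongside the text it covers
for tid in ids:
    piece = enc.decode_single_token_bytes(tid).decode("utf-8", errors="replace")
    print(f"{tid:>7}  {piece!r}")

# Decoding the ids round-trips back to the original string
assert enc.decode(ids) == text
```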
Byte-level BPE: the key innovation
The original BPE operated on characters, which still left gaps for unseen Unicode characters. Radford et al. (2019) introduced byte-level BPE in GPT-2: instead of starting with characters, start with raw bytes (a base vocabulary of exactly 256). Since any text in any language encodes to a sequence of bytes via UTF-8, byte-level BPE guarantees zero unknown tokens for any input — English, Chinese, Arabic, emoji, code, or binary data.
Modern tokenizers also use regex-based pre-tokenization to prevent merges across category boundaries. GPT-2 introduced a regex pattern that keeps contractions ("don't" → "don" + "'t"), separates numbers from letters, and prevents spaces from merging with words. GPT-4's cl100k_base and GPT-4o's o200k_base use increasingly sophisticated patterns that also handle CJK characters and non-Latin scripts.
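To get a feel for pre-tokenization, the sketch below splits a sentence with a GPT-2-style pattern using the third-party `regex` module (needed for the `\p{L}` / `\p{N}` Unicode classes). The pattern is illustrative of the approach described above, not a guaranteed copy of any production tokenizer's regex.

```python
# Illustrative sketch of regex pre-tokenization (not an exact production pattern).
import regex

# GPT-2-style pattern: split off contractions, keep a leading space attached to
# words, separate letters from digits, and group remaining symbols.
PAT = regex.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)

print(PAT.findall("I don't have 99 problems, but tokenization is 1."))
# Each chunk is then BPE-merged independently, so merges never cross these boundaries.
```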
Key Insight: BPE's merge list is a learned compression scheme tuned to the tokenizer's training data. It is not the same as the model's training data — and this mismatch is the root cause of glitch tokens, which we'll explore later.
The tokenizer landscape in 2026
Vocabulary sizes have grown dramatically over the past three years. Larger vocabularies mean shorter token sequences (less compute in self-attention) and better multilingual coverage, at the cost of larger embedding matrices.
| Model | Tokenizer | Vocab Size | Type | Release |
|---|---|---|---|---|
| GPT-5 / GPT-4o / o3 | tiktoken (o200k_base) | ~200,000 | Byte-level BPE | 2024-2025 |
| Llama 4 (Scout/Maverick) | tiktoken-based | 202,048 | Byte-level BPE | April 2025 |
| Gemini 3 / Gemma 3 | SentencePiece | 262,144 | Unigram/BPE | 2025 |
| Claude Opus 4.6 / Sonnet 4 | Proprietary BPE | ~65,536 | Byte-level BPE | 2025 |
| DeepSeek-V3 / R2 | Custom BPE | 128,000 | Byte-level BPE | Dec 2024 |
| Qwen3 | Custom BBPE | ~151,936 | Byte-level BPE | 2025 |
| Mistral (Tekken) | tiktoken-based | 131,072 | BPE | 2024 |
| Llama 3 | tiktoken-based | 128,256 | Byte-level BPE | 2024 |
| Llama 2 | SentencePiece | 32,000 | BPE + byte fallback | 2023 |
| GPT-4 / GPT-3.5 | tiktoken (cl100k_base) | ~100,277 | Byte-level BPE | 2023 |
The trend is clear: from 32K (Llama 2, 2023) to 262K (Gemini 3, 2025) — an 8x increase in three years. Tao et al. (2024) showed this isn't arbitrary: there's a log-linear relationship between vocabulary size and training loss. Llama 2's vocabulary of 32K was optimal for a 7B model, but for the 70B variant, the compute-optimal vocabulary would have been at least 216K — 7x larger than what was used.
Three tokenizer families
tiktoken (OpenAI, Rust core): The fastest option at 3-6x faster than alternatives. Inference only — no training support. Powers OpenAI models and has been adopted by Llama 3+ and Mistral's Tekken.
SentencePiece (Kudo and Richardson, 2018): Treats input as a raw character stream with no pre-tokenization, encoding spaces as the metasymbol "▁". Supports both BPE and Unigram algorithms. Particularly strong for languages without clear word boundaries (Chinese, Japanese, Thai). Powers Google's Gemini and Gemma families.
HuggingFace Tokenizers: A Rust-backed library supporting BPE, WordPiece, and Unigram. The most flexible option, with full training support, and used by thousands of open-source models.
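As a rough sketch of the training workflow, the snippet below trains a tiny byte-level BPE vocabulary with the HuggingFace tokenizers library (`pip install tokenizers`). The corpus, vocabulary size, and special tokens are placeholder choices for illustration.

```python
# Sketch: training a small byte-level BPE tokenizer with HuggingFace tokenizers.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.decoders import ByteLevel as ByteLevelDecoder

corpus = [
    "Language models read tokens, not characters.",
    "Byte-level BPE guarantees zero unknown tokens.",
    "Tokenization shapes what a model can count and spell.",
]

tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = ByteLevel()   # start from raw bytes, GPT-2 style
tokenizer.decoder = ByteLevelDecoder()

trainer = BpeTrainer(
    vocab_size=500,                           # placeholder size for the toy corpus
    special_tokens=["<|endoftext|>"],
    initial_alphabet=ByteLevel.alphabet(),    # all 256 byte symbols up front
)
tokenizer.train_from_iterator(corpus, trainer)

encoding = tokenizer.encode("Tokenization shapes models.")
print(encoding.tokens)   # learned subword pieces
print(encoding.ids)      # integer ids fed to the model
```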
Pro Tip: The Unigram algorithm (Kudo, 2018) works top-down instead of bottom-up: it starts with a large vocabulary and prunes tokens whose removal increases loss the least. Unigram recovers morphological suffixes like "-ly", "-ing", and "-tion" far more reliably than BPE, making it better for morphologically rich languages.
Five ways tokenization breaks your model
Tokenization is not a solved problem. It introduces systematic failures that affect model accuracy, fairness, and cost.
1. Arithmetic and number tokenization
Ask GPT-4 to compute 1,234 + 5,678 and it might get it wrong, not because the transformer can't do addition, but because the tokenizer splits numbers inconsistently. In older vocabularies like GPT-2's, "480" might be a single token while "481" splits into "4" + "81". The cl100k_base tokenizer used by GPT-3.5 and GPT-4 is more regular, with a dedicated token for every 1-, 2-, and 3-digit number, yet "1234" still becomes ["123", "4"], so the model never sees the individual digits aligned for column addition.
Singh and Strouse (2024) demonstrated at ICLR 2025 that right-to-left tokenization improves arithmetic accuracy by over 22 percentage points, and simply adding commas to numbers ("1,234") forces digit grouping that aligns addends correctly. LLaMA actually outperforms GPT-4 on arithmetic partly because its single-digit tokenization keeps every digit as a separate token.
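You can inspect this behavior yourself. The sketch below (assuming tiktoken is installed) prints how a chosen encoding splits a handful of numbers; whatever splits appear come from that vocabulary, not from this article.

```python
# Sketch: inspect how a tokenizer splits numbers (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for s in ["480", "481", "1234", "1,234", "1234 + 5678"]:
    pieces = [enc.decode_single_token_bytes(t).decode("utf-8") for t in enc.encode(s)]
    print(f"{s!r:>14} -> {pieces}")
# Comparing these splits against a single-digit scheme (one token per digit)
# shows why column-wise addition is hard for the model.
```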
2. The multilingual token tax
The same sentence costs dramatically different amounts to process depending on the language. Lundin et al. (2025) found tokenization premiums of 2-5x for low-resource African languages compared to English, with the cost impact amplified further by quadratic attention scaling. Arabic text requires 68% to 340% more tokens than equivalent English text, depending on the tokenizer.
This is not just an efficiency issue — it's a fairness issue. Higher fertility (tokens per word) means longer sequences, which means more compute, higher latency, and higher API costs for the same semantic content. Research shows fertility explains 20-50% of the variance in model accuracy across languages: higher fertility consistently predicts lower accuracy.
The o200k_base tokenizer used by GPT-4o and GPT-5 significantly improved non-Latin compression compared to cl100k_base, but the gap persists. Arnett et al. (2025) at NeurIPS 2025 proposed training per-language vocabularies and using superword tokenizers to reduce cross-lingual inequity.
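A rough way to measure the token tax is to compute fertility, tokens per whitespace-separated word, for parallel text under different encodings. The sketch below uses tiktoken with two placeholder sentences; substitute your own parallel data for meaningful numbers.

```python
# Sketch: compare token counts and fertility across encodings (pip install tiktoken).
# The sample strings are placeholders; use real parallel text for real measurements.
import tiktoken

samples = {
    "English": "Language models read tokens rather than words or characters.",
    "Arabic":  "النماذج اللغوية تقرأ الرموز وليس الكلمات أو الحروف.",
}

for name in ["cl100k_base", "o200k_base"]:
    enc = tiktoken.get_encoding(name)
    for lang, text in samples.items():
        n_tokens = len(enc.encode(text))
        n_words = len(text.split())
        print(f"{name:12s} {lang:8s} {n_tokens:3d} tokens, "
              f"fertility = {n_tokens / n_words:.2f} tokens/word")
```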
3. Glitch tokens: the SolidGoldMagikarp problem
In January 2023, researchers Jessica Rumbelow and Matthew Watkins discovered that asking ChatGPT to repeat the token "SolidGoldMagikarp" produced the word "distribute" instead. Other glitch tokens triggered evasive responses, gibberish, or the model claiming it "can't say that."
The root cause: mismatch between tokenizer training data and model training data. "SolidGoldMagikarp" was a Reddit username frequent enough in the tokenizer's text corpus to earn its own BPE token, but it appeared so rarely in the model's training data that the model never learned what the token means. The result is an embedding with essentially random values.
GPT-4o's o200k_base tokenizer no longer encodes "SolidGoldMagikarp" as a single token. But the problem hasn't disappeared: systematic research found that approximately 4.3% of vocabulary entries across tested models are glitch tokens. The GlitchMiner framework (AAAI 2026, arXiv:2410.15052) uses gradient-based entropy maximization to systematically mine glitch tokens in any model, and found them in GPT-4, Llama 2, Mistral, and even DeepSeek-V3.
4. Code formatting waste
Formatting elements — whitespace, indentation, newlines — consume approximately 24.5% of tokens across programming languages while providing minimal semantic value. Pan et al. (2025) showed that Java loses 14.7% and C# loses 13.2% of tokens to pure formatting overhead in raw analysis. Python is harder to optimize because indentation is syntactically meaningful.
GPT-2 tokenized each space individually. GPT-4's cl100k_base tokenizer groups 4 spaces into a single token (token ID 257) and has dedicated tokens for whitespace sequences up to 128 consecutive spaces — a sign of how much effort has gone into making code tokenization less wasteful.
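You can check how any encoding groups whitespace runs with a few lines of tiktoken; the grouping you see is whatever that vocabulary actually learned.

```python
# Sketch: see how runs of spaces are grouped by a given encoding (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for n in [1, 2, 4, 8, 16]:
    indented_line = " " * n + "return x\n"
    ids = enc.encode(indented_line)
    print(f"{n:2d} leading spaces -> {len(ids)} tokens: {ids}")
```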
5. Token boundary effects
When token boundaries in the prompt don't align with what the model expects, performance degrades dramatically. In Chinese text, misaligned token boundaries cause the probability of the correct next token to drop by up to four orders of magnitude, and accuracy drops 60-95% across models and domains.
Counterintuitively, larger models suffer more — they are more tightly fitted to their tokenizer's segmentation patterns. Microsoft's Guidance library implements "token healing," a technique that backs up the prompt by removing partial tokens and re-samples continuations that match the removed text, re-aligning boundaries.
Common Pitfall: Prompting with `<a href="http:` won't produce `//` as the next characters, because `://` is a single token (token ID 1129 in cl100k_base) but your prompt forced `:` to be tokenized separately. The model doesn't know how to continue from a boundary that never occurs in training data. Token healing fixes this.
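The sketch below illustrates the core idea behind token healing, not Guidance's actual implementation: trim the final token from the prompt, then restrict the first generated token to vocabulary entries whose bytes begin with the removed text.

```python
# Minimal sketch of the idea behind token healing (pip install tiktoken).
import tiktoken

def healing_constraint(prompt: str, encoding_name: str = "cl100k_base"):
    enc = tiktoken.get_encoding(encoding_name)
    ids = enc.encode(prompt)
    if not ids:
        return prompt, None

    *kept, last = ids
    removed = enc.decode_single_token_bytes(last)

    # Allowed first tokens: every vocabulary entry that begins with the removed bytes.
    allowed = []
    for tid in range(enc.n_vocab):
        try:
            piece = enc.decode_single_token_bytes(tid)
        except KeyError:
            continue  # reserved ids with no byte string
        if piece.startswith(removed):
            allowed.append(tid)

    return enc.decode(kept), allowed

trimmed, allowed = healing_constraint('<a href="http:')
print(f"Generate from {trimmed!r}, restricting the first sampled token "
      f"to {len(allowed)} candidates that start with the removed text.")
```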
The vocabulary size tradeoff
Choosing a vocabulary size is one of the most consequential decisions in building a language model:
In Plain English: Every time the model predicts the next token, it computes a probability over the entire vocabulary. A 262K vocabulary (Gemini 3) means 262,144 softmax computations per token — 8x more than Llama 2's 32K vocabulary. But because larger vocabularies produce shorter sequences, the total compute often decreases: fewer tokens means fewer attention computations, which scale quadratically.
Smaller vocabularies (32K, like Llama 2): more subword splitting, longer sequences, but smaller embedding matrices and more consistent digit-by-digit number tokenization.
Larger vocabularies (200K+, like GPT-5 and Llama 4): shorter sequences, better multilingual coverage, but larger embedding matrices. A 200K vocabulary with 4096-dimensional embeddings uses ~800M parameters just for the embedding table.
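The arithmetic is easy to sanity-check. The sketch below uses illustrative embedding dimensions rather than any specific model's published configuration.

```python
# Quick check of embedding-table size vs. vocabulary size.
# The d_model value is illustrative, not any particular model's configuration.
def embedding_params(vocab_size: int, d_model: int) -> int:
    return vocab_size * d_model

for vocab in [32_000, 128_256, 200_000, 262_144]:
    millions = embedding_params(vocab, d_model=4096) / 1e6
    print(f"vocab {vocab:>7,} x d_model 4096 = {millions:7.1f}M embedding parameters")
# Tied input/output embeddings pay this once; untied models pay it twice.
```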
The scaling law from Tao et al. (2024) at NeurIPS 2024 showed that increasing vocabulary from 32K to 43K improved ARC-Challenge accuracy from 29.1 to 32.0 with the same compute budget — a free performance gain from better tokenization alone.
Beyond subwords: the rise of byte-level models
The most exciting development in tokenization is the push to eliminate it entirely. If models could process raw bytes, every problem described above — arithmetic splits, multilingual inequality, glitch tokens, boundary effects — would disappear.
ByT5 (Xue et al., 2022) proved the concept: a standard transformer processing byte sequences can be competitive with token-level models. But the cost was steep — byte sequences are 4-5x longer, making attention prohibitively expensive.
MegaByte (Yu et al., 2023) from Meta introduced a two-level architecture: a large "global" transformer processes fixed-size patches of bytes, while a smaller "local" transformer handles individual bytes within each patch. This achieves sub-quadratic scaling for million-byte sequences.
SpaceByte (Slagle, 2024, NeurIPS 2024) took a smarter approach to patching: instead of fixed-size patches, it applies the larger transformer blocks only after space characters — natural word boundaries. SpaceByte matched subword transformer performance on English text and code.
The breakthrough came in December 2024 with Meta's Byte Latent Transformer (BLT) (Pagnoni et al., 2024). BLT uses entropy-based dynamic patching: a small byte-level language model computes next-byte entropy, and patch boundaries are placed where the next byte is hardest to predict. Simple, predictable regions (common words) get large patches requiring little compute; complex regions (rare words, code, numbers) get small patches with more compute.
The results: BLT matches Llama 3 at the 8B parameter scale while using up to 50% fewer inference FLOPs. BLT is also inherently robust to character-level perturbations — typos, spelling variations, and novel words don't cause catastrophic failures the way they can with subword tokenizers.
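To make the patching rule concrete, here is a schematic sketch of entropy-driven patch boundaries. The `next_byte_entropy` function is a stand-in for BLT's small byte-level language model (here faked with a frequency heuristic), and the threshold is an arbitrary placeholder; this illustrates the idea, not Meta's implementation.

```python
# Schematic sketch of entropy-based dynamic patching (illustration only).
import math
from collections import Counter

def next_byte_entropy(prefix: bytes, corpus: bytes) -> float:
    """Placeholder byte LM: entropy of the next-byte distribution observed
    after the last byte of `prefix` anywhere in a reference corpus."""
    if not prefix:
        return 8.0
    follow = Counter(
        corpus[i + 1]
        for i in range(len(corpus) - 1)
        if corpus[i] == prefix[-1]
    )
    total = sum(follow.values())
    if total == 0:
        return 8.0
    return -sum((c / total) * math.log2(c / total) for c in follow.values())

def entropy_patches(text: bytes, corpus: bytes, threshold: float = 2.0) -> list[bytes]:
    """Start a new patch wherever the next byte is hard to predict."""
    patches, current = [], bytearray()
    for i, b in enumerate(text):
        if current and next_byte_entropy(text[:i], corpus) > threshold:
            patches.append(bytes(current))
            current = bytearray()
        current.append(b)
    if current:
        patches.append(bytes(current))
    return patches

corpus = b"the theory of the thermometer and the theatre"
print(entropy_patches(b"the thermometer", corpus))
```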
The 2025-2026 frontier: rethinking tokenization
Even within the subword paradigm, 2025 has brought significant innovations:
SuperBPE (COLM 2025): A two-pass BPE that first learns standard tokens, then learns cross-word "superword" tokens. SuperBPE produces 33% fewer tokens and improves average performance by 4.0% across 30 benchmarks, with an 8.2% gain on MMLU — just from better tokenization.
BoundlessBPE (COLM 2025): Relaxes the pre-tokenization boundary constraint entirely, allowing merges across word boundaries. Achieves up to 15% improvement in bytes per token.
LiteToken (February 2026): Identifies and removes "intermediate merge residues" — tokens that are frequent during BPE training but rarely appear in the final tokenized output. These residue tokens waste vocabulary slots and cause unnecessary fragmentation. LiteToken is plug-and-play: it works with any existing tokenizer.
Dynamic tokenization is also gaining traction. ADAT (NeurIPS 2024) iteratively refines the vocabulary based on model feedback during training. Retrofitting LLMs with Dynamic Tokenization (ACL 2025) enables flexible tokenization that adapts post-training, reducing inference FLOPs by choosing granularity adaptively.
On the theoretical side, Gastaldi et al. (2025) at ICLR 2025 established the first formal mathematical foundations of tokenization, proving necessary and sufficient conditions for tokenizer consistency. Meanwhile, Rajaraman et al. (2024) at NeurIPS 2024 proved that transformers cannot learn k-th order Markov sources without tokenization but can with it — the first theoretical justification for why tokenization helps rather than just compresses.
Practical considerations: cost, speed, and choosing a tokenizer
Tokenization directly affects your API bill
All major LLM providers charge per token. Since different tokenizers produce different token counts for the same text, the choice of model affects cost independently of the model's quality:
- A prompt of 1,000 English words is roughly 1,300 tokens with GPT-4o's o200k_base but could be 1,500+ tokens with a smaller vocabulary tokenizer
- Non-English text shows even larger differences: the same Arabic paragraph might cost 3x more to process with DeepSeek than with Qwen
Output tokens cost 2-5x more than input tokens across all major providers. Since tokenization affects both input length (your prompt) and output length (the model's response), a model with a more efficient tokenizer for your language can save significant money at scale.
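A back-of-the-envelope cost estimator takes only a few lines. The per-token prices below are placeholders; substitute your provider's current rate card.

```python
# Sketch: estimate request cost from token counts (pip install tiktoken).
# Prices are placeholders; check your provider's current rate card.
import tiktoken

PRICE_PER_1M_INPUT = 2.50    # placeholder, USD per million input tokens
PRICE_PER_1M_OUTPUT = 10.00  # placeholder, USD per million output tokens

def estimate_cost(prompt: str, expected_output_tokens: int,
                  encoding: str = "o200k_base") -> float:
    enc = tiktoken.get_encoding(encoding)
    input_tokens = len(enc.encode(prompt))
    return (input_tokens * PRICE_PER_1M_INPUT
            + expected_output_tokens * PRICE_PER_1M_OUTPUT) / 1_000_000

prompt = "Summarize the following report in three bullet points: ..."
print(f"Estimated cost: ${estimate_cost(prompt, expected_output_tokens=300):.6f}")
```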
Pro Tip: Prompt caching (available from OpenAI, Anthropic, and Google) gives up to 90% discounts on repeated input tokens. Combined with a large-vocabulary tokenizer that produces fewer tokens, you can achieve 60-80% cost reductions on production workloads.
Multimodal tokenization: beyond text
Modern multimodal models tokenize images, audio, and video alongside text. Images are typically divided into fixed-size patches (e.g., 16x16 pixels), each becoming a single token — the same approach as ViT (Dosovitskiy et al., 2021). Audio uses neural codecs like SoundStream and EnCodec that convert continuous waveforms into discrete token sequences.
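The patch arithmetic is simple to work out. The sketch below assumes the common 224x224 input and 16x16 patch defaults; actual models vary.

```python
# Sketch: how many tokens a ViT-style patcher produces for an image,
# assuming the common 224x224 input and 16x16 patch defaults.
def image_token_count(height: int, width: int, patch: int) -> int:
    return (height // patch) * (width // patch)

print(image_token_count(224, 224, 16))    # 14 * 14 = 196 image tokens
print(image_token_count(1024, 1024, 16))  # 64 * 64 = 4096 tokens for a larger image
```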
Llama 4 uses "early fusion" — text and vision tokens are integrated into a unified backbone and jointly pre-trained. Adaptive Patch Transformers (2025) go further by using multiple patch sizes within the same image: larger patches for homogeneous sky regions, smaller patches for detailed faces. This can reduce memory requirements by up to 99.8% compared to raw pixel tokenization.
Conclusion
Tokenization is the most underappreciated component of the language model stack. Every problem you've encountered with LLMs — arithmetic failures, "how many r's in strawberry" mistakes, inflated API costs for non-English text, mysterious glitch token behavior — traces back to how text gets split into tokens before the model ever processes it.
The field is at an inflection point. BPE has served us well since 2016, and innovations like SuperBPE and LiteToken are pushing the subword paradigm further. But byte-level models like Meta's BLT have demonstrated that tokenization-free architectures can match tokenized models at scale while eliminating entire categories of failure modes. The question is no longer whether models can work without tokenizers, but when the transition happens.
For practitioners, the immediate takeaway is that tokenization is a first-class design decision. The choice of tokenizer affects model accuracy, multilingual fairness, inference cost, and even which tasks your model can reliably perform. Understanding tokenization isn't optional — it's foundational.
To build on this understanding, explore how tokenization connects to the broader LLM pipeline: How Large Language Models Actually Work covers the transformer architecture that processes tokens, Text Embeddings: The Foundation of Semantic Search explains how tokens become vectors, and Context Engineering: From Prompts to Production shows how to work within token limits effectively. For the frontier of model intelligence built on top of tokenization, see Reasoning Models: How AI Learned to Think Step by Step.