BPE Training
Intuition
Subword tokens are not designed by hand — they are discovered from data. Byte-Pair Encoding starts with the smallest possible vocabulary (just the individual characters) and grows it greedily: find the two symbols that sit next to each other most often across the whole corpus, glue them into a new symbol, and repeat. The merges that survive are exactly the chunks the language uses most, which is why BPE vocabularies end up full of common prefixes, suffixes, and whole frequent words.
This lesson runs the training loop that produces the merge table. The companion Tokenization lesson then applies that table to encode new words — and it uses the very rules learned here.
How It Works
The corpus is a handful of words, each with a frequency (how often it occurs). Every word begins as a list of single characters. Then, on each iteration:
- Count every adjacent pair of symbols across all words, weighting each occurrence by its word's frequency — a pair inside a word seen 12 times contributes 12, not 1.
- Pick the most frequent pair.
- Merge it everywhere it appears, replacing the two symbols with one fused symbol, and record the rule.
The recorded rules are ordered — the first merge learned is the highest-priority rule. That ordering is what the encoder later replays. The loop stops after a fixed number of merges; in practice you stop at a target vocabulary size.
Step By Step
The default run learns 4 merges from this corpus:
| word | frequency |
|---|---|
hug | 10 |
pug | 5 |
pun | 12 |
bun | 4 |
hugs | 5 |
- Start. Every word becomes characters:
h u g,p u g,p u n,b u n,h u g s. - Merge 1. Counting pairs:
(u,g)appears inhug(10) +pug(5) +hugs(5) = 20, the highest. Merge(u,g) → ug. Nowh ug,p ug,p u n,b u n,h ug s. - Merge 2. Recount.
(u,n)=pun(12) +bun(4) = 16 wins. Merge(u,n) → un. Nowh ug,p ug,p un,b un,h ug s. - Merge 3.
(h,ug)=hug(10) +hugs(5) = 15 wins. Merge(h,ug) → hug. Nowhug,p ug,p un,b un,hug s. - Merge 4.
(p,un)=pun(12) wins. Merge(p,un) → pun. Nowhug,p ug,pun,b un,hug s. - Done. The learned rules, in priority order, are
ug, un, hug, pun. Together with the seven base characters (b g h n p s u) they form an 11-token vocabulary.
Watch the tally panel in the middle: each count beat ranks the candidate pairs and the winner lights up, then fuses across every word that contains it on the next beat. The list on the right is the table that the Tokenization lesson reads.
Complexity
| Count pairs per iteration | O(total symbols) |
| Iterations | O(merges) |
| Overall | O(merges · total symbols) |
Real implementations cache pair counts and update them incrementally instead of recounting from scratch, but the result is identical.
Edge Cases
- Frequency ties. This run stops at 4 merges on purpose: the 5th candidate
is a tie —
(p,ug)and(hug,s)both score 5. Real BPE breaks ties with a fixed rule (e.g. lexicographic order); we stop before it so every beat has an unambiguous winner. - A pair spans many words.
(u,g)is counted acrosshug,pug, andhugstogether — frequencies add up. Merging is global, not per-word. - Merging fuses every occurrence. When a rule fires it replaces the pair in all words at once, not just the first one found.
- Order matters. Learn the merges in a different order and you get a different vocabulary. The priority ranking is part of the trained model.
Common Mistakes
- Counting pairs unweighted. Word frequency is part of the count, not a
tiebreaker. Unweighted, the second merge is a tie —
(u,n)and(h,ug)each occur in two words — and only the frequency weight (16 vs 15) makes(u,n)win it. Drop the weighting and the merge order changes. - Merging inside one word only. A merge applies to the whole corpus; the rule is global.
- Confusing training with encoding. Training produces the ranked rules; encoding replays them on a new word. Same merges, opposite direction.
- Expecting whole words as tokens. Even
hugs, a corpus word, never becomes a single token here — only the pieces the loop had time to learn.
A Note on Simplification
This lesson trains on a five-word toy corpus and learns four character-level merges, so the entire process fits on one screen. Real tokenizers train on millions of words, operate on raw bytes (not characters), learn tens of thousands of merges, and add tie-breaking rules, word-boundary markers, and special tokens. The core loop shown here — count adjacent pairs, merge the most frequent, repeat — is exactly what those systems scale up. Treat this as an explainer, not a production tokenizer trainer.