Tokenization (BPE Encoding)

Intuition

A language model never sees your text. Before any neural network runs, the text is chopped into tokens — and each token is just an integer index into a fixed vocabulary. Tokenization is the translation layer between the words you type and the numbers the model actually consumes.

Modern models use subword tokens: not whole words, not single characters, but reusable chunks somewhere in between. The word hugging might become hug + ging; a rare word you have never trained on still splits into pieces the model knows. Byte-Pair Encoding (BPE) is the algorithm that decides where those splits fall. This lesson runs the encoding half — applying an already-learned set of merge rules to a word. (Where those rules come from is the companion BPE Training lesson.)

How It Works

BPE encoding starts pessimistically: every character is its own token. Then it repeatedly glues two adjacent tokens together, following a ranked list of merge rules learned from a corpus. Each rule says "whenever you see this pair next to each other, fuse them." The rules are ordered by priority — rule #1 was the most useful merge during training, so it always fires first.

On every pass the algorithm scans the current tokens, finds the adjacent pair whose rule has the best (lowest) rank, and merges it everywhere it occurs. Then it scans again. When no adjacent pair matches any rule, encoding is done. Finally each surviving token is looked up in the vocabulary to get its integer ID.

The key property: BPE is greedy by rank, not left-to-right. It does not merge the first pair it sees — it merges the highest-priority pair available, even if that means skipping over earlier positions. Watch the merge table on the left light up out of order.

Step By Step

The default input is the word hugs, encoded with these four learned rules:

#rule
1(u, g) → ug
2(u, n) → un
3(h, ug) → hug
4(p, un) → pun
  1. Start. hugs becomes four character tokens: h u g s.
  2. Scan. The adjacent pairs are (h,u), (u,g), (g,s). Only (u,g) matches a rule — rule #1 — so it wins.
  3. Merge. u and g fuse into one wider cell: h ug s.
  4. Scan. The pairs are now (h,ug) and (ug,s). (h,ug) matches rule #3. Notice rule #2 (u,n) is skipped — there is no (u,n) here, so the algorithm jumps to the best rule that does apply.
  5. Merge. h and ug fuse: hug s.
  6. Settle. The only pair left is (hug,s), which matches no rule. Encoding stops.
  7. Tokens. The final tokens are [hug, s] — two tokens, not one. Even though hugs was a whole word in the training corpus, BPE rebuilt it from the merges it learned, and it only ever learned to build hug.
  8. IDs. Each token is replaced by its vocabulary index: hug → 9, s → 5. The model receives [9, 5].

Watch the width of each cell: a single character is narrow, a fused chunk is wide. The growing cells are the subwords forming.

Complexity

Tokens scanned per passO(n)
PassesO(merges applied)
OverallO(n · m) for n symbols, m merges applied

Production tokenizers make this far faster with priority queues, but the behaviour is identical: scan, pick the best-ranked pair, merge, repeat.

Edge Cases

  • Unseen words still tokenize. Try bug — it never appears in the training corpus, yet it encodes cleanly to [b, ug]. This is the whole point of subwords: there is no "unknown word," only smaller known pieces.
  • A word can stay in many pieces. pugs ends as [p, ug, s] — three tokens — because no rule fuses p with ug. Token count is not word count.
  • A single character is already a token; nothing to merge.
  • Rank, not position, decides. When two different rules could fire, the lower-ranked (earlier-learned) one always wins, regardless of where it sits.

Common Mistakes

  • Thinking one word = one token. hugs is two tokens here; in real models a short common word is often one token and a long rare word is many. Billing and context limits are counted in tokens, not words.
  • Merging left-to-right. BPE is greedy by rank. Applying the leftmost matching pair first can produce a different, wrong tokenization.
  • Confusing the token string with its ID. hug is a token; 9 is its vocabulary index. The model only ever works with the integers.
  • Assuming characters map to fixed IDs. A character's ID depends on the vocabulary the tokenizer was trained with — there is no universal table.

A Note on Simplification

This lesson uses a deliberately tiny, character-level BPE with a four-rule merge table and an 11-token vocabulary, so the whole process fits on one screen. Real tokenizers (GPT, Llama, etc.) operate on raw bytes, learn tens of thousands of merges, attach a leading-space marker to words, and add special tokens. The mechanism shown here — start from atoms, greedily merge the best-ranked adjacent pair, map to integer IDs — is exactly what those systems scale up. Treat this as an explainer, not a substitute for a production tokenizer's edge cases.