Build Your First LLM from Scratch · Part 3 · Section 2 of 13

Building the Vocabulary

A vocabulary is simply a mapping from words to numbers. Every word the model knows gets a unique ID.

Our Vocabulary

For our calculator, we manually list all 36 entries (33 words plus 3 special tokens):

vocabulary = {
    # Special tokens
    "[PAD]": 0,    # Padding for batch processing
    "[START]": 1,  # Start of sequence
    "[END]": 2,    # End of sequence

    # Numbers 0-19
    "zero": 3, "one": 4, "two": 5, "three": 6, "four": 7,
    "five": 8, "six": 9, "seven": 10, "eight": 11, "nine": 12,
    "ten": 13, "eleven": 14, "twelve": 15, "thirteen": 16,
    "fourteen": 17, "fifteen": 18, "sixteen": 19, "seventeen": 20,
    "eighteen": 21, "nineteen": 22,

    # Tens
    "twenty": 23, "thirty": 24, "forty": 25, "fifty": 26,
    "sixty": 27, "seventy": 28, "eighty": 29, "ninety": 30,

    # Operations
    "plus": 31, "minus": 32, "times": 33, "divided": 34, "by": 35,
}

Each word becomes its ID: "two" → 5, "plus" → 31, "three" → 6.
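
You can check this by looking the words up directly in the dictionary above; no tokenizer is needed yet:

words = "two plus three".split()
ids = [vocabulary[w] for w in words]
print(ids)  # [5, 31, 6]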

Special tokens: [START] and [END] mark sequence boundaries—our tokenizer adds these automatically. The model learns that [END] means "stop generating." [PAD] fills shorter sequences when batching.
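
As a sketch of how those special tokens come into play, here is a minimal encode function built on the vocabulary above. The name encode and the max_len parameter are illustrative, not necessarily the API our tokenizer will use later:

def encode(text, max_len=None):
    # Map each word to its ID using the vocabulary above.
    ids = [vocabulary[w] for w in text.lower().split()]
    # Mark the sequence boundaries.
    ids = [vocabulary["[START]"]] + ids + [vocabulary["[END]"]]
    # Pad shorter sequences so they can be batched together.
    if max_len is not None:
        ids += [vocabulary["[PAD]"]] * (max_len - len(ids))
    return ids

print(encode("two plus three"))
# [1, 5, 31, 6, 2]
print(encode("two plus three", max_len=8))
# [1, 5, 31, 6, 2, 0, 0, 0]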

At Scale: BPE Tokenization

Real models like GPT-4 and LLaMA use Byte Pair Encoding (BPE) to build their vocabularies automatically, typically with roughly 30,000 to 130,000 tokens:

  • Words split into subwords: "unhappiness" → ["un", "happi", "ness"]
  • Handles any word, any language, even misspellings
  • Vocabulary learned from training corpus, not manually created
  • Tools: sentencepiece, tiktoken, Hugging Face tokenizers

Why subwords? With word-level tokens, "running" and "runs" are completely different. With subwords, both share "run" and the model learns they're related.
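
To see BPE in action, here is a small sketch using tiktoken, one of the tools listed above. The exact subword boundaries depend on the merges learned from the training corpus, so the splits you get may differ from the illustration above:

# pip install tiktoken
import tiktoken

# BPE vocabulary used by several OpenAI models (~100k tokens).
enc = tiktoken.get_encoding("cl100k_base")

for word in ["unhappiness", "running", "runs"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]
    print(word, ids, pieces)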