Build Your First LLM from Scratch
Part 1 · Section 3 of 9
Step 1: Tokenization

Computers don't understand words—they only understand numbers. So we need to convert text into numbers. We do this by creating a vocabulary: a list of all words the model knows, where each word gets a unique ID.
For our calculator, the vocabulary might look like:
{ "zero": 0, "one": 1, "two": 2, "three": 3, ... "plus": 12, "minus": 13, ... }
Now we can convert our input:
"two plus three"
↓
["two", "plus", "three"] → split into words
↓
[2, 12, 3] → look up each word's ID
Each word becomes its ID from the vocabulary. The model never sees "two"; it only sees the number 2.
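The split-and-look-up steps can be sketched in a few lines of Python. This is a minimal illustration, not necessarily the implementation used later in this series: the function name `tokenize` is an assumption, and the vocabulary contains only the entries shown above, with the elided ones left out.

```python
# Partial vocabulary: only the entries shown in the text.
# The entries elided with "..." are intentionally not filled in.
vocab = {"zero": 0, "one": 1, "two": 2, "three": 3, "plus": 12, "minus": 13}


def tokenize(text, vocab):
    """Split text into words on whitespace, then map each word to its ID."""
    words = text.split()  # "two plus three" -> ["two", "plus", "three"]
    # A word that is not in the vocabulary raises KeyError.
    return [vocab[word] for word in words]


print(tokenize("two plus three", vocab))  # [2, 12, 3]
```

Looking up a word that is missing from the vocabulary raises a `KeyError` in this sketch, which reflects the point above: the model can only work with words that have an ID.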