Build Your First LLM from Scratch · Part 1 · Section 3 of 9

Step 1: Tokenization

[Figure: text being split into tokens and converted to numbers]

Computers don't understand words; they only work with numbers. So before the model can read any text, we need to convert that text into numbers. We do this by creating a vocabulary: a list of every word the model knows, with a unique ID assigned to each word.

For our calculator, the vocabulary might look like:

{ "zero": 0, "one": 1, "two": 2, "three": 3, ... "plus": 12, "minus": 13, ... }
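One way to build a mapping like this, sketched in Python, is to enumerate a word list so that each word's ID is simply its position in the list. The `words` list below is a shortened stand-in (an assumption for illustration): it covers only the first four entries shown above and leaves out everything the "..." elides.

```python
# Hypothetical, shortened word list: only the first four entries shown above.
words = ["zero", "one", "two", "three"]

# Assign each word its position in the list as its ID.
vocab = {word: idx for idx, word in enumerate(words)}

# IDs are unique as long as the word list contains no duplicates.
assert len(vocab) == len(words)

print(vocab)  # {'zero': 0, 'one': 1, 'two': 2, 'three': 3}
```

Using list position as the ID keeps the IDs unique and contiguous, which is convenient later when an ID is used as an index.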

Now we can convert our input:

"two plus three"
 ↓
["two", "plus", "three"]  → split into words
 ↓
[2, 12, 3]                → look up each word's ID

Each word becomes its ID from the vocabulary. The model never sees "two"—it only sees the number 2.
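The split-then-look-up procedure can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the `tokenize` function name, the lowercasing step, and the error raised for unknown words are assumptions, and `VOCAB` contains only the entries shown in the text (the elided ones are left out).

```python
# Partial vocabulary: only the entries shown in the text.
# Entries elided with "..." are deliberately left out.
VOCAB = {"zero": 0, "one": 1, "two": 2, "three": 3, "plus": 12, "minus": 13}


def tokenize(text, vocab=VOCAB):
    """Split text on whitespace and look up each word's ID."""
    # Lowercasing is an assumption so that "Two" and "two" map to the same ID.
    words = text.lower().split()

    # Fail loudly on words the vocabulary does not contain.
    unknown = [w for w in words if w not in vocab]
    if unknown:
        raise KeyError(f"words not in vocabulary: {unknown}")

    return [vocab[w] for w in words]


print(tokenize("two plus three"))  # [2, 12, 3]
```

Raising an error on unknown words is one simple choice for a closed vocabulary like this calculator's; it makes a typo in the input visible instead of silently dropping it.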
