Step 1: Tokenization

[Figure: Tokenization factory illustration showing text being split into tokens and converted to numbers. Caption: Words go in, numbers come out.]

Think of this as the first station in our factory. Raw text arrives on the conveyor belt, and this machine stamps each word with its ID number.

Computers don't understand words; they only work with numbers, so the text has to be converted into numbers first. We do this by creating a vocabulary: a list of all the words the model knows, where each word gets a unique ID.

For our calculator, the vocabulary maps words to numbers:

Word    ID    Word    ID    Word    ID
zero    0     one     1     two     2
three   3     four    4     five    5
plus    12    minus   13    times   14
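
In code, a vocabulary like this is commonly stored as a dictionary from word to ID. The following is a minimal sketch rather than the calculator's actual implementation: the name `VOCAB` is an assumption, and only the entries shown in the table are included.

```python
# Hypothetical name; holds only the entries listed in the table above.
VOCAB = {
    "zero": 0, "one": 1, "two": 2,
    "three": 3, "four": 4, "five": 5,
    "plus": 12, "minus": 13, "times": 14,
}

# Each word must have its own ID; a shared ID would make
# two different words look identical to the model.
assert len(set(VOCAB.values())) == len(VOCAB)

print(VOCAB["plus"])  # 12
```

A dictionary makes the word-to-ID lookup a single step, which is all this station of the factory needs.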

Now we can convert our input through the tokenizer:

Input:   "two plus three"
Split:   ["two", "plus", "three"]
Lookup:  [2, 12, 3]

Each word becomes its ID from the vocabulary. The model never sees "two"—it only sees the number 2.
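
The split and lookup steps can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the calculator's actual tokenizer: the names `VOCAB` and `tokenize` are hypothetical, the input is assumed to be lowercase words separated by spaces, and raising an error on unknown words is a choice made here because the text does not say how they are handled.

```python
# Hypothetical vocabulary, limited to the entries shown in this section.
VOCAB = {
    "zero": 0, "one": 1, "two": 2,
    "three": 3, "four": 4, "five": 5,
    "plus": 12, "minus": 13, "times": 14,
}


def tokenize(text: str) -> list[int]:
    """Split text on whitespace and replace each word with its vocabulary ID."""
    words = text.split()  # Split step: "two plus three" -> ["two", "plus", "three"]

    # Assumption: words outside the vocabulary are rejected with an error.
    unknown = [word for word in words if word not in VOCAB]
    if unknown:
        raise ValueError(f"words not in vocabulary: {unknown}")

    return [VOCAB[word] for word in words]  # Lookup step


print(tokenize("two plus three"))  # [2, 12, 3]
```

Everything after this point in the pipeline works with the returned list of integers, never with the original strings.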
