Step 1: Tokenization

Think of this as the first station in our factory. Raw text arrives on the conveyor belt, and this machine stamps each word with its ID number.
A model can't work with words directly; it can only compute on numbers, so the text has to be converted into numbers first. We do this by creating a vocabulary: a list of all the words the model knows, where each word gets a unique ID.
For our calculator, the vocabulary maps words to numbers:
| Word | ID | Word | ID | Word | ID |
|---|---|---|---|---|---|
| zero | 0 | one | 1 | two | 2 |
| three | 3 | four | 4 | five | 5 |
| plus | 12 | minus | 13 | times | 14 |
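In code, a vocabulary like this is often just a lookup table. Here is a minimal Python sketch, assuming a plain dictionary is enough; the names `VOCAB` and `ID_TO_WORD` are made up for illustration, and only the entries shown in the table above are included.

```python
# Hypothetical vocabulary: only the words listed in the table above.
VOCAB = {
    "zero": 0, "one": 1, "two": 2,
    "three": 3, "four": 4, "five": 5,
    "plus": 12, "minus": 13, "times": 14,
}

# Reverse mapping (ID -> word), handy for turning IDs back into text.
ID_TO_WORD = {token_id: word for word, token_id in VOCAB.items()}

print(VOCAB["plus"])   # look up the ID for a word
print(ID_TO_WORD[3])   # look up the word for an ID
```

Because every word has a unique ID, the mapping can be reversed without losing anything.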
Now we can convert our input through the tokenizer:
Input: "two plus three" → Split: ["two", "plus", "three"] → Lookup: [2, 12, 3]
Each word becomes its ID from the vocabulary. The model never sees "two"—it only sees the number 2.
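The split-then-lookup steps can be sketched in a few lines of Python. This is an illustration, not the calculator's actual implementation: the `tokenize` function and `VOCAB` name are hypothetical, splitting on whitespace and lowercasing the input are assumptions, and so is the choice to raise an error for words outside the vocabulary.

```python
# Hypothetical vocabulary: only the words listed in the table above.
VOCAB = {
    "zero": 0, "one": 1, "two": 2,
    "three": 3, "four": 4, "five": 5,
    "plus": 12, "minus": 13, "times": 14,
}


def tokenize(text: str) -> list[int]:
    """Convert a sentence into a list of vocabulary IDs."""
    # Split: break the text into words on whitespace.
    # Lowercasing is an assumption so that "Two" and "two" match.
    words = text.lower().split()

    # Assumed behavior: refuse words the vocabulary does not contain.
    unknown = [word for word in words if word not in VOCAB]
    if unknown:
        raise KeyError(f"words not in vocabulary: {unknown}")

    # Lookup: replace each word with its ID.
    return [VOCAB[word] for word in words]


print(tokenize("two plus three"))  # [2, 12, 3]
```

Raising an error on unknown words keeps the sketch simple; a real tokenizer would need some policy for input it has never seen.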