Step 2: Embedding

A single number like "2" doesn't tell the model much. Is "two" similar to "three"? Is it related to "plus"? To capture these relationships, we convert each token ID into a vector—a list of numbers that represents the word's meaning.
Think of it like coordinates. In a city, "123 Main St" tells you exactly where something is. Similarly, a vector like [0.2, -0.5, 0.8] places a word in "meaning space":
"two" → [0.9, 0.1, 0.2, ...] (e.g. 64 numbers)
"three" → [0.85, 0.15, 0.25, ...] (e.g. 64 numbers)
"plus" → [0.1, 0.8, 0.3, ...] (e.g. 64 numbers)
Why so many numbers? The number of values (e.g. 64) is called the "embedding dimension", a choice we make when designing the model.
Imagine describing a person with just 2 numbers: height and weight. That's useful, but limited. Now add age, income, years of education—each number captures a new dimension of who they are. With more numbers, you can distinguish between people more precisely.
The same idea applies to words:
- 2 dimensions: Can barely distinguish words
- 64 dimensions: Enough for a simple task like our calculator
- 12,288 dimensions: What the largest GPT-3 model uses to capture nuance across human language
More dimensions capture more detail, but they also mean more values to learn and slower training. We'll use 64 as an example throughout this tutorial: it's small enough to understand but powerful enough for our calculator.
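To make the cost of extra dimensions concrete, here is a minimal sketch that counts how many values an embedding table holds: one vector per token in the vocabulary. The vocabulary size of 20 and the helper name `embedding_table_size` are assumptions for illustration, not something fixed by this tutorial.

```python
def embedding_table_size(vocab_size: int, embedding_dim: int) -> int:
    """Count the learned values in an embedding table (one vector per token)."""
    return vocab_size * embedding_dim


# Hypothetical vocabulary size for a small calculator model.
VOCAB_SIZE = 20

for dim in (2, 64, 12_288):
    size = embedding_table_size(VOCAB_SIZE, dim)
    print(f"{dim:>6} dimensions -> {size:>7} values to learn")
```

Even with a tiny vocabulary, the number of values grows in direct proportion to the embedding dimension, which is one reason larger dimensions take longer to train.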
What do the individual numbers mean? We don't know! The model learns these values during training. But for our calculator, we can imagine what some dimensions might capture:
| Dimension | What it might represent | "two" | "plus" |
|---|---|---|---|
| #1 | Is it a number word? | 0.95 | 0.05 |
| #2 | Is it an operation? | 0.05 | 0.92 |
| #3 | Is it a small number (0-10)? | 0.88 | 0.00 |
| #4 | Is it addition-related? | 0.10 | 0.95 |
| ... | ... | ... | ... |
| #64 | (something the model found useful) | 0.23 | 0.67 |
Looking back at the example vectors, "two" and "three" are similar (both are numbers), while "plus" is quite different (it's an operation). Words with similar meanings end up close together in this space.
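One common way to measure "close together" is cosine similarity, which compares the direction of two vectors. The sketch below applies it to the first three values of the example vectors from earlier; using only three values, and choosing cosine similarity as the measure, are simplifications made here for illustration.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# First three values of the illustrative vectors (not trained values).
two = [0.9, 0.1, 0.2]
three = [0.85, 0.15, 0.25]
plus = [0.1, 0.8, 0.3]

print(f"two vs three: {cosine_similarity(two, three):.3f}")
print(f"two vs plus:  {cosine_similarity(two, plus):.3f}")
```

The score for "two" and "three" comes out very close to 1, while the score for "two" and "plus" is much lower (under 0.3), matching the intuition that the two number words sit near each other.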
Our input becomes:
[2, 12, 3] → [[0.9, 0.1, ...], [0.1, 0.8, ...], [0.85, 0.15, ...]]
(the vectors for "two", "plus", and "three", in that order)
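In code, this step is just a table lookup: each token ID selects one row of an embedding table. The sketch below assumes a vocabulary of 20 tokens and fills the table with random values as a stand-in for what training would learn; `embedding_table` and `embed` are hypothetical names used only here.

```python
import random

VOCAB_SIZE = 20      # hypothetical; must be larger than the biggest token ID
EMBEDDING_DIM = 64   # the embedding dimension used in this tutorial

# Random starting values; a real model would adjust these during training.
random.seed(0)
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(EMBEDDING_DIM)]
    for _ in range(VOCAB_SIZE)
]


def embed(token_ids):
    """Replace each token ID with its row from the embedding table."""
    return [embedding_table[token_id] for token_id in token_ids]


vectors = embed([2, 12, 3])  # "two" "plus" "three"
print(len(vectors), len(vectors[0]))  # 3 64
```

Three token IDs go in and three vectors of 64 numbers come out, which is the shape the following steps work with.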