Build Your First LLM from Scratch · Part 1 · Section 4 of 9

Step 2: Embedding

[Illustration: words placed on a city map, where similar words are neighbors]

A single number like "2" doesn't tell the model much. Is "two" similar to "three"? Is it related to "plus"? To capture these relationships, we convert each token ID into a vector—a list of numbers that represents the word's meaning.

Think of it like coordinates. In a city, "123 Main St" tells you exactly where something is. Similarly, a vector like [0.2, -0.5, 0.8] places a word in "meaning space":

"two"   → [0.9, 0.1, 0.2, ...]   (e.g. 64 numbers)
"three" → [0.85, 0.15, 0.25, ...]  (e.g. 64 numbers)
"plus"  → [0.1, 0.8, 0.3, ...]    (e.g. 64 numbers)

Why so many numbers? The number of values (e.g. 64) is called the "embedding dimension"—a choice we make when designing the model.

Imagine describing a person with just 2 numbers: height and weight. That's useful, but limited. Now add age, income, years of education—each number captures a new dimension of who they are. With more numbers, you can distinguish between people more precisely.

The same idea applies to words:

  • 2 dimensions: Can barely distinguish words
  • 64 dimensions: Enough for a simple task like our calculator
  • 12,288 dimensions: what a model like GPT-3 uses to capture nuance in all of human language

More dimensions = more detail, but slower training. We'll use 64 as an example throughout this tutorial—it's small enough to understand but powerful enough for our calculator.
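
In code, the embedding layer is just a table with one row per token and 64 columns. Here's a minimal sketch using PyTorch; the vocabulary size of 16 is an assumption for our tiny calculator vocabulary:

import torch
import torch.nn as nn

vocab_size = 16       # assumption: our calculator vocabulary is tiny
embedding_dim = 64    # the "embedding dimension" we chose above

# One learnable row of 64 numbers per token; the values start random
# and get adjusted during training.
embedding = nn.Embedding(vocab_size, embedding_dim)

vector = embedding(torch.tensor(2))   # the row for token ID 2 ("two")
print(vector.shape)                   # torch.Size([64])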

What do the individual numbers mean? We don't know! The model learns these values during training. But for our calculator, we can imagine what some dimensions might capture:

Dimension   What it might represent              "two"   "plus"
#1          Is it a number word?                  0.95    0.05
#2          Is it an operation?                   0.05    0.92
#3          Is it a small number (0-10)?          0.88    0.00
#4          Is it addition-related?               0.10    0.95
...         ...                                   ...     ...
#64         (something the model found useful)    0.23    0.67

In reality, these dimensions are polysemantic—a single dimension might encode multiple unrelated concepts at once. Real embedding spaces are messy and don't map to human-understandable ideas. But the result is the same: similar words get similar vectors.

Notice how "two" and "three" have similar vectors (both are numbers), while "plus" is quite different (it's an operation). Words with similar meanings end up close together in this space.

Our input becomes:

[2, 12, 3] → [[0.9, 0.1, ...], [0.1, 0.8, ...], [0.85, 0.15, ...]]
                "two"            "plus"           "three"