Build Your First LLM from Scratch · Part 4 · Section 2 of 7
Intuition - What is Attention?
The Meeting Room Analogy
Imagine a meeting room where tokens are people:
"two", "plus", and "three" are sitting in a meeting.
"plus" asks: "Who should I pay attention to?"
"two" raises hand: "I'm a number, sitting before you!"
"three" raises hand: "I'm also a number, sitting after you!"
"plus" decides: "I'll pay 50% attention to 'two' and 50% to 'three'"
Now "plus" knows: "I'm adding the number before me to the number after me"What Attention Computes
For each token, attention answers: "How relevant is every other token to me?"
Let's see exactly what happens when our calculator processes "two plus three" during training:
Input: "two plus three"
Tokens: ["two", "plus", "three"]
STEP 1: Each token asks "Who should I pay attention to?"
---------------------------------------------------------
Token "two" computes attention weights:
"two" → 0.70 (I need to know what number I am)
"plus" → 0.20 (The operation affects my role)
"three" → 0.10 (Less relevant to understanding myself)
Token "plus" computes attention weights:
"two" → 0.45 (I need the first operand!)
"plus" → 0.10 (I already know I'm an addition)
"three" → 0.45 (I need the second operand!)
Token "three" computes attention weights:
"two" → 0.15 (The other number in the equation)
"plus" → 0.25 (The operation I'm part of)
"three" → 0.60 (I need to know what number I am)
STEP 2: Weighted combination updates each embedding
---------------------------------------------------
Before attention:
"two" = [0.8, 0.1, ...] (just knows "I'm the number 2")
"plus" = [0.1, 0.9, ...] (just knows "I'm addition")
"three" = [0.7, 0.2, ...] (just knows "I'm the number 3")
After attention:
"two" = [0.75, 0.3, ...] (knows "I'm 2, being added to something")
"plus" = [0.5, 0.5, ...] (knows "I'm adding 2 and 3") ← KEY!
"three" = [0.65, 0.4, ...] (knows "I'm 3, being added to something")Why "plus" matters most: Notice how "plus" gathers information from both numbers equally (0.45 each). After attention, the "plus" embedding now encodes the complete operation "2 + 3". This is why the final prediction often comes from the operation token—it has collected all the information needed to compute the answer.
Another Example: "five minus one"
Token "minus" computes attention weights:
"five" → 0.50 (I need the number I'm subtracting FROM)
"minus" → 0.05 (I know I'm subtraction)
"one" → 0.45 (I need the number I'm subtracting)
After attention, "minus" embedding contains:
- Information about 5 (the minuend)
- Information about 1 (the subtrahend)
- Its own subtraction semantics
→ Ready to predict "four"!

The key insight: attention lets each token gather exactly the information it needs from the other tokens. The model learns these attention patterns during training; we don't program them!
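Putting both steps together, a minimal single-head self-attention layer might look like the sketch below. The class name, dimensions, and random inputs are assumptions for illustration, not the exact layer built later in this series:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySelfAttention(nn.Module):
    """A minimal single-head self-attention layer (a sketch)."""

    def __init__(self, embed_dim: int):
        super().__init__()
        # Learned projections: training shapes these so that useful
        # attention patterns (like "minus" looking at its operands) emerge.
        self.W_q = nn.Linear(embed_dim, embed_dim, bias=False)
        self.W_k = nn.Linear(embed_dim, embed_dim, bias=False)
        self.W_v = nn.Linear(embed_dim, embed_dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, embed_dim), e.g. embeddings of ["five", "minus", "one"]
        q, k, v = self.W_q(x), self.W_k(x), self.W_v(x)
        scores = q @ k.T / x.shape[-1] ** 0.5  # STEP 1: relevance scores
        weights = F.softmax(scores, dim=-1)    # each row sums to 1
        return weights @ v                     # STEP 2: weighted combination

# Toy usage: three random 8-dimensional token embeddings.
x = torch.randn(3, 8)
attn = TinySelfAttention(embed_dim=8)
out = attn(x)
print(out.shape)  # torch.Size([3, 8]): one updated vector per token
```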
Our Model vs. Models at Scale
| Aspect | Our Calculator | GPT-4 Scale |
|---|---|---|
| Tokens per input | 3-5 tokens | Thousands of tokens |
| Patterns learned | Operations look at numbers | Complex: pronouns→nouns, verbs→subjects, questions→context |
| Mechanism | Same | Same, just more tokens and dimensions |
Key insight: We don't program these patterns—the model learns them during training!
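As a sanity check that the mechanism really is the same at scale, PyTorch's built-in nn.MultiheadAttention runs this exact computation with more heads and larger dimensions; the sizes below are illustrative:

```python
import torch
import torch.nn as nn

# The same attention computation, via PyTorch's built-in layer.
# GPT-scale models use thousands of tokens and far larger embeddings,
# but the mechanism is identical.
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)

x = torch.randn(1, 5, 64)     # batch of 1, five tokens, 64-dim embeddings
out, weights = attn(x, x, x)  # self-attention: queries = keys = values = x

print(out.shape)      # torch.Size([1, 5, 64]): updated token vectors
print(weights.shape)  # torch.Size([1, 5, 5]): the learned attention pattern
```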