Build Your First LLM from Scratch
Part 4 · Section 1 of 7
What We'll Build
In Part 3, we created embeddings—vectors that represent each token. But there's a problem: each token is isolated. The word "plus" doesn't know it sits between "two" and "three".
Attention solves this by letting each token "look at" every other token and gather relevant information.
Embeddings from Part 3
↓
[Self-Attention]
↓
Each token now "knows about" other tokens
↓
[Multi-Head Attention]
↓
Multiple perspectives combined
↓
Ready for Transformer Block (Part 5)
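To make this pipeline concrete before we build each piece by hand, here is a minimal sketch of the shapes involved. It uses PyTorch's built-in nn.MultiheadAttention as a stand-in for the module we'll write ourselves in the sections that follow; the 64-dimensional embeddings and 4 heads are illustrative assumptions, not values fixed by this series.

```python
import torch
import torch.nn as nn

# Illustrative shapes (assumptions): 1 sequence, 3 tokens ("two plus three"), 64-d embeddings
batch_size, seq_len, embed_dim = 1, 3, 64

# Stand-in for the embeddings produced in Part 3
embeddings = torch.randn(batch_size, seq_len, embed_dim)

# Stand-in for the multi-head self-attention module we'll build later in Part 4
attention = nn.MultiheadAttention(embed_dim, num_heads=4, batch_first=True)

# Self-attention: the sequence attends to itself (query = key = value = embeddings)
context_aware, weights = attention(embeddings, embeddings, embeddings)

print(context_aware.shape)  # torch.Size([1, 3, 64]) -- same shape, but each vector now mixes in the others
print(weights.shape)        # torch.Size([1, 3, 3])  -- one attention weight per (token, token) pair
```

The key point is that the output has the same shape as the input: attention doesn't change what a token vector is, only what information it carries.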
The Problem Attention Solves
After Part 3, we have embeddings for "two plus three":
"two" → [0.23, -0.45, ...] (64 numbers)
"plus" → [0.67, 0.12, ...] (64 numbers)
"three" → [0.89, -0.34, ...] (64 numbers)These vectors are isolated. "plus" doesn't know it sits between "two" and "three". Attention lets each token gather information from other tokens.
Sections Overview
| Section | What We Build | At Scale |
|---|---|---|
| 4.1 | Intuition: What is attention? | Same concept |
| 4.2 | Query, Key, Value | Same, larger matrices |
| 4.3 | Attention scores | Scaled dot-product |
| 4.4 | Single-head attention | Same pattern |
| 4.5 | Multi-head attention | 96 heads in GPT-3 |
| 4.6 | Masked attention | Causal masking |
| 4.7 | Complete attention module | Same pattern |