Step 5: Output Layer

The fifth station is like a scoring machine with meters for every word in the vocabulary. After all the processing, this machine evaluates: "How likely is each word to be the correct next word?"
After the transformer layers, the model has a rich understanding of "two plus three". But it's still just vectors—we need an actual answer.
The output layer asks: "Given everything I've learned about this input, which word in my vocabulary is most likely to come next?"
It scores every word in the vocabulary. For the input "two plus three", the model scores each possible answer:
"zero" → 0.1%
"one" → 0.2%
"two" → 0.3%
"three" → 0.5%
"four" → 2.1%
"five" → 94.2% ← highest!
"six" → 1.8%
"seven" → 0.4%
...
"plus" → 0.01%
"minus" → 0.01%

These percentages are called probabilities. They add up to 100% across all words in the vocabulary. The model is saying: "I'm 94.2% confident the answer is 'five'."
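Once the model has these probabilities, producing an answer can be as simple as picking the highest-scoring word (this strategy is called greedy decoding). A minimal sketch, with the article's rounded numbers hard-coded:

```python
# The article's (rounded) probabilities for a few candidate next words.
probs = {"zero": 0.001, "one": 0.002, "two": 0.003, "three": 0.005,
         "four": 0.021, "five": 0.942, "six": 0.018, "seven": 0.004}

# Greedy decoding: pick the word the model is most confident in.
answer = max(probs, key=probs.get)
print(answer)  # "five"
```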
From Vectors to Probabilities: Logits and Softmax
Converting the model's internal understanding into an answer happens in two critical steps. This is where many learners get confused, so let's break it down clearly.
Step A: Logits (Raw Scores)
First, we multiply the final vector by a weight matrix to get a score for each word. These raw scores are called logits (from "log-odds" in statistics).
final_vector (64 numbers) × weight_matrix (64 × 30) = logits (30 numbers)
logits = [-2.1, -1.8, -1.5, -0.9, 0.3, 4.2, 0.1, -0.5, ...]
         zero  one   two  three four five six  seven

Key insight: Logits are not probabilities! They can be negative, they can be any number, and they don't add up to anything meaningful. They're just raw "preference scores": higher means the model prefers that word.
Where does the weight matrix come from? We create it, and the model learns its values during training.
- 64 = our embedding dimension (the size of each word vector)
- 30 = our vocabulary size (how many words the model knows)
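The projection above is just one matrix multiply. Here is a sketch using the article's dimensions (64 and 30); the actual vector and weight values are random stand-ins, since in a real model they come from training:

```python
import random

EMBED_DIM = 64   # size of each word vector (from the article)
VOCAB_SIZE = 30  # number of words the model knows

# Random stand-ins for the final hidden vector of "two plus three"
# and the learned output weight matrix.
random.seed(0)
final_vector = [random.gauss(0, 1) for _ in range(EMBED_DIM)]
weight_matrix = [[random.gauss(0, 0.1) for _ in range(VOCAB_SIZE)]
                 for _ in range(EMBED_DIM)]

# Matrix multiply: (64,) x (64, 30) -> (30,) raw logit scores,
# one per vocabulary word.
logits = [sum(final_vector[i] * weight_matrix[i][j] for i in range(EMBED_DIM))
          for j in range(VOCAB_SIZE)]

print(len(logits))  # 30 scores, one per word
```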
Step B: Softmax (Logits → Probabilities)
Now we need to convert these messy logits into clean probabilities. This is what softmax does:
BEFORE (Logits - raw scores):
zero: -2.1 one: -1.8 two: -1.5 three: -0.9
four: 0.3 five: 4.2 six: 0.1 seven: -0.5
→ Can be negative ❌
→ Don't sum to anything meaningful ❌
→ Hard to interpret ❌
↓ SOFTMAX ↓
AFTER (Probabilities):
zero: 0.1% one: 0.2% two: 0.3% three: 0.5%
four: 2.1% five: 94.2% six: 1.8% seven: 0.4%
→ All positive ✓
→ Sum to exactly 100% ✓
→ Easy to interpret as confidence ✓

The softmax formula exponentiates each logit and then normalizes, which amplifies differences: a logit of 4.2 vs 0.3 becomes 94.2% vs 2.1%. The highest score "wins" decisively.
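Softmax fits in a few lines. This sketch runs it on the eight logits shown above; because the full model normalizes over all 30 words, the resulting percentages come out close to, but not exactly, the article's numbers:

```python
import math

# The eight logits shown above (the full model has 30, one per word).
vocab = ["zero", "one", "two", "three", "four", "five", "six", "seven"]
logits = [-2.1, -1.8, -1.5, -0.9, 0.3, 4.2, 0.1, -0.5]

def softmax(scores):
    # Subtract the max before exponentiating: a standard trick for
    # numerical stability that doesn't change the result, since
    # softmax is invariant to shifting all scores by a constant.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)
for word, p in zip(vocab, probs):
    print(f"{word}: {p:.1%}")
```

Note how "five", whose logit (4.2) is only a few points above the rest, ends up with over 90% of the probability mass: exponentiation turns an additive gap in logits into a multiplicative gap in probabilities.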