Step 5: Output Layer

[Illustration: the output layer scores every word in the vocabulary and selects the highest-probability answer]
Every word gets a probability score

The fifth station is like a scoring machine with meters for every word in the vocabulary. After all the processing, this machine evaluates: "How likely is each word to be the correct next word?"

After the transformer layers, the model has a rich understanding of "two plus three". But it's still just vectors—we need an actual answer.

The output layer asks: "Given everything I've learned about this input, which word in my vocabulary is most likely to come next?"

It scores every word in the vocabulary. For the input "two plus three", the scores might look like this:

"zero"   → 0.1%
"one"    → 0.2%
"two"    → 0.3%
"three"  → 0.5%
"four"   → 2.1%
"five"   → 94.2%  ← highest!
"six"    → 1.8%
"seven"  → 0.4%
...
"plus"   → 0.01%
"minus"  → 0.01%

These percentages are called probabilities. They add up to 100% across all words in the vocabulary. The model is saying: "I'm 94.2% confident the answer is 'five'."
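Once every word has a probability, picking the answer is simple: take the word with the highest score. Here is a minimal sketch in Python, using made-up probabilities matching the table above (not the real model's code):

```python
# Hypothetical word -> probability scores, as in the example above.
probs = {
    "zero": 0.001, "one": 0.002, "two": 0.003, "three": 0.005,
    "four": 0.021, "five": 0.942, "six": 0.018, "seven": 0.004,
    "plus": 0.0001, "minus": 0.0001,
}

# The model's answer is simply the word with the highest probability.
best_word = max(probs, key=probs.get)
print(best_word)  # → five
```

This "always pick the top word" strategy is called greedy decoding; real systems sometimes sample from the distribution instead, but the idea is the same.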

From Vectors to Probabilities: Logits and Softmax

Converting the model's internal understanding into an answer happens in two critical steps. This is where many learners get confused, so let's break it down clearly.

Step A: Logits (Raw Scores)

First, we multiply the final vector by a weight matrix to get a score for each word. These raw scores are called logits (from "log-odds" in statistics).

final_vector (64 numbers) × weight_matrix (64 × 30) = logits (30 numbers)

logits = [-2.1, -1.8, -1.5, -0.9, 0.3, 4.2, 0.1, -0.5, ...]
           zero   one   two  three four  five  six seven

Key insight: Logits are not probabilities! They can be negative, they can be any number, and they don't add up to anything meaningful. They're just raw "preference scores" — higher means the model prefers that word.

Where does the weight matrix come from? We create it, and the model learns its values during training.

  • 64 = our embedding dimension (the size of each word vector)
  • 30 = our vocabulary size (how many words the model knows)
Initially, this matrix is filled with random numbers. During training, these numbers get adjusted so that the correct answer gets the highest score. This is what "learning" means—the model is tuning these numbers to give better predictions.
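The multiplication itself is a single matrix product. This sketch uses random stand-in values (the shapes match the text: embedding dimension 64, vocabulary size 30), since the real numbers would come from training:

```python
import numpy as np

rng = np.random.default_rng(0)
final_vector = rng.standard_normal(64)          # (64,) output of the last transformer layer
weight_matrix = rng.standard_normal((64, 30))   # (64, 30) learned output projection

logits = final_vector @ weight_matrix           # (30,) one raw score per vocabulary word
print(logits.shape)  # → (30,)
```

Note that the result is just 30 unbounded numbers: some negative, some positive, nothing summing to 1. That is exactly the "raw preference scores" problem softmax solves next.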

Step B: Softmax (Logits → Probabilities)

Now we need to convert these messy logits into clean probabilities. This is what softmax does:

BEFORE (Logits - raw scores):
  zero: -2.1    one: -1.8    two: -1.5    three: -0.9
  four:  0.3    five: 4.2    six:  0.1    seven: -0.5

  → Can be negative ❌
  → Don't sum to anything meaningful ❌
  → Hard to interpret ❌

                    ↓ SOFTMAX ↓

AFTER (Probabilities):
  zero: 0.1%    one: 0.2%    two: 0.3%    three: 0.5%
  four: 2.1%    five: 94.2%  six: 1.8%    seven: 0.4%

  → All positive ✓
  → Sum to exactly 100% ✓
  → Easy to interpret as confidence ✓

The softmax formula exponentiates each logit and divides by the sum of all the exponentials: softmax(xᵢ) = e^xᵢ / Σⱼ e^xⱼ. Because exponentials grow so quickly, softmax amplifies differences: a logit of 4.2 vs 0.3 becomes 94.2% vs 2.1%. The highest score "wins" decisively.
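Softmax is short enough to write out by hand. This sketch applies it to the eight example logits from the text; the real model would softmax over all 30 vocabulary words, so the exact percentages differ slightly:

```python
import math

words  = ["zero", "one", "two", "three", "four", "five", "six", "seven"]
logits = [-2.1, -1.8, -1.5, -0.9, 0.3, 4.2, 0.1, -0.5]

exps = [math.exp(x) for x in logits]   # exponentiate: every value becomes positive
total = sum(exps)
probs = [e / total for e in exps]      # normalize: the values now sum to exactly 1.0

for w, p in zip(words, probs):
    print(f"{w}: {p:.1%}")
# "five" dominates at roughly 94%
```

Production code subtracts the maximum logit before exponentiating to avoid overflow, but the result is mathematically identical.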

Summary: Logits are the model's raw preferences (messy numbers). Softmax converts them into probabilities (clean percentages that sum to 100%). This is how the model expresses "I'm 94.2% confident the answer is 'five'."