Build Your First LLM from Scratch · Part 3 · Section 11 of 13

Adding Positional Encoding

Embeddings don't know word order: the token "three" gets the same vector whether it appears first or last, so "two plus three" and "three plus two" look identical to the model. We fix this by adding position information to each embedding.

import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, max_seq_len: int = 32, embed_dim: int = 64):
        super().__init__()
        # Learnable position embeddings: one vector per position index
        self.pos_embedding = nn.Embedding(max_seq_len, embed_dim)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (seq_len, embed_dim)
        seq_len = embeddings.size(0)
        positions = torch.arange(seq_len, device=embeddings.device)  # [0, 1, 2, ...]
        pos_vectors = self.pos_embedding(positions)  # (seq_len, embed_dim)
        return embeddings + pos_vectors  # Add position info

Now the same word at different positions has different vectors:

# "three" at position 0 vs position 2
# vector("three", pos=0) ≠ vector("three", pos=2)

# This lets the model distinguish:
# "five minus three" → 2  (five at pos 0)
# "three minus five" → -2 (three at pos 0)

At Scale: Position Encoding Methods

Model                  Position Method      Max Length
Original Transformer   Sinusoidal (fixed)   512
GPT-2                  Learned              1,024
GPT-3                  Learned              2,048
GPT-4                  RoPE                 8,000-128,000
LLaMA                  RoPE                 4,000-100,000+
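For comparison, here is a minimal sketch of the "Sinusoidal (fixed)" method from the table: the original Transformer computed position vectors from sine and cosine waves of different frequencies instead of learning them. The function name and defaults below are illustrative, not from any particular codebase.

import math
import torch

def sinusoidal_positions(max_seq_len: int = 32, embed_dim: int = 64) -> torch.Tensor:
    # pe[pos, 2i]   = sin(pos / 10000^(2i / embed_dim))
    # pe[pos, 2i+1] = cos(pos / 10000^(2i / embed_dim))
    positions = torch.arange(max_seq_len).float().unsqueeze(1)           # (max_seq_len, 1)
    div_term = torch.exp(torch.arange(0, embed_dim, 2).float()
                         * (-math.log(10000.0) / embed_dim))             # (embed_dim/2,)
    pe = torch.zeros(max_seq_len, embed_dim)
    pe[:, 0::2] = torch.sin(positions * div_term)
    pe[:, 1::2] = torch.cos(positions * div_term)
    return pe  # Added to token embeddings; never updated during training

Because nothing here is learned, the table can be recomputed for any sequence length, which is why the original Transformer could use a fixed formula.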

RoPE (Rotary Position Embedding) is the modern standard:

  • Encodes relative position, not just absolute
  • Can extrapolate to longer sequences than trained on
  • Mathematically elegant (rotates vectors based on position)
We use simple learned positions. The concept is identical: tell the model where each token is in the sequence. The limitation is that if we train with max_seq_len=32, learned positions fail at position 33 because there is no embedding for it. RoPE avoids this by encoding relative positions mathematically.
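A rough sketch of the rotation idea behind RoPE (a simplified illustration, not any particular library's implementation): each query/key vector is split into 2-D pairs, and each pair is rotated by an angle proportional to the token's position. The helper name and frequency choice below are assumptions for illustration.

import torch

def rope_rotate(x: torch.Tensor, position: int, base: float = 10000.0) -> torch.Tensor:
    # x: (embed_dim,) with embed_dim even; rotate consecutive pairs (x0, x1), (x2, x3), ...
    half = x.size(0) // 2
    freqs = base ** (-torch.arange(half).float() / half)    # one frequency per pair
    angles = position * freqs                                # rotation angle grows with position
    cos, sin = torch.cos(angles), torch.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    rotated = torch.empty_like(x)
    rotated[0::2] = x1 * cos - x2 * sin
    rotated[1::2] = x1 * sin + x2 * cos
    return rotated

# Key property: the dot product of rope_rotate(q, m) and rope_rotate(k, n)
# depends only on the relative offset (m - n), not on m and n individually.

In a real model this rotation is applied to the query and key vectors inside each attention head, rather than to the token embeddings themselves.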