Build Your First LLM from Scratch · Part 3 · Section 11 of 13

Adding Positional Encoding

Embeddings don't know word order: the token "three" gets the same vector whether it appears first or last, so "two plus three" and "three plus two" look identical to the model. We fix this by adding position information to each embedding.

import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, max_seq_len: int = 32, embed_dim: int = 64):
        super().__init__()
        # Learnable position embeddings: one vector per position index
        self.pos_embedding = nn.Embedding(max_seq_len, embed_dim)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (seq_len, embed_dim)
        seq_len = embeddings.size(0)
        positions = torch.arange(seq_len, device=embeddings.device)  # [0, 1, 2, ...]
        pos_vectors = self.pos_embedding(positions)  # (seq_len, embed_dim)
        return embeddings + pos_vectors  # Add position info

Now the same word at different positions has different vectors:

# "three" at position 0 vs position 2
# vector("three", pos=0) ≠ vector("three", pos=2)

# This lets the model distinguish:
# "five minus three" → 2  (five at pos 0)
# "three minus five" → -2 (three at pos 0)

At Scale: Position Encoding Methods

Model                  Position Method      Max Length
Original Transformer   Sinusoidal (fixed)   512
GPT-2                  Learned              1,024
GPT-3                  Learned              2,048
GPT-4                  RoPE                 8,000-128,000
LLaMA                  RoPE                 4,000-100,000+
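For comparison, here is a minimal sketch of the "Sinusoidal (fixed)" method from the table: the original Transformer computed position vectors from sine and cosine waves of different frequencies instead of learning them. The function name and defaults below are illustrative, not from any particular codebase.

import math
import torch

def sinusoidal_positions(max_seq_len: int = 32, embed_dim: int = 64) -> torch.Tensor:
    # pe[pos, 2i]   = sin(pos / 10000^(2i / embed_dim))
    # pe[pos, 2i+1] = cos(pos / 10000^(2i / embed_dim))
    positions = torch.arange(max_seq_len).float().unsqueeze(1)           # (max_seq_len, 1)
    div_term = torch.exp(torch.arange(0, embed_dim, 2).float()
                         * (-math.log(10000.0) / embed_dim))             # (embed_dim/2,)
    pe = torch.zeros(max_seq_len, embed_dim)
    pe[:, 0::2] = torch.sin(positions * div_term)
    pe[:, 1::2] = torch.cos(positions * div_term)
    return pe  # Added to token embeddings; never updated during training

Because nothing here is learned, the table can be recomputed for any sequence length, which is why the original Transformer could use a fixed formula.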

RoPE (Rotary Position Embedding) is the modern standard:

  • Encodes relative position, not just absolute
  • Can extrapolate to longer sequences than trained on
  • Mathematically elegant (rotates vectors based on position)
We use simple learned positions. The concept is identical: tell the model where each token is in the sequence. The limitation is that if we train with max_seq_len=32, learned positions fail at position 33 because there is no embedding for it. RoPE avoids this by encoding relative positions mathematically.
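A rough sketch of the rotation idea behind RoPE (a simplified illustration, not any particular library's implementation): each query/key vector is split into 2-D pairs, and each pair is rotated by an angle proportional to the token's position. The helper name and frequency choice below are assumptions for illustration.

import torch

def rope_rotate(x: torch.Tensor, position: int, base: float = 10000.0) -> torch.Tensor:
    # x: (embed_dim,) with embed_dim even; rotate consecutive pairs (x0, x1), (x2, x3), ...
    half = x.size(0) // 2
    freqs = base ** (-torch.arange(half).float() / half)    # one frequency per pair
    angles = position * freqs                                # rotation angle grows with position
    cos, sin = torch.cos(angles), torch.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    rotated = torch.empty_like(x)
    rotated[0::2] = x1 * cos - x2 * sin
    rotated[1::2] = x1 * sin + x2 * cos
    return rotated

# Key property: the dot product of rope_rotate(q, m) and rope_rotate(k, n)
# depends only on the relative offset (m - n), not on m and n individually.

In a real model this rotation is applied to the query and key vectors inside each attention head, rather than to the token embeddings themselves.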