Build Your First LLM from Scratch · Part 3 · Section 5 of 13

PyTorch Basics

We'll use PyTorch, the most popular deep learning library (used by OpenAI, Meta, Tesla). Think of it as "NumPy that can learn."

How does it learn? Learning happens during training, not when you query the model. Training is a separate phase where we show the model thousands of examples and adjust its numbers:

# TRAINING PHASE (happens once, before deployment):
for example in training_data:        # Loop through 1000s of examples
    optimizer.zero_grad()            # 0. Clear gradients left over from the last example
    prediction = model(example)      # 1. Model makes a guess
    loss = how_wrong(prediction)     # 2. Measure how wrong it was
    loss.backward()                  # 3. Figure out which numbers caused the error
    optimizer.step()                 # 4. Tweak those numbers slightly

# After training, save the model's learned numbers
torch.save(model.state_dict(), "calculator.pt")

# INFERENCE PHASE (happens every time you use the model):
model.load_state_dict(torch.load("calculator.pt"))  # Load the frozen numbers
answer = model("two plus three")     # Just compute; no learning happens here

The magic is in step 3: loss.backward(). PyTorch tracked every math operation during the forward pass. Now it works backwards ("backpropagation") to calculate: "if I nudge this number up, does the error go down?" It does this for millions of numbers automatically.
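
To make this concrete, here's a minimal runnable sketch of that backward pass acting on a single learnable number (the target value 5.0 and the step size 0.1 are made up for illustration):

import torch

# One learnable number, starting at a bad guess
w = torch.tensor(2.0, requires_grad=True)

# Forward pass: measure how far w is from the answer we want (5)
loss = (w - 5.0) ** 2

# Backward pass: PyTorch replays the operations above in reverse
loss.backward()
print(w.grad)                # tensor(-6.) -> nudging w upward shrinks the loss

# An optimizer.step() would then move w a small step in that direction
with torch.no_grad():
    w -= 0.1 * w.grad        # w moves from 2.0 toward 5.0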

This is how attention learns what to attend to. Remember in Part 1, we said "plus" learns to pay attention to the numbers around it? That happens here. Initially, attention weights are random. But when the model predicts "seven" instead of "five" for "two plus three", backpropagation adjusts the attention weights so "plus" pays more attention to "two" and "three" next time.

Key insight: When you chat with ChatGPT, no learning happens—it's just doing math with frozen numbers. Those numbers were learned during training (which cost OpenAI ~$100M). Our calculator will train in minutes because it's tiny.
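
In code, that "frozen numbers" mode is explicit: at inference time you switch off gradient tracking, so nothing can change. A small sketch, using a plain linear layer as a stand-in for a real trained model:

import torch
import torch.nn as nn

model = nn.Linear(4, 1)            # stand-in for a trained model (its numbers would normally be loaded from disk)
model.eval()                       # put layers like dropout into inference mode
with torch.no_grad():              # stop tracking operations: no gradients, no learning
    answer = model(torch.randn(1, 4))
print(answer.shape)                # torch.Size([1, 1])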

Tensors: The Building Block

A tensor is just a multi-dimensional array of numbers:

import torch

# A scalar (0D tensor) - just a number
x = torch.tensor(5)

# A vector (1D tensor) - a list of numbers
x = torch.tensor([1, 2, 3])

# A matrix (2D tensor) - rows and columns
x = torch.tensor([[1, 2], [3, 4], [5, 6]])

# Our embeddings will be 2D: [sequence_length, embedding_dim]
# e.g., [3, 64] = 3 tokens, each with 64 numbers
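
A habit worth forming early: print a tensor's .shape whenever you're unsure what you're holding. Continuing from the matrix above (emb here is just a random stand-in for the embeddings we'll build later):

print(x.shape)               # torch.Size([3, 2]) - 3 rows, 2 columns

# A random stand-in shaped like our future embeddings: 3 tokens, 64 numbers each
emb = torch.randn(3, 64)
print(emb.shape)             # torch.Size([3, 64])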

nn.Module: Building Blocks for Neural Networks

Every neural network component in PyTorch inherits from nn.Module. It's like a template that says: "I have learnable parameters and I do something to input data."

import torch.nn as nn

class MyLayer(nn.Module):
    def __init__(self):
        super().__init__()  # Always call this first
        # Define learnable parameters here
        self.weights = nn.Parameter(torch.randn(10, 10))

    def forward(self, x):
        # Define what happens when data passes through
        return x @ self.weights  # @ is matrix multiplication

The __init__ method sets up the layer (runs once). The forward method processes data (runs every time you use the layer).
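
Here's a small usage sketch of the MyLayer class above (the batch size of 4 is arbitrary):

layer = MyLayer()              # __init__ runs once: the 10x10 weights are created
data = torch.randn(4, 10)      # a batch of 4 inputs, 10 numbers each
out = layer(data)              # calling the layer runs forward(data) behind the scenes
print(out.shape)               # torch.Size([4, 10])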

nn.Embedding: A Lookup Table

PyTorch provides many pre-built layers. nn.Embedding is one we'll use a lot—it's just a table that maps IDs to vectors:

# Create a table: 100 words, each gets a 64-number vector
embed = nn.Embedding(num_embeddings=100, embedding_dim=64)

# Look up word ID 5
vector = embed(torch.tensor([5]))
print(vector.shape)  # [1, 64] - one word, 64 numbers

# Look up multiple words at once
vectors = embed(torch.tensor([5, 12, 3]))
print(vectors.shape)  # [3, 64] - three words, 64 numbers each

The numbers in the embedding table start random. During training, PyTorch adjusts them so similar words get similar vectors. That's what "learning" means—finding good numbers.
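
You can peek at that table directly; it's just a learnable matrix with one row per ID. A quick sketch, reusing the embed layer from above:

print(embed.weight.shape)           # torch.Size([100, 64]) - one 64-number row per word ID
print(embed.weight.requires_grad)   # True - training is allowed to adjust these numbers

# Looking up ID 5 just returns row 5 of that matrix
print(torch.allclose(embed(torch.tensor([5]))[0], embed.weight[5]))  # True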

What's Learned vs What's Chosen?

This confuses many beginners. Some things are hyperparameters (you choose them before training), others are parameters (learned during training):

You Choose (Hyperparameters)   | Model Learns (Parameters)
Embedding dimension (64)       | The actual 64 numbers for each word
Number of layers (4)           | The weights inside each layer
Vocabulary size (36)           | Which vectors are similar to which
Learning rate (0.001)          | Attention patterns (what to focus on)

Think of it like building a house: you choose the blueprint (4 bedrooms, 2 floors), but the construction fills in the actual bricks. The model can't decide "I need 128 dimensions instead of 64"—that's your architectural choice. But what those 64 numbers should be for each word? That's learned.
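
In code, the split is easy to see: the numbers you pass into a layer's constructor are hyperparameters, and everything its parameters() method returns is learned. A small sketch using the sizes from the table above:

import torch.nn as nn

# Hyperparameters: chosen by you before training ever starts
embed = nn.Embedding(num_embeddings=36, embedding_dim=64)

# Parameters: the actual numbers, found during training
learned = sum(p.numel() for p in embed.parameters())
print(learned)   # 2304 = 36 words x 64 numbers each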
