PyTorch Basics
We'll use PyTorch, the most popular deep learning library (used by OpenAI, Meta, Tesla). Think of it as "NumPy that can learn."
How does it learn? Learning happens during training, not when you query the model. Training is a separate phase where we show the model thousands of examples and adjust its numbers:
# TRAINING PHASE (happens once, before deployment):
for example in training_data:        # Loop through 1000s of examples
    optimizer.zero_grad()             # Clear gradients left over from the previous example
    prediction = model(example)       # 1. Model makes a guess
    loss = how_wrong(prediction)      # 2. Measure how wrong it was
    loss.backward()                   # 3. Figure out which numbers caused the error
    optimizer.step()                  # 4. Tweak those numbers slightly

# After training, save the model's learned numbers
torch.save(model.state_dict(), "calculator.pt")

# INFERENCE PHASE (happens every time you use the model):
model.load_state_dict(torch.load("calculator.pt"))  # Load the frozen numbers
answer = model("two plus three")                     # Just compute; no learning happens here

The magic is in step 3: loss.backward(). PyTorch tracked every math operation during the forward pass. Now it works backwards ("backpropagation") to calculate: "if I nudge this number up, does the error go down?" It does this for millions of numbers automatically.
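Here's the smallest possible taste of autograd (a toy with a single learnable number, not our real model): PyTorch records the forward computation, and backward() fills in the gradient for us.

import torch

w = torch.tensor(2.0, requires_grad=True)  # one learnable number
loss = (w * 3 - 12) ** 2                   # forward pass: we want w * 3 to equal 12
loss.backward()                            # backward pass: compute d(loss)/dw
print(w.grad)                              # tensor(-36.) - negative means "nudge w up and the error shrinks"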
This is how attention learns what to attend to. Remember in Part 1, we said "plus" learns to pay attention to the numbers around it? That happens here. Initially, attention weights are random. But when the model predicts "seven" instead of "five" for "two plus three", backpropagation adjusts the attention weights so "plus" pays more attention to "two" and "three" next time.
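You can watch this happen in miniature. The sketch below is not the real attention mechanism (that comes later); it's just three learnable scores pushed through a softmax, with a made-up loss that rewards putting weight on positions 0 and 2 (think "two" and "three" around "plus").

import torch
import torch.nn.functional as F

scores = torch.zeros(3, requires_grad=True)   # start with equal "attention scores" for 3 tokens
optimizer = torch.optim.SGD([scores], lr=1.0)

for _ in range(100):
    weights = F.softmax(scores, dim=0)        # attention weights that sum to 1
    loss = -(weights[0] + weights[2]).log()   # small when positions 0 and 2 get the attention
    optimizer.zero_grad()
    loss.backward()                           # which scores caused the error?
    optimizer.step()                          # nudge them

print(F.softmax(scores, dim=0))               # most of the weight is now on positions 0 and 2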
Tensors: The Building Block
A tensor is just a multi-dimensional array of numbers:
import torch
# A scalar (0D tensor) - just a number
x = torch.tensor(5)
# A vector (1D tensor) - a list of numbers
x = torch.tensor([1, 2, 3])
# A matrix (2D tensor) - rows and columns
x = torch.tensor([[1, 2], [3, 4], [5, 6]])
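# Every tensor knows its shape
print(x.shape)  # torch.Size([3, 2]) - 3 rows, 2 columns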
# Our embeddings will be 2D: [sequence_length, embedding_dim]
# e.g., [3, 64] = 3 tokens, each with 64 numbers

nn.Module: Building Blocks for Neural Networks
Every neural network component in PyTorch inherits from nn.Module. It's like a template that says: "I have learnable parameters and I do something to input data."
import torch.nn as nn
class MyLayer(nn.Module):
    def __init__(self):
        super().__init__()  # Always call this first
        # Define learnable parameters here
        self.weights = nn.Parameter(torch.randn(10, 10))

    def forward(self, x):
        # Define what happens when data passes through
        return x @ self.weights  # @ is matrix multiplication

The __init__ method sets up the layer (runs once). The forward method processes data (runs every time you use the layer).
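To see that split in action, here's a quick usage sketch continuing from the class above: constructing the layer runs __init__ once, and every call runs forward.

layer = MyLayer()        # __init__ runs once: the 10x10 weights are created
x = torch.randn(4, 10)   # a batch of 4 inputs, 10 numbers each
out = layer(x)           # calling the layer runs forward(x) under the hood
print(out.shape)         # torch.Size([4, 10])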
nn.Embedding: A Lookup Table
PyTorch provides many pre-built layers. nn.Embedding is one we'll use a lot—it's just a table that maps IDs to vectors:
# Create a table: 100 words, each gets a 64-number vector
embed = nn.Embedding(num_embeddings=100, embedding_dim=64)
# Look up word ID 5
vector = embed(torch.tensor([5]))
print(vector.shape) # [1, 64] - one word, 64 numbers
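# The whole table lives in embed.weight, a learnable parameter
print(embed.weight.shape)  # [100, 64] - one 64-number row per word ID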
# Look up multiple words at once
vectors = embed(torch.tensor([5, 12, 3]))
print(vectors.shape) # [3, 64] - three words, 64 numbers each

What's Learned vs What's Chosen?
This confuses many beginners. Some things are hyperparameters (you choose them before training), others are parameters (learned during training):
| You Choose (Hyperparameters) | Model Learns (Parameters) |
|---|---|
| Embedding dimension (64) | The actual 64 numbers for each word |
| Number of layers (4) | The weights inside each layer |
| Vocabulary size (36) | Which vectors are similar to which |
| Learning rate (0.001) | Attention patterns (what to focus on) |
Think of it like building a house: you choose the blueprint (4 bedrooms, 2 floors), but the construction fills in the actual bricks. The model can't decide "I need 128 dimensions instead of 64"—that's your architectural choice. But what those 64 numbers should be for each word? That's learned.
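Here's that split in code, using the vocabulary size (36) and embedding dimension (64) from the table above. This is just a sketch: you pick the shapes, PyTorch allocates the learnable values, and training adjusts them.

import torch.nn as nn

embed = nn.Embedding(num_embeddings=36, embedding_dim=64)  # the sizes are your choice

print(embed.weight.shape)          # torch.Size([36, 64]) - the blueprint you chose
print(embed.weight.requires_grad)  # True - these values get adjusted during training
print(sum(p.numel() for p in embed.parameters()))  # 2304 numbers the model has to learn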