Part 1: Foundations
Build the mental model of how LLMs work before writing any code
This is a high-level, simplified explanation. We're intentionally oversimplifying to help you understand the core concepts first. The details will come in later parts.
What is an LLM?
Large because it has billions of parameters (the numbers the model adjusts during training to get better at predictions). Language because it works with text. Model because it's a mathematical function that learns patterns.
An LLM (Large Language Model) does one thing: it predicts the next word.
That's it. Everything else—chat, code generation, reasoning, translation—emerges from this single capability.
Important: The model doesn't "eat" the input and "spit out" an answer. It appends its prediction to the input. So "two plus three" becomes "two plus three five". This is called autoregressive generation—each new word is added to the sequence, then used to predict the next one.
The Core Insight
When you ask an LLM "What is 2+2?", it doesn't "think" or "calculate". It predicts that after the sequence of words "What is 2+2?", the most likely next words are "2+2 equals 4" or simply "4".
It learned this by reading billions of examples where questions were followed by answers.
Common Misconceptions
| Misconception | Reality |
|---|---|
| "It understands" | It predicts patterns |
| "It thinks" | It does matrix math |
| "It knows things" | It learned statistical relationships |
| "It remembers our chat" | Each input is processed fresh (the app re-sends the chat history every time) |
The Analogy
Think of your phone's autocomplete, but trained on the entire internet and scaled up a million times. When you type "How are", your phone suggests "you" because that pattern appears frequently. An LLM does the same thing, just with much longer contexts and much more sophisticated pattern matching.
Key takeaway: An LLM is a very sophisticated pattern-matching machine that predicts what text should come next.
The Complete Pipeline
Let's trace the complete journey of a single question through our calculator model:
| Input | Output |
|---|---|
| "two plus three" | "five" |
At a high level, the model learned during training that after the sequence "two plus three", the most probable next word is "five".
Let's understand how the model finds its path to "five":
Tokenization
Computers don't understand words—they only understand numbers. So we need to convert text into numbers. We do this by creating a vocabulary: a list of all words the model knows, where each word gets a unique ID.
For our calculator, the vocabulary might look like:
{ "zero": 0, "one": 1, "two": 2, "three": 3, ... "plus": 12, "minus": 13, ... }
Now we can convert our input:
"two plus three"
↓
["two", "plus", "three"] → split into words
↓
[2, 12, 3] → look up each word's ID
Each word becomes its ID from the vocabulary. The model never sees "two"—it only sees the number 2.
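The two steps above can be sketched in a few lines. The vocabulary below is a stand-in for illustration (a real model ships with a fixed vocabulary file):

```python
# Minimal tokenizer sketch for the calculator vocabulary.
# The IDs below are illustrative, matching the example mapping above.
VOCAB = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
    "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9,
    "plus": 12, "minus": 13,
}

def tokenize(text: str) -> list[int]:
    """Split on whitespace and map each word to its vocabulary ID."""
    return [VOCAB[word] for word in text.split()]

print(tokenize("two plus three"))  # → [2, 12, 3]
```

Real tokenizers split text into subword pieces rather than whole words, but the lookup idea is the same.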
Embedding
A single number like "2" doesn't tell the model much. Is "two" similar to "three"? Is it related to "plus"? To capture these relationships, we convert each token ID into a vector—a list of numbers that represents the word's meaning.
Think of it like coordinates. In a city, "123 Main St" tells you exactly where something is. Similarly, a vector like [0.2, -0.5, 0.8] places a word in "meaning space":
"two" → [0.9, 0.1, 0.2, ...] (e.g. 64 numbers)
"three" → [0.85, 0.15, 0.25, ...] (e.g. 64 numbers)
"plus" → [0.1, 0.8, 0.3, ...] (e.g. 64 numbers)
Why so many numbers? The number of values (e.g. 64) is called the "embedding dimension"—a choice we make when designing the model.
Imagine describing a person with just 2 numbers: height and weight. That's useful, but limited. Now add age, income, years of education—each number captures a new dimension of who they are. With more numbers, you can distinguish between people more precisely.
The same idea applies to words:
- 2 dimensions: Can barely distinguish words
- 64 dimensions: Enough for a simple task like our calculator
- 12,000+ dimensions: What the largest models use (GPT-3's embedding dimension was 12,288) to capture nuance in all of human language
More dimensions = more detail, but slower training. We'll use 64 as an example throughout this tutorial—it's small enough to understand but powerful enough for our calculator.
What do the individual numbers mean? We don't know! The model learns these values during training. But for our calculator, we can imagine what some dimensions might capture:
| Dimension | What it might represent | "two" | "plus" |
|---|---|---|---|
| #1 | Is it a number word? | 0.95 | 0.05 |
| #2 | Is it an operation? | 0.05 | 0.92 |
| #3 | Is it a small number (0-10)? | 0.88 | 0.00 |
| #4 | Is it addition-related? | 0.10 | 0.95 |
| ... | ... | ... | ... |
| #64 | (something the model found useful) | 0.23 | 0.67 |
Notice how "two" and "three" have similar vectors (both are numbers), while "plus" is quite different (it's an operation). Words with similar meanings end up close together in this space.
Our input becomes:
[2, 12, 3] → [[0.9, 0.1, ...], [0.1, 0.8, ...], [0.85, 0.15, ...]]
"two" "plus" "three"
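Mechanically, the embedding step is just a table lookup: one row of 64 numbers per vocabulary word, indexed by token ID. The values below are random stand-ins (a real model learns them during training):

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE, EMBED_DIM = 30, 64

# One 64-number vector per vocabulary word. Initialized randomly here;
# in a real model these values are learned during training.
embedding_table = rng.normal(size=(VOCAB_SIZE, EMBED_DIM))

token_ids = [2, 12, 3]                 # "two plus three"
vectors = embedding_table[token_ids]   # look up one row per token
print(vectors.shape)                   # → (3, 64)
```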
Positional Encoding
We have a problem. Look at these two inputs:
"five minus three" → answer: "two"
"three minus five" → answer: "negative two"
The words are the same, but the order matters. With just embeddings, the model sees the same three vectors in both cases—it doesn't know which word came first!
The solution: we add position information to each embedding. Think of it like seat numbers in a theater—each word gets a fixed position marker:
"two plus three"
Position 1: "two" → [0.9, 0.1, ...] + [position 1 info] → [0.92, 0.15, ...]
Position 2: "plus" → [0.1, 0.8, ...] + [position 2 info] → [0.13, 0.85, ...]
Position 3: "three" → [0.85, 0.15, ...] + [position 3 info] → [0.88, 0.21, ...]
Now each vector contains both what the word is AND where it appears. The same word at different positions will have slightly different vectors.
The Transformer
Imagine you're reading a sentence and someone asks you what it means. You don't look at each word separately—you naturally consider how words relate to each other.
Right now, our model has three separate vectors for "two", "plus", and "three". But these vectors are like three people in separate rooms—they can't talk to each other. To solve "two plus three", the model needs to understand the relationship between these words.
This is what "attention" does. Think of it like a group discussion:
Imagine the words sitting in a meeting room:
"plus": Hey everyone, I'm an operation. Who am I working with?
"two": I'm a number! I'm sitting before you.
"three": I'm also a number! I'm sitting after you.
"plus": Got it—I need to ADD "two" and "three" together.
In technical terms, each word asks a question ("what's relevant to me?") and every other word offers an answer ("here's my information"). The model assigns an attention weight to each pair—a score from 0 to 1 indicating importance.
When "plus" looks at the other words, it assigns high weights (e.g. 0.8) to "two" and "three" because they're relevant, and would assign low weights (e.g. 0.1) to words that don't matter for its job. These weights determine how much each word influences the final understanding.
After this "discussion", each word's vector gets updated with information from the others:
| Word | Before attention | After attention |
|---|---|---|
| "two" | I'm the number 2 | I'm the number 2, AND I'm being added to something |
| "plus" | I'm an addition operation | I'm adding the number before me to the number after me |
| "three" | I'm the number 3 | I'm the number 3, AND I'm being added to something |
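One round of this "discussion" can be sketched as follows. This is deliberately simplified: a real transformer first projects each vector into separate query, key, and value vectors with learned weight matrices, while here we reuse the vectors directly:

```python
import numpy as np

def softmax(x):
    # Subtract the row max before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 64))  # position-aware vectors for "two plus three"

# Every word scores every other word (a 3x3 grid of relevance scores),
# then takes a weighted average of the other words' vectors.
scores = x @ x.T / np.sqrt(64)  # scale by sqrt(dim), as transformers do
weights = softmax(scores)       # each row sums to 1: the attention weights
updated = weights @ x           # context-enriched vectors
print(updated.shape)            # → (3, 64)
```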
The transformer repeats this "discussion" through multiple layers (we'll use 2-4). Each round of discussion refines the understanding:
- Layer 1: Basic relationships ("plus connects two numbers")
- Layer 2: Deeper understanding ("we're computing 2 + 3")
- Layer 3: Final insight ("the answer should be 5")
Here's a simplified view of how learning works:
Training example #1:
Input: "one plus two" → Model guesses: "seven" (wrong!)
Correct answer: "three"
Model adjusts: "Hmm, I should pay more attention to 'one' and 'two' when I see 'plus'"
Training example #2:
Input: "four plus three" → Model guesses: "six" (closer!)
Correct answer: "seven"
Model adjusts: "I'm getting better at addition, but need more practice"
...after 1000 examples...
Training example #1000:
Input: "five plus one" → Model guesses: "six" (correct!)
Model has learned: when "plus" appears, add the numbers around it
Each wrong answer nudges the model's internal numbers (weights) slightly. After thousands of nudges, the model has "learned" that 'plus' means addition—without us ever explicitly programming that rule.
It's like teaching a child math: you don't explain the neural pathways in their brain—you just show them "2 + 3 = 5" enough times, and their brain figures out the patterns.
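The nudging loop can be demonstrated with a toy model that is much simpler than a transformer: we want it to learn that "a plus b" equals a + b. The model form (two weights) and learning rate here are invented for illustration, but the adjust-after-every-error mechanic is the same:

```python
import random

random.seed(0)
w1, w2 = 0.1, 0.9   # arbitrary starting weights; correct answer is w1 = w2 = 1
lr = 0.01           # size of each nudge

for step in range(1000):
    a, b = random.randint(0, 9), random.randint(0, 9)
    guess = w1 * a + w2 * b
    error = guess - (a + b)   # how wrong were we?
    w1 -= lr * error * a      # nudge each weight slightly toward a better guess
    w2 -= lr * error * b

print(round(w1, 2), round(w2, 2))  # both end up near 1.0
```

Nobody told the model "the answer is a + b"; it only saw examples and errors, and the weights drifted to values that make the errors vanish.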
The Output Layer
After the transformer layers, the model has a rich understanding of "two plus three". But it's still just vectors—we need an actual answer.
The output layer asks: "Given everything I've learned about this input, which word in my vocabulary is most likely to come next?"
It scores every word in the vocabulary:
For input "two plus three", the model scores each possible answer:
"zero" → 0.1%
"one" → 0.2%
"two" → 0.3%
"three" → 0.5%
"four" → 2.1%
"five" → 94.2% ← highest!
"six" → 1.8%
"seven" → 0.4%
...
"plus" → 0.01%
"minus" → 0.01%
These percentages are called probabilities. They add up to 100% across all words in the vocabulary. The model is saying: "I'm 94.2% confident the answer is 'five'."
The Math Behind It
How does the model convert vectors into probabilities? Two steps:
1. Linear layer: Multiply the final vector by a weight matrix to get a "score" for each word. Higher score = model thinks this word is more likely.
final_vector (64 numbers) × weight_matrix (64 × 30) = scores (30 numbers)
scores = [-2.1, -1.8, -1.5, -0.9, 0.3, 4.2, 0.1, -0.5, ...]
zero one two three four five six seven
Where does the weight matrix come from? We create it, and the model learns its values during training.
- 64 = our embedding dimension (the size of each word vector)
- 30 = our vocabulary size (how many words the model knows)
2. Softmax: Convert raw scores into probabilities (0-100%) that sum to 100%. The formula is:
probability(word) = e^(score for word) / sum of e^(all scores)
For "five" with score 4.2:
e^4.2 = 66.7
sum of all e^scores = 70.8
probability = 66.7 / 70.8 = 94.2%
Softmax has a useful property: it makes high scores much higher and low scores much lower. A score of 4.2 vs 0.3 becomes 94.2% vs 2.1%. This makes the model "confident" in its best guess.
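We can check this arithmetic with the eight scores listed above. Note the real model scores all 30 vocabulary words, so the 94.2% in the text includes 22 more tiny terms in the denominator; with just these eight we land slightly higher:

```python
import math

# The eight scores shown above (zero, one, two, three, four, five, six, seven).
scores = [-2.1, -1.8, -1.5, -0.9, 0.3, 4.2, 0.1, -0.5]

exps = [math.exp(s) for s in scores]
probs = [e / sum(exps) for e in exps]

print(f"{probs[5]:.1%}")  # probability of "five" (score 4.2) → 94.4%
```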
Generation
We now have probabilities for every word. How do we pick the final answer? There are a few strategies:
Strategy 1: Greedy (pick the highest)
The simplest approach: always pick the word with the highest probability.
"five" → 94.2% ← Pick this one!
"four" → 2.1%
"six" → 1.8%
...
Output: "five"
This is called greedy decoding. It's deterministic—the same input always gives the same output. Perfect for math where there's only one right answer.
Strategy 2: Sampling (add randomness)
Instead of always picking the top word, we randomly choose based on the probabilities. Higher probability = more likely to be chosen, but not guaranteed.
Run 1: "five" (94.2% chance → picked!)
Run 2: "five" (94.2% chance → picked!)
Run 3: "four" (2.1% chance → lucky pick!)
Run 4: "five" (94.2% chance → picked!)
This adds variety. When writing a story, you don't want the same words every time. Sampling makes the model more creative.
Strategy 3: Temperature (control randomness)
We can adjust how "confident" the model is using a parameter called temperature:
- Low temperature (0.1): Makes high probabilities even higher. Model becomes very confident, less creative.
- Temperature = 1: Use probabilities as-is.
- High temperature (2.0): Flattens probabilities. Model becomes more random, more creative.
Original: "five" 94.2%, "four" 2.1%, "six" 1.8%
Low temp: "five" 99.9%, "four" 0.05%, "six" 0.03% (almost certain)
High temp: "five" 60%, "four" 15%, "six" 12% (more random)
What Do Real Models Use?
| Model/Use Case | Strategy | Why |
|---|---|---|
| ChatGPT (default) | Sampling + Temperature ~0.7 | Balanced creativity and coherence |
| Code generation (Copilot) | Low temperature ~0.2 | Code needs to be precise and correct |
| Creative writing | Higher temperature ~1.0+ | More surprising and varied outputs |
| Math/Reasoning | Greedy or very low temp | Only one right answer |
| Our calculator | Greedy | Math has no room for creativity! |
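Sampling and temperature can be sketched together in one function. The probabilities below are illustrative; applying temperature to log-probabilities, as done here, is equivalent to the usual trick of dividing the raw scores by the temperature before softmax:

```python
import math
import random

def sample_with_temperature(probs: dict[str, float], temperature: float) -> str:
    """Sharpen (T < 1) or flatten (T > 1) the distribution, then sample."""
    # Divide log-probabilities by the temperature, then re-normalize.
    logits = {w: math.log(p) / temperature for w, p in probs.items()}
    exps = {w: math.exp(l) for w, l in logits.items()}
    total = sum(exps.values())
    words = list(exps)
    weights = [exps[w] / total for w in words]
    return random.choices(words, weights=weights)[0]

probs = {"five": 0.942, "four": 0.021, "six": 0.018}  # illustrative numbers
random.seed(0)
print(sample_with_temperature(probs, temperature=0.1))  # almost always "five"
print(sample_with_temperature(probs, temperature=2.0))  # "four"/"six" far more often
```

At temperature 1.0 this reduces to plain sampling, and as the temperature approaches 0 it approaches greedy decoding.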
And that's it! The model outputs "five", and we've successfully computed "two plus three" = "five".
Summary
Here's what we did to convert "two plus three" into "five":
- Tokenization — Split text into words, convert to IDs → [2, 12, 3]
- Embedding — Convert each ID to a vector of 64 numbers → 3 vectors
- Positional Encoding — Add position info to each vector → 3 position-aware vectors
- Transformer — Let vectors "talk" via attention → 3 context-enriched vectors
- Output Layer — Score every word, convert to probabilities → "five" = 94.2%
- Generation — Pick the highest probability word → "five"