Part 1: Foundations
Build the mental model of how LLMs work before writing any code
This is a high-level, simplified explanation. We're intentionally oversimplifying to help you understand the core concepts first. The details will come in later parts.
What is an LLM?
Large because it has billions of parameters (the numbers the model adjusts during training to get better at predictions). Language because it works with text. Model because it's a mathematical function that learns patterns.
An LLM (Large Language Model) does one thing: it predicts the next word.
That's it. Everything else—chat, code generation, reasoning, translation—emerges from this single capability.
Important: The model doesn't "eat" the input and "spit out" an answer. It appends its prediction to the input. So "two plus three" becomes "two plus three five". This is called autoregressive generation—each new word is added to the sequence, then used to predict the next one.
The Core Insight
When you ask an LLM "What is 2+2?", it doesn't "think" or "calculate". It predicts that after the sequence of words "What is 2+2?", the most likely next words are "2+2 equals 4" or simply "4".
It learned this by reading billions of examples where questions were followed by answers.
Common Misconceptions
| Misconception | Reality |
|---|---|
| "It understands" | It predicts patterns |
| "It thinks" | It does matrix math |
| "It knows things" | It learned statistical relationships |
| "It remembers our chat" | Each input is processed fresh (the app re-sends the chat history every time) |
The Analogy
Think of your phone's autocomplete, but trained on the entire internet and scaled up a million times. When you type "How are", your phone suggests "you" because that pattern appears frequently. An LLM does the same thing, just with much longer contexts and much more sophisticated pattern matching.
Key takeaway: An LLM is a very sophisticated pattern-matching machine that predicts what text should come next.
The Complete Pipeline
Let's trace the complete journey of a single question through our calculator model:
| Input | Output |
|---|---|
| "two plus three" | "five" |
At a high level, the model learned during training that after the sequence "two plus three", the most probable next word is "five".
Let's understand how the model finds its path to "five":
Tokenization
Computers don't understand words—they only understand numbers. So we need to convert text into numbers. We do this by creating a vocabulary: a list of all words the model knows, where each word gets a unique ID.
For our calculator, the vocabulary might look like:
{ "zero": 0, "one": 1, "two": 2, "three": 3, ... "plus": 12, "minus": 13, ... }
Now we can convert our input:
"two plus three"
↓
["two", "plus", "three"] → split into words
↓
[2, 12, 3] → look up each word's ID
Each word becomes its ID from the vocabulary. The model never sees "two"—it only sees the number 2.
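The two steps above can be sketched in a few lines. The vocabulary below is a stand-in for illustration (a real model ships with a fixed vocabulary file):

```python
# Minimal tokenizer sketch for the calculator vocabulary.
# The IDs below are illustrative, matching the example mapping above.
VOCAB = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
    "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9,
    "plus": 12, "minus": 13,
}

def tokenize(text: str) -> list[int]:
    """Split on whitespace and map each word to its vocabulary ID."""
    return [VOCAB[word] for word in text.split()]

print(tokenize("two plus three"))  # → [2, 12, 3]
```

Real tokenizers split text into subword pieces rather than whole words, but the lookup idea is the same.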
Embedding
A single number like "2" doesn't tell the model much. Is "two" similar to "three"? Is it related to "plus"? To capture these relationships, we convert each token ID into a vector—a list of numbers that represents the word's meaning.
Think of it like coordinates. In a city, "123 Main St" tells you exactly where something is. Similarly, a vector like [0.2, -0.5, 0.8] places a word in "meaning space":
"two" → [0.9, 0.1, 0.2, ...] (e.g. 64 numbers)
"three" → [0.85, 0.15, 0.25, ...] (e.g. 64 numbers)
"plus" → [0.1, 0.8, 0.3, ...] (e.g. 64 numbers)
Why so many numbers? The number of values (e.g. 64) is called the "embedding dimension"—a choice we make when designing the model.
Imagine describing a person with just 2 numbers: height and weight. That's useful, but limited. Now add age, income, years of education—each number captures a new dimension of who they are. With more numbers, you can distinguish between people more precisely.
The same idea applies to words:
- 2 dimensions: Can barely distinguish words
- 64 dimensions: Enough for a simple task like our calculator
- 12,000+ dimensions: What the largest models use (GPT-3's embedding dimension was 12,288) to capture nuance in all of human language
More dimensions = more detail, but slower training. We'll use 64 as an example throughout this tutorial—it's small enough to understand but powerful enough for our calculator.
What do the individual numbers mean? We don't know! The model learns these values during training. But for our calculator, we can imagine what some dimensions might capture:
| Dimension | What it might represent | "two" | "plus" |
|---|---|---|---|
| #1 | Is it a number word? | 0.95 | 0.05 |
| #2 | Is it an operation? | 0.05 | 0.92 |
| #3 | Is it a small number (0-10)? | 0.88 | 0.00 |
| #4 | Is it addition-related? | 0.10 | 0.95 |
| ... | ... | ... | ... |
| #64 | (something the model found useful) | 0.23 | 0.67 |
Notice how "two" and "three" have similar vectors (both are numbers), while "plus" is quite different (it's an operation). Words with similar meanings end up close together in this space.
Our input becomes:
[2, 12, 3] → [[0.9, 0.1, ...], [0.1, 0.8, ...], [0.85, 0.15, ...]]
"two" "plus" "three"
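Mechanically, the embedding step is just a table lookup: one row of 64 numbers per vocabulary word, indexed by token ID. The values below are random stand-ins (a real model learns them during training):

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE, EMBED_DIM = 30, 64

# One 64-number vector per vocabulary word. Initialized randomly here;
# in a real model these values are learned during training.
embedding_table = rng.normal(size=(VOCAB_SIZE, EMBED_DIM))

token_ids = [2, 12, 3]                 # "two plus three"
vectors = embedding_table[token_ids]   # look up one row per token
print(vectors.shape)                   # → (3, 64)
```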
Positional Encoding
We have a problem. Look at these two inputs:
"five minus three" → answer: "two"
"three minus five" → answer: "negative two"
The words are the same, but the order matters. With just embeddings, the model sees the same three vectors in both cases—it doesn't know which word came first!
The solution: we add position information to each embedding. Think of it like seat numbers in a theater—each word gets a fixed position marker:
"two plus three"
Position 1: "two" → [0.9, 0.1, ...] + [position 1 info] → [0.92, 0.15, ...]
Position 2: "plus" → [0.1, 0.8, ...] + [position 2 info] → [0.13, 0.85, ...]
Position 3: "three" → [0.85, 0.15, ...] + [position 3 info] → [0.88, 0.21, ...]
Now each vector contains both what the word is AND where it appears. The same word at different positions will have slightly different vectors.
The Transformer
Imagine you're reading a sentence and someone asks you what it means. You don't look at each word separately—you naturally consider how words relate to each other.
Right now, our model has three separate vectors for "two", "plus", and "three". But these vectors are like three people in separate rooms—they can't talk to each other. To solve "two plus three", the model needs to understand the relationship between these words.
This is what "attention" does. Think of it like a group discussion:
Imagine the words sitting in a meeting room:
"plus": Hey everyone, I'm an operation. Who am I working with?
"two": I'm a number! I'm sitting before you.
"three": I'm also a number! I'm sitting after you.
"plus": Got it—I need to ADD "two" and "three" together.
In technical terms, each word asks a question ("what's relevant to me?") and every other word offers an answer ("here's my information"). The model assigns an attention weight to each pair—a score from 0 to 1 indicating importance.
When "plus" looks at the other words, it assigns high weights (e.g. 0.8) to "two" and "three" because they're relevant, and would assign low weights (e.g. 0.1) to words that don't matter for its job. These weights determine how much each word influences the final understanding.
After this "discussion", each word's vector gets updated with information from the others:
| Word | Before attention | After attention |
|---|---|---|
| "two" | I'm the number 2 | I'm the number 2, AND I'm being added to something |
| "plus" | I'm an addition operation | I'm adding the number before me to the number after me |
| "three" | I'm the number 3 | I'm the number 3, AND I'm being added to something |
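One round of this "discussion" can be sketched as follows. This is deliberately simplified: a real transformer first projects each vector into separate query, key, and value vectors with learned weight matrices, while here we reuse the vectors directly:

```python
import numpy as np

def softmax(x):
    # Subtract the row max before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 64))  # position-aware vectors for "two plus three"

# Every word scores every other word (a 3x3 grid of relevance scores),
# then takes a weighted average of the other words' vectors.
scores = x @ x.T / np.sqrt(64)  # scale by sqrt(dim), as transformers do
weights = softmax(scores)       # each row sums to 1: the attention weights
updated = weights @ x           # context-enriched vectors
print(updated.shape)            # → (3, 64)
```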
The transformer repeats this "discussion" through multiple layers (we'll use 2-4). Each round of discussion refines the understanding:
- Layer 1: Basic relationships ("plus connects two numbers")
- Layer 2: Deeper understanding ("we're computing 2 + 3")
- Layer 3: Final insight ("the answer should be 5")
Here's a simplified view of how learning works:
Training example #1:
Input: "one plus two" → Model guesses: "seven" (wrong!)
Correct answer: "three"
Model adjusts: "Hmm, I should pay more attention to 'one' and 'two' when I see 'plus'"
Training example #2:
Input: "four plus three" → Model guesses: "six" (closer!)
Correct answer: "seven"
Model adjusts: "I'm getting better at addition, but need more practice"
...after 1000 examples...
Training example #1000:
Input: "five plus one" → Model guesses: "six" (correct!)
Model has learned: when "plus" appears, add the numbers around it
Each wrong answer nudges the model's internal numbers (weights) slightly. After thousands of nudges, the model has "learned" that 'plus' means addition—without us ever explicitly programming that rule.
It's like teaching a child math: you don't explain the neural pathways in their brain—you just show them "2 + 3 = 5" enough times, and their brain figures out the patterns.
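The nudging loop can be demonstrated with a toy model that is much simpler than a transformer: we want it to learn that "a plus b" equals a + b. The model form (two weights) and learning rate here are invented for illustration, but the adjust-after-every-error mechanic is the same:

```python
import random

random.seed(0)
w1, w2 = 0.1, 0.9   # arbitrary starting weights; correct answer is w1 = w2 = 1
lr = 0.01           # size of each nudge

for step in range(1000):
    a, b = random.randint(0, 9), random.randint(0, 9)
    guess = w1 * a + w2 * b
    error = guess - (a + b)   # how wrong were we?
    w1 -= lr * error * a      # nudge each weight slightly toward a better guess
    w2 -= lr * error * b

print(round(w1, 2), round(w2, 2))  # both end up near 1.0
```

Nobody told the model "the answer is a + b"; it only saw examples and errors, and the weights drifted to values that make the errors vanish.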
The Output Layer
After the transformer layers, the model has a rich understanding of "two plus three". But it's still just vectors—we need an actual answer.
The output layer asks: "Given everything I've learned about this input, which word in my vocabulary is most likely to come next?"
It scores every word in the vocabulary:
For input "two plus three", the model scores each possible answer:
"zero" → 0.1%
"one" → 0.2%
"two" → 0.3%
"three" → 0.5%
"four" → 2.1%
"five" → 94.2% ← highest!
"six" → 1.8%
"seven" → 0.4%
...
"plus" → 0.01%
"minus" → 0.01%
These percentages are called probabilities. They add up to 100% across all words in the vocabulary. The model is saying: "I'm 94.2% confident the answer is 'five'."
The Math Behind It
How does the model convert vectors into probabilities? Two steps:
1. Linear layer: Multiply the final vector by a weight matrix to get a "score" for each word. Higher score = model thinks this word is more likely.
final_vector (64 numbers) × weight_matrix (64 × 30) = scores (30 numbers)
scores = [-2.1, -1.8, -1.5, -0.9, 0.3, 4.2, 0.1, -0.5, ...]
zero one two three four five six seven
Where does the weight matrix come from? We create it, and the model learns its values during training.
- 64 = our embedding dimension (the size of each word vector)
- 30 = our vocabulary size (how many words the model knows)
2. Softmax: Convert raw scores into probabilities (0-100%) that sum to 100%. The formula is:
probability(word) = e^(score for word) / sum of e^(all scores)
For "five" with score 4.2:
e^4.2 = 66.7
sum of all e^scores = 70.8
probability = 66.7 / 70.8 = 94.2%
Softmax has a useful property: it makes high scores much higher and low scores much lower. A score of 4.2 vs 0.3 becomes 94.2% vs 2.1%. This makes the model "confident" in its best guess.
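We can check this arithmetic with the eight scores listed above. Note the real model scores all 30 vocabulary words, so the 94.2% in the text includes 22 more tiny terms in the denominator; with just these eight we land slightly higher:

```python
import math

# The eight scores shown above (zero, one, two, three, four, five, six, seven).
scores = [-2.1, -1.8, -1.5, -0.9, 0.3, 4.2, 0.1, -0.5]

exps = [math.exp(s) for s in scores]
probs = [e / sum(exps) for e in exps]

print(f"{probs[5]:.1%}")  # probability of "five" (score 4.2) → 94.4%
```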
Generation
We now have probabilities for every word. How do we pick the final answer? There are a few strategies:
Strategy 1: Greedy (pick the highest)
The simplest approach: always pick the word with the highest probability.
"five" → 94.2% ← Pick this one!
"four" → 2.1%
"six" → 1.8%
...
Output: "five"
This is called greedy decoding. It's deterministic—the same input always gives the same output. Perfect for math where there's only one right answer.
Strategy 2: Sampling (add randomness)
Instead of always picking the top word, we randomly choose based on the probabilities. Higher probability = more likely to be chosen, but not guaranteed.
Run 1: "five" (94.2% chance → picked!)
Run 2: "five" (94.2% chance → picked!)
Run 3: "four" (2.1% chance → lucky pick!)
Run 4: "five" (94.2% chance → picked!)
This adds variety. When writing a story, you don't want the same words every time. Sampling makes the model more creative.
Strategy 3: Temperature (control randomness)
We can adjust how "confident" the model is using a parameter called temperature:
- Low temperature (0.1): Makes high probabilities even higher. Model becomes very confident, less creative.
- Temperature = 1: Use probabilities as-is.
- High temperature (2.0): Flattens probabilities. Model becomes more random, more creative.
Original: "five" 94.2%, "four" 2.1%, "six" 1.8%
Low temp: "five" 99.9%, "four" 0.05%, "six" 0.03% (almost certain)
High temp: "five" 60%, "four" 15%, "six" 12% (more random)
What Do Real Models Use?
| Model/Use Case | Strategy | Why |
|---|---|---|
| ChatGPT (default) | Sampling + Temperature ~0.7 | Balanced creativity and coherence |
| Code generation (Copilot) | Low temperature ~0.2 | Code needs to be precise and correct |
| Creative writing | Higher temperature ~1.0+ | More surprising and varied outputs |
| Math/Reasoning | Greedy or very low temp | Only one right answer |
| Our calculator | Greedy | Math has no room for creativity! |
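Sampling and temperature can be sketched together in one function. The probabilities below are illustrative; applying temperature to log-probabilities, as done here, is equivalent to the usual trick of dividing the raw scores by the temperature before softmax:

```python
import math
import random

def sample_with_temperature(probs: dict[str, float], temperature: float) -> str:
    """Sharpen (T < 1) or flatten (T > 1) the distribution, then sample."""
    # Divide log-probabilities by the temperature, then re-normalize.
    logits = {w: math.log(p) / temperature for w, p in probs.items()}
    exps = {w: math.exp(l) for w, l in logits.items()}
    total = sum(exps.values())
    words = list(exps)
    weights = [exps[w] / total for w in words]
    return random.choices(words, weights=weights)[0]

probs = {"five": 0.942, "four": 0.021, "six": 0.018}  # illustrative numbers
random.seed(0)
print(sample_with_temperature(probs, temperature=0.1))  # almost always "five"
print(sample_with_temperature(probs, temperature=2.0))  # "four"/"six" far more often
```

At temperature 1.0 this reduces to plain sampling, and as the temperature approaches 0 it approaches greedy decoding.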
And that's it! The model outputs "five", and we've successfully computed "two plus three" = "five".
Summary
Here's what we did to convert "two plus three" into "five":
- Tokenization — Split text into words, convert to IDs → [2, 12, 3]
- Embedding — Convert each ID to a vector of 64 numbers → 3 vectors
- Positional Encoding — Add position info to each vector → 3 position-aware vectors
- Transformer — Let vectors "talk" via attention → 3 context-enriched vectors
- Output Layer — Score every word, convert to probabilities → "five" = 94.2%
- Generation — Pick the highest probability word → "five"