Part 3: Tokenization & Embeddings
Convert text to meaningful number representations that the model can process
What We'll Build
In this part, we'll build the complete input pipeline that converts text into vectors the transformer can process:
| Stage | Output |
|---|---|
| Input | "two plus three" |
| ↓ Tokenizer | [5, 31, 6] |
| ↓ Embedding | 3 vectors of 64 numbers each |
| ↓ + Positions | 3 position-aware vectors |
| Ready for Part 4 | Transformer input |
Here's what we'll build and how it compares to production models:
| Section | What We Do | At Scale (GPT-4, LLaMA) |
|---|---|---|
| Vocabulary | 36 tokens (manual) | ~100K tokens using BPE |
| Tokenizer | Word split + lookup | Subword tokenization |
| Embeddings | 64 dimensions | 12,288 dimensions |
| Positions | Learned embeddings | RoPE |
A vocabulary is simply a mapping from words to numbers. Every word the model knows gets a unique ID.
Our Vocabulary
For our calculator, we manually list all 36 words:
vocabulary = {
# Special tokens
"[PAD]": 0, # Padding for batch processing
"[START]": 1, # Start of sequence
"[END]": 2, # End of sequence
# Numbers 0-19
"zero": 3, "one": 4, "two": 5, "three": 6, "four": 7,
"five": 8, "six": 9, "seven": 10, "eight": 11, "nine": 12,
"ten": 13, "eleven": 14, "twelve": 15, "thirteen": 16,
"fourteen": 17, "fifteen": 18, "sixteen": 19, "seventeen": 20,
"eighteen": 21, "nineteen": 22,
# Tens
"twenty": 23, "thirty": 24, "forty": 25, "fifty": 26,
"sixty": 27, "seventy": 28, "eighty": 29, "ninety": 30,
# Operations
"plus": 31, "minus": 32, "times": 33, "divided": 34, "by": 35,
}
Each word becomes its ID: "two" → 5, "plus" → 31, "three" → 6.
Special tokens: [START] and [END] mark sequence boundaries—our tokenizer adds these automatically. The model learns that [END] means "stop generating." [PAD] fills shorter sequences when batching.
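As a quick illustration of how [PAD] is used, here is a minimal padding sketch in plain Python (the batching code itself is illustrative, not part of our model; the IDs come from the vocabulary above):

```python
# Minimal padding sketch: shorter sequences get [PAD] (ID 0) appended
# so every row in a batch has the same length.
batch = [
    [1, 5, 31, 6, 2],   # [START] two plus three [END]
    [1, 13, 33, 8, 2],  # [START] ten times five [END]
    [1, 22, 2],         # [START] nineteen [END]
]
max_len = max(len(ids) for ids in batch)
padded = [ids + [0] * (max_len - len(ids)) for ids in batch]
print(padded)
# [[1, 5, 31, 6, 2], [1, 13, 33, 8, 2], [1, 22, 2, 0, 0]]
```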
At Scale: BPE Tokenization
Real models like GPT-4 and LLaMA use Byte Pair Encoding (BPE) to automatically build vocabularies of ~100,000 tokens:
- Words split into subwords: "unhappiness" → ["un", "happi", "ness"]
- Handles any word, any language, even misspellings
- Vocabulary learned from training corpus, not manually created
- Tools: `sentencepiece`, `tiktoken`, Hugging Face `tokenizers` (a toy merge step is sketched below)
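To make the idea concrete, here is a toy sketch of a single BPE merge step, a deliberate simplification rather than a production implementation: count adjacent symbol pairs across a tiny corpus and merge the most frequent pair everywhere.

```python
from collections import Counter

def merge(word: list[str], pair: tuple[str, str]) -> list[str]:
    """Replace every occurrence of `pair` in `word` with one merged symbol."""
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            out.append(word[i] + word[i + 1])
            i += 2
        else:
            out.append(word[i])
            i += 1
    return out

# Start from individual characters
corpus = [list("unhappiness"), list("happiest"), list("happy")]

# One merge step: find the most frequent adjacent pair and merge it
pairs = Counter((a, b) for word in corpus for a, b in zip(word, word[1:]))
best = max(pairs, key=pairs.get)
corpus = [merge(word, best) for word in corpus]
# Repeating this thousands of times grows subword units like "happ" or "ness".
```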
The tokenizer converts text to token IDs and back. It has two main methods, encode and decode, plus a small normalize helper:
class Tokenizer:
    def __init__(self, vocabulary: dict[str, int]):
        self.word_to_id = vocabulary
        self.id_to_word = {id: word for word, id in vocabulary.items()}

    def normalize(self, text: str) -> str:
        """Handle variations like 'thirtysix' or '+'."""
        text = text.lower()  # Lowercase first so the replacements below match
        # Split compound numbers: "thirtysix" → "thirty six"
        tens = ["twenty", "thirty", "forty", "fifty",
                "sixty", "seventy", "eighty", "ninety"]
        units = ["one", "two", "three", "four", "five",
                 "six", "seven", "eight", "nine"]
        for ten in tens:
            for unit in units:
                text = text.replace(ten + unit, ten + " " + unit)
        # Replace symbols with words
        text = text.replace("+", " plus ").replace("-", " minus ")
        text = text.replace("*", " times ").replace("/", " divided by ")
        text = text.replace(",", "").replace(".", "").replace("?", "")
        return text

    def encode(self, text: str) -> list[int]:
        """Convert text to token IDs with [START] and [END]."""
        text = self.normalize(text)
        words = text.split()
        ids = [self.word_to_id["[START]"]]
        ids += [self.word_to_id[word] for word in words]
        ids += [self.word_to_id["[END]"]]
        return ids

    def decode(self, ids: list[int]) -> str:
        """Convert token IDs back to text."""
        words = [self.id_to_word[id] for id in ids]
        return " ".join(words)
Usage:
tokenizer = Tokenizer(vocabulary)
# Standard input
ids = tokenizer.encode("two plus three")
print(ids) # [1, 5, 31, 6, 2]
# [START] "two" "plus" "three" [END]
# Also handles variations!
ids = tokenizer.encode("thirtysix + seventytwo")
print(ids) # [1, 24, 9, 31, 28, 5, 2]
# [START] thirty six plus seventy two [END]
# Decode: IDs → text
text = tokenizer.decode([1, 5, 31, 6, 2])
print(text) # "[START] two plus three [END]"
Why normalize? Users might type "thirtysix" (no space) or use symbols like "+". The normalize method splits compound words and converts symbols to our vocabulary words. This makes the demo robust without complicating the core tokenization logic.
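For a quick check of what normalize alone produces, using the Tokenizer defined above (the input string is just an arbitrary example):

```python
tokenizer = Tokenizer(vocabulary)
print(tokenizer.normalize("Sixtyfour / eight").split())
# ['sixty', 'four', 'divided', 'by', 'eight']
```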
At Scale
Production tokenizers use subword algorithms but have the same interface:
# OpenAI's tokenizer
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("Hello, world!") # [9906, 11, 1917, 0]
# Hugging Face tokenizer
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
ids = tokenizer.encode("Hello, world!") # [15496, 11, 995, 0]
Text → token IDs → text. The only difference is how the vocabulary is built and how words are split.
Token IDs are just numbers—they don't capture meaning. The embedding layer converts each ID into a vector (list of numbers) that represents the word's meaning.
Quick PyTorch Primer
We'll use PyTorch, the most popular deep learning library. Here's what you need to know:
- `torch.Tensor` — A multi-dimensional array (like NumPy) that can run on GPU
- `torch.nn` — Neural network building blocks (layers, loss functions)
- `nn.Module` — Base class for all neural network components. You inherit from it to create custom layers
- `nn.Embedding` — A lookup table that maps integer IDs to vectors
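Here is a tiny sketch of those pieces in action, nothing project-specific, just the library basics:

```python
import torch
import torch.nn as nn

t = torch.tensor([[1, 2], [3, 4]])   # a 2x2 tensor
print(t.shape)                       # torch.Size([2, 2])

table = nn.Embedding(num_embeddings=10, embedding_dim=4)  # 10 IDs, 4-dim vectors
print(table(torch.tensor([3, 7])).shape)                  # torch.Size([2, 4])
```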
All of our layers inherit from `nn.Module`. This gives us automatic parameter tracking, GPU support, and easy saving/loading. The `forward()` method defines what happens when data passes through the layer.
The Embedding Layer
import torch
import torch.nn as nn
class Embedding(nn.Module):
    def __init__(self, vocab_size: int = 36, embed_dim: int = 64):
        super().__init__()
        # Create a lookup table: vocab_size rows, embed_dim columns
        self.embedding = nn.Embedding(vocab_size, embed_dim)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Look up each token ID to get its vector
        return self.embedding(token_ids)
Usage:
embed = Embedding(vocab_size=36, embed_dim=64)
token_ids = torch.tensor([5, 31, 6]) # "two plus three"
vectors = embed(token_ids)
print(vectors.shape) # torch.Size([3, 64])
# 3 tokens, each represented by 64 numbers
What's Inside?
The embedding layer is just a lookup table of random numbers that get adjusted during training:
# Inside the embedding layer (simplified)
# dim0 dim1 dim2 ... dim63
# ID 0: [0.23, -0.45, 0.12, ..., 0.67] ← "[PAD]"
# ID 1: [0.89, 0.34, -0.56, ..., 0.23] ← "[START]"
# ID 2: [0.12, 0.78, 0.45, ..., -0.34] ← "[END]"
# ID 3: [0.45, -0.23, 0.89, ..., 0.12] ← "zero"
# ID 4: [0.67, 0.12, -0.45, ..., 0.56] ← "one"
# ID 5: [0.34, 0.56, 0.23, ..., -0.78] ← "two"
# ...
At Scale
| Model | Vocab Size | Embed Dim | Embedding Parameters |
|---|---|---|---|
| Our Calculator | 36 | 64 | 2,304 |
| GPT-2 | 50,257 | 768 | 38.6 million |
| GPT-3 | 50,257 | 12,288 | 617 million |
| GPT-4 | ~100,000 | ~12,288 | ~1.2 billion |
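Whatever the scale, the table is stored as the layer's weight matrix, which you can inspect directly (a quick sketch using the Embedding class and imports from above):

```python
embed = Embedding(vocab_size=36, embed_dim=64)
print(embed.embedding.weight.shape)   # torch.Size([36, 64]): one row per token ID
print(embed.embedding.weight[5, :4])  # first 4 numbers of the row for "two" (random until trained)
```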
Embeddings don't know word order—"two plus three" and "three plus two" would have the same vectors in different positions. We fix this by adding position information to each embedding.
class PositionalEncoding(nn.Module):
    def __init__(self, max_seq_len: int = 32, embed_dim: int = 64):
        super().__init__()
        # Learnable position embeddings
        self.pos_embedding = nn.Embedding(max_seq_len, embed_dim)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        seq_len = embeddings.size(0)
        positions = torch.arange(seq_len)  # [0, 1, 2, ...]
        pos_vectors = self.pos_embedding(positions)
        return embeddings + pos_vectors  # Add position info
Now the same word at different positions has different vectors:
# "three" at position 0 vs position 2
# vector("three", pos=0) ≠ vector("three", pos=2)
# This lets the model distinguish:
# "five minus three" → 2 (five at pos 0)
# "three minus five" → -2 (three at pos 0)
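A quick sketch to verify this, using the Embedding and PositionalEncoding classes defined above (untrained, so the vectors are random but already position-dependent):

```python
embed = Embedding(vocab_size=36, embed_dim=64)
pos_enc = PositionalEncoding(max_seq_len=32, embed_dim=64)

five_minus_three = pos_enc(embed(torch.tensor([8, 32, 6])))  # "five minus three"
three_minus_five = pos_enc(embed(torch.tensor([6, 32, 8])))  # "three minus five"

# "three" at position 2 vs position 0: same token, different vectors
print(torch.allclose(five_minus_three[2], three_minus_five[0]))  # False
```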
At Scale: Position Encoding Methods
| Model | Position Method | Max Length |
|---|---|---|
| Original Transformer | Sinusoidal (fixed) | 512 |
| GPT-2 | Learned | 1,024 |
| GPT-3 | Learned | 2,048 |
| GPT-4 | RoPE | 8,000-128,000 |
| LLaMA | RoPE | 4,000-100,000+ |
RoPE (Rotary Position Embedding) is the modern standard:
- Encodes relative position, not just absolute
- Can extrapolate to longer sequences than trained on
- Mathematically elegant (rotates vectors based on position; see the sketch below)
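Here is a minimal sketch of the rotation idea, assuming the interleaved-pair formulation from the RoPE paper; real implementations apply this inside attention to the query and key vectors rather than to the input embeddings:

```python
import torch

def rope(x: torch.Tensor) -> torch.Tensor:
    """Rotate (even, odd) dimension pairs of x by an angle that grows with position.
    x has shape [seq_len, dim]; dim must be even."""
    seq_len, dim = x.shape
    # One frequency per dimension pair (early dims rotate fastest)
    freqs = 1.0 / (10000 ** (torch.arange(0, dim, 2).float() / dim))   # [dim/2]
    angles = torch.arange(seq_len).float()[:, None] * freqs[None, :]   # [seq_len, dim/2]
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[:, 0::2] = x[:, 0::2] * cos - x[:, 1::2] * sin
    out[:, 1::2] = x[:, 0::2] * sin + x[:, 1::2] * cos
    return out
```

Because the rotation angle depends only on position, the dot product between two rotated vectors depends on their relative offset, which is what gives RoPE its relative-position behavior.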
Let's combine everything into a single class:
class InputEmbedding(nn.Module):
    def __init__(self, vocab_size: int = 36, embed_dim: int = 64, max_seq_len: int = 32):
        super().__init__()
        self.tokenizer = Tokenizer(vocabulary)
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.pos_embedding = nn.Embedding(max_seq_len, embed_dim)

    def forward(self, text: str) -> torch.Tensor:
        # Step 1: Tokenize
        token_ids = self.tokenizer.encode(text)
        token_ids = torch.tensor(token_ids)
        # Step 2: Get token embeddings
        embeddings = self.embedding(token_ids)
        # Step 3: Add position embeddings
        positions = torch.arange(len(token_ids))
        pos_embeddings = self.pos_embedding(positions)
        embeddings = embeddings + pos_embeddings
        # Step 4: Add batch dimension [Seq, Dim] -> [Batch, Seq, Dim]
        return embeddings.unsqueeze(0)
Usage:
input_layer = InputEmbedding()
output = input_layer("two plus three")
print(output.shape) # torch.Size([1, 5, 64])
# Batch=1, Seq=5 tokens ([START] two plus three [END]), Dim=64
# Ready for the transformer!
At Scale
# Same pattern, different numbers
input_layer = InputEmbedding(
vocab_size=100000,
embed_dim=12288,
max_seq_len=8192
)
output = input_layer("Hello, how are you today?")  # assumes a subword tokenizer swapped in for ours
print(output.shape) # torch.Size([1, 7, 12288])
Summary
Here's what we built and how it compares to GPT-4:
| Component | Our Model | GPT-4 | Ratio |
|---|---|---|---|
| Vocabulary | 36 | ~100,000 | 2,800× |
| Embedding dim | 64 | 12,288 | 192× |
| Embedding params | 2,304 | ~1.2B | 520,000× |
| Max sequence | 32 | 128,000 | 4,000× |
What You Can Now Do
- Build a vocabulary for any task
- Convert text to token IDs and back
- Convert token IDs to embeddings
- Add position information to embeddings
- Understand how this scales to real models