
Part 3: Tokenization & Embeddings

Convert text to meaningful number representations that the model can process

What We'll Build

In this part, we'll build the complete input pipeline that converts text into vectors the transformer can process:

Stage               Output
Input               "two plus three"
↓ Tokenizer         [5, 31, 6]
↓ Embedding         3 vectors of 64 numbers each
↓ + Positions       3 position-aware vectors
Ready for Part 4    Transformer input

Here's what we'll build and how it compares to production models:

Section       What We Do            At Scale (GPT-4, LLaMA)
Vocabulary    36 tokens (manual)    ~100K tokens using BPE
Tokenizer     Word split + lookup   Subword tokenization
Embeddings    64 dimensions         12,288 dimensions
Positions     Learned embeddings    RoPE

The Vocabulary

A vocabulary is simply a mapping from words to numbers. Every word the model knows gets a unique ID.

Our Vocabulary

For our calculator, we manually list all 36 words:

vocabulary = {
    # Special tokens
    "[PAD]": 0,    # Padding for batch processing
    "[START]": 1,  # Start of sequence
    "[END]": 2,    # End of sequence

    # Numbers 0-19
    "zero": 3, "one": 4, "two": 5, "three": 6, "four": 7,
    "five": 8, "six": 9, "seven": 10, "eight": 11, "nine": 12,
    "ten": 13, "eleven": 14, "twelve": 15, "thirteen": 16,
    "fourteen": 17, "fifteen": 18, "sixteen": 19, "seventeen": 20,
    "eighteen": 21, "nineteen": 22,

    # Tens
    "twenty": 23, "thirty": 24, "forty": 25, "fifty": 26,
    "sixty": 27, "seventy": 28, "eighty": 29, "ninety": 30,

    # Operations
    "plus": 31, "minus": 32, "times": 33, "divided": 34, "by": 35,
}

Each word becomes its ID: "two" → 5, "plus" → 31, "three" → 6

Special tokens: [START] and [END] mark sequence boundaries—our tokenizer adds these automatically. The model learns that [END] means "stop generating." [PAD] fills shorter sequences when batching.
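
For example, padding two encoded sequences to the same length might look like the sketch below (pad_batch is a hypothetical helper shown only to illustrate what [PAD] is for; it is not part of the tokenizer we build):

# Hypothetical helper: pad every sequence in a batch
# to the length of the longest one using the [PAD] ID (0).
def pad_batch(sequences: list[list[int]], pad_id: int = 0) -> list[list[int]]:
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

print(pad_batch([[1, 5, 31, 6, 2],    # "[START] two plus three [END]"
                 [1, 13, 2]]))        # "[START] ten [END]"
# [[1, 5, 31, 6, 2],
#  [1, 13, 2, 0, 0]]  <- shorter sequence filled with [PAD] (ID 0)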

At Scale: BPE Tokenization

Real models like GPT-4 and LLaMA use Byte Pair Encoding (BPE) to automatically build vocabularies of ~100,000 tokens:

  • Words split into subwords: "unhappiness" → ["un", "happi", "ness"]
  • Handles any word, any language, even misspellings
  • Vocabulary learned from training corpus, not manually created
  • Tools: sentencepiece, tiktoken, Hugging Face tokenizers
Why subwords? With word-level tokens, "running" and "runs" are completely different. With subwords, both share "run" and the model learns they're related.
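
If you want to see subword splitting in action, here's a small sketch using tiktoken (assuming it is installed; the exact splits depend on the learned vocabulary, so treat the output as illustrative):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for word in ["unhappiness", "running", "runs"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]   # decode each ID back to its subword piece
    print(word, "->", pieces)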

The Tokenizer

The tokenizer converts text to token IDs and back. Its two main methods are encode and decode, plus a normalize helper for messy input:

class Tokenizer:
    def __init__(self, vocabulary: dict[str, int]):
        self.word_to_id = vocabulary
        self.id_to_word = {id: word for word, id in vocabulary.items()}

    def normalize(self, text: str) -> str:
        """Handle variations like 'thirtysix' or '+'."""
        # Lowercase first so the rules below also catch "Thirtysix" or "PLUS"
        text = text.lower()

        # Split compound numbers: "thirtysix" → "thirty six"
        tens = ["twenty", "thirty", "forty", "fifty",
                "sixty", "seventy", "eighty", "ninety"]
        units = ["one", "two", "three", "four", "five",
                 "six", "seven", "eight", "nine"]
        for ten in tens:
            for unit in units:
                text = text.replace(ten + unit, ten + " " + unit)

        # Replace symbols with words
        text = text.replace("+", " plus ").replace("-", " minus ")
        text = text.replace("*", " times ").replace("/", " divided by ")
        text = text.replace(",", "").replace(".", "").replace("?", "")
        return text

    def encode(self, text: str) -> list[int]:
        """Convert text to token IDs with [START] and [END]."""
        text = self.normalize(text)
        words = text.split()

        ids = [self.word_to_id["[START]"]]
        ids += [self.word_to_id[word] for word in words]
        ids += [self.word_to_id["[END]"]]
        return ids

    def decode(self, ids: list[int]) -> str:
        """Convert token IDs back to text."""
        words = [self.id_to_word[id] for id in ids]
        return " ".join(words)

Usage:

tokenizer = Tokenizer(vocabulary)

# Standard input
ids = tokenizer.encode("two plus three")
print(ids)  # [1, 5, 31, 6, 2]
#             [START] "two" "plus" "three" [END]

# Also handles variations!
ids = tokenizer.encode("thirtysix + seventytwo")
print(ids)  # [1, 24, 9, 31, 28, 5, 2]
#             [START] thirty six plus seventy two [END]

# Decode: IDs → text
text = tokenizer.decode([1, 5, 31, 6, 2])
print(text)  # "[START] two plus three [END]"

Why normalize? Users might type "thirtysix" (no space) or use symbols like "+". The normalize method splits compound words and converts symbols to our vocabulary words. This makes the demo robust without complicating the core tokenization logic.
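
You can call normalize on its own to see the intermediate step:

print(tokenizer.normalize("thirtysix + seventytwo"))
# "thirty six  plus  seventy two"
# The doubled spaces around "plus" are harmless: .split() in encode() discards them.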

At Scale

Production tokenizers use subword algorithms but have the same interface:

# OpenAI's tokenizer
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("Hello, world!")  # [9906, 11, 1917, 0]

# Hugging Face tokenizer
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
ids = tokenizer.encode("Hello, world!")  # [15496, 11, 995, 0]
Same concept: text → token IDs → text. The only difference is how the vocabulary is built and how words are split.

Embeddings

Token IDs are just numbers; they don't capture meaning. The embedding layer converts each ID into a vector (a list of numbers) that represents the word's meaning.

Quick PyTorch Primer

We'll use PyTorch, the most popular deep learning library. Here's what you need to know:

  • torch.Tensor — A multi-dimensional array (like NumPy) that can run on GPU
  • torch.nn — Neural network building blocks (layers, loss functions)
  • nn.Module — Base class for all neural network components. You inherit from it to create custom layers
  • nn.Embedding — A lookup table that maps integer IDs to vectors
Every layer we build will be an nn.Module. This gives us automatic parameter tracking, GPU support, and easy saving/loading. The forward() method defines what happens when data passes through the layer.

The Embedding Layer

import torch
import torch.nn as nn

class Embedding(nn.Module):
    def __init__(self, vocab_size: int = 36, embed_dim: int = 64):
        super().__init__()
        # Create a lookup table: vocab_size rows, embed_dim columns
        self.embedding = nn.Embedding(vocab_size, embed_dim)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Look up each token ID to get its vector
        return self.embedding(token_ids)

Usage:

embed = Embedding(vocab_size=36, embed_dim=64)

token_ids = torch.tensor([5, 31, 6])  # "two plus three"
vectors = embed(token_ids)

print(vectors.shape)  # torch.Size([3, 64])
# 3 tokens, each represented by 64 numbers

What's Inside?

The embedding layer is just a lookup table of random numbers that get adjusted during training:

# Inside the embedding layer (simplified)
#        dim0   dim1   dim2  ... dim63
# ID 0: [0.23, -0.45, 0.12, ..., 0.67]  ← "[PAD]"
# ID 1: [0.89, 0.34, -0.56, ..., 0.23]  ← "[START]"
# ID 2: [0.12, 0.78, 0.45, ..., -0.34]  ← "[END]"
# ID 3: [0.45, -0.23, 0.89, ..., 0.12]  ← "zero"
# ID 4: [0.67, 0.12, -0.45, ..., 0.56]  ← "one"
# ID 5: [0.34, 0.56, 0.23, ..., -0.78]  ← "two"
# ...
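
You can inspect the table directly using the embed layer from the usage example above; looking up an ID simply selects the matching row:

print(embed.embedding.weight.shape)    # torch.Size([36, 64]): one row per vocabulary entry
print(embed.embedding.weight.numel())  # 2304 learnable parameters (36 * 64)

# Looking up ID 5 ("two") returns exactly row 5 of the table
row_five = embed.embedding.weight[5]
looked_up = embed(torch.tensor([5]))[0]
print(torch.equal(row_five, looked_up))  # True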

At Scale

Model            Vocab Size   Embed Dim   Embedding Parameters
Our Calculator   36           64          2,304
GPT-2            50,257       768         38.6 million
GPT-3            50,257       12,288      617 million
GPT-4            ~100,000     ~12,288     ~1.2 billion
GPT-4's embedding layer alone (~1.2B parameters) is roughly 520,000× larger than ours. Same concept, vastly different scale.
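
The parameter counts in the table are just vocab_size × embed_dim; a quick sanity check:

# Embedding parameters = vocab_size * embed_dim (numbers from the table above)
for name, vocab, dim in [("Our Calculator", 36, 64),
                         ("GPT-2", 50_257, 768),
                         ("GPT-3", 50_257, 12_288),
                         ("GPT-4 (approx.)", 100_000, 12_288)]:
    print(f"{name}: {vocab * dim:,}")
# Our Calculator: 2,304
# GPT-2: 38,597,376                (~38.6 million)
# GPT-3: 617,558,016               (~617 million)
# GPT-4 (approx.): 1,228,800,000   (~1.2 billion)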

Positional Encoding

Embeddings don't know word order: "two plus three" and "three plus two" would produce the same vectors, just in different positions. We fix this by adding position information to each embedding.

class PositionalEncoding(nn.Module):
    def __init__(self, max_seq_len: int = 32, embed_dim: int = 64):
        super().__init__()
        # Learnable position embeddings
        self.pos_embedding = nn.Embedding(max_seq_len, embed_dim)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        seq_len = embeddings.size(0)
        positions = torch.arange(seq_len)  # [0, 1, 2, ...]
        pos_vectors = self.pos_embedding(positions)
        return embeddings + pos_vectors  # Add position info

Now the same word at different positions has different vectors:

# "three" at position 0 vs position 2
# vector("three", pos=0) ≠ vector("three", pos=2)

# This lets the model distinguish:
# "five minus three" → 2  (five at pos 0)
# "three minus five" → -2 (three at pos 0)

At Scale: Position Encoding Methods

Model                  Position Method      Max Length
Original Transformer   Sinusoidal (fixed)   512
GPT-2                  Learned              1,024
GPT-3                  Learned              2,048
GPT-4                  RoPE                 8,000-128,000
LLaMA                  RoPE                 4,000-100,000+

RoPE (Rotary Position Embedding) is the modern standard:

  • Encodes relative position, not just absolute
  • Can extrapolate to longer sequences than trained on
  • Mathematically elegant (rotates vectors based on position)
We use simple learned positions. The concept is identical: tell the model where each token is in the sequence. Limitation: if we train with max_seq_len=32, learned positions fail on the 33rd token because there's no embedding for it. RoPE avoids this by encoding relative positions mathematically; a minimal sketch of the idea follows below.
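
Here is a minimal, illustrative sketch of the rotation idea behind RoPE (not the optimized form; real models apply it to queries and keys inside attention):

import torch

def rope_rotate(x: torch.Tensor) -> torch.Tensor:
    """Rotate each (even, odd) pair of dimensions by a position-dependent angle.

    x: [seq_len, dim] with dim even. Because the rotation angle grows with position,
    dot products between two rotated vectors depend on their relative distance.
    """
    seq_len, dim = x.shape
    positions = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)        # [seq, 1]
    freqs = 10000.0 ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)   # [dim/2]
    angles = positions * freqs                                                  # [seq, dim/2]
    cos, sin = angles.cos(), angles.sin()

    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

vecs = torch.randn(3, 64)          # three 64-dim vectors, as in our examples
rotated = rope_rotate(vecs)
print(rotated.shape)               # torch.Size([3, 64])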

The Complete Input Pipeline

Let's combine everything into a single class:

Production note: In real systems, the tokenizer lives outside the model—tokenization happens on CPU (often in parallel) before tensors hit the GPU. We bundle it here for simplicity.

class InputEmbedding(nn.Module):
    def __init__(self, vocab_size: int = 36, embed_dim: int = 64, max_seq_len: int = 32):
        super().__init__()
        self.tokenizer = Tokenizer(vocabulary)
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.pos_embedding = nn.Embedding(max_seq_len, embed_dim)

    def forward(self, text: str) -> torch.Tensor:
        # Step 1: Tokenize
        token_ids = self.tokenizer.encode(text)
        token_ids = torch.tensor(token_ids)

        # Step 2: Get token embeddings
        embeddings = self.embedding(token_ids)

        # Step 3: Add position embeddings
        positions = torch.arange(len(token_ids))
        pos_embeddings = self.pos_embedding(positions)
        embeddings = embeddings + pos_embeddings

        # Step 4: Add batch dimension [Seq, Dim] -> [Batch, Seq, Dim]
        return embeddings.unsqueeze(0)

Usage:

input_layer = InputEmbedding()
output = input_layer("two plus three")

print(output.shape)  # torch.Size([1, 3, 64])
# Batch=1, Seq=3 tokens, Dim=64
# Ready for the transformer!

At Scale

# Same pattern, different numbers (this assumes a production subword tokenizer inside)
input_layer = InputEmbedding(
    vocab_size=100000,
    embed_dim=12288,
    max_seq_len=8192
)
output = input_layer("Hello, how are you today?")
print(output.shape)  # torch.Size([1, 7, 12288])  Batch=1, 7 tokens, Dim=12288
Key insight: The pipeline is identical. Only the scale changes.

Summary

Here's what we built and how it compares to GPT-4:

Component          Our Model   GPT-4      Ratio
Vocabulary         36          ~100,000   2,800×
Embedding dim      64          12,288     192×
Embedding params   2,304       ~1.2B      520,000×
Max sequence       32          128,000    4,000×

What You Can Now Do

  • Build a vocabulary for any task
  • Convert text to token IDs and back
  • Convert token IDs to embeddings
  • Add position information to embeddings
  • Understand how this scales to real models