Build Your First LLM from Scratch
Part 3 · Section 13 of 13

Summary

Here's what we built and how it compares to GPT-4:

| Component           | Our Model | GPT-4    | Ratio     |
|---------------------|-----------|----------|-----------|
| Vocabulary size     | 36        | ~100,000 | ~2,800×   |
| Embedding dim       | 64        | 12,288   | 192×      |
| Embedding params    | 2,304     | ~1.2B    | ~520,000× |
| Max sequence length | 32        | 128,000  | 4,000×    |
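The embedding-parameter row follows directly from the first two: an embedding table holds one vector per vocabulary entry, so its parameter count is vocabulary size × embedding dimension. A quick check (the GPT-4 figures are rough public estimates, not official numbers):

```python
# Embedding table parameters = vocabulary size * embedding dimension.
our_params = 36 * 64                # our toy model
gpt4_params = 100_000 * 12_288     # rough estimate for GPT-4

print(our_params)                  # 2304
print(gpt4_params)                 # 1228800000, i.e. ~1.2B
print(gpt4_params // our_params)   # ratio on the order of 500,000x
```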

What You Can Now Do

  • Build a vocabulary for any task
  • Convert text to token IDs and back
  • Convert token IDs to embeddings
  • Add position information to embeddings
  • Understand how this scales to real models
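The steps above fit in a few lines. Here is a minimal end-to-end sketch using hypothetical names (a character-level vocabulary and random embedding tables, matching this series' toy dimensions of 64 and 32):

```python
import numpy as np

text = "hello world"

# Build a vocabulary for the task (character-level here).
vocab = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(vocab)}
itos = {i: ch for ch, i in stoi.items()}

# Convert text to token IDs and back.
ids = [stoi[ch] for ch in text]
assert "".join(itos[i] for i in ids) == text

# Convert token IDs to embeddings (random tables stand in for learned ones).
rng = np.random.default_rng(0)
embed_dim, max_seq = 64, 32
tok_emb = rng.normal(size=(len(vocab), embed_dim))  # token embedding table
pos_emb = rng.normal(size=(max_seq, embed_dim))     # position embedding table

# Add position information to the token embeddings.
x = tok_emb[ids] + pos_emb[: len(ids)]
print(x.shape)  # (11, 64): one 64-dim vector per token
```

Real models run the same pipeline, just with learned subword vocabularies and the much larger tables in the comparison above.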