Language
On natural language processing: the strangeness of how meaning is encoded in text, and what it might mean for a machine — or a person — to understand.
-
Transformer Series (0/9) — Complete Architecture Walkthrough with Code
The full decoder-only Transformer in working PyTorch — from BPE tokenization to text generation, with every piece explained.
-
Transformer Series (1/9) — Token Embeddings & d_model
What is d_model, and how does a lookup table become the foundation of every LLM?
-
Transformer Series (2/9) — BPE Tokenization
How raw text becomes integers — and why the merge rules matter as much as the vocabulary.
-
Transformer Series (3/9) — Positional Encodings
Attention is permutation-invariant. Without a positional signal, 'dog bites man' and 'man bites dog' look identical.
-
Transformer Series (4/9) — Scaled Dot-Product Attention
The full attention formula, where Q, K, V come from, and why dot products measure relevance.
-
Transformer Series (5/9) — sqrt(d_k) Scaling & the @V Step
Why dividing by sqrt(d_k) isn't arbitrary — and what it means for a token to 'attend' to another.
-
Transformer Series (6/9) — Multi-Head Attention & Causal Masking
One attention head can only look for one thing at a time. Multi-head attention lets different heads specialize — and causal masking keeps training honest.
-
Transformer Series (7/9) — Residual Connections, Layer Norm & FFN
The three components that wrap every attention block — and why without them, deep networks couldn't be trained at all.
-
Transformer Series (8/9) — Encoder vs Decoder Architecture
Why the industry converged on decoder-only — and what was lost when the encoder was left behind.
-
Transformer Series (9/9) — Output Head & Training Essentials
How a hidden state becomes a probability distribution, and what cross-entropy loss, teacher forcing, and learning rate warmup actually do.
-
BPE Tokenization, Under the Hood (2/2)
Pre-tokenization, the merge loop, and the frequency count trick that makes BPE training fast enough to actually use.
-
BPE Tokenization, Explained (1/2)
What tokenization is, why BPE became dominant, and how it compares to other approaches — from word-level to Unigram LM.
-
PyTorch Basics: Building a Two-Layer MLP
Q&A from Day 1 of learning NLP — understanding the MLP pattern that lives at the heart of every transformer.
-
Attention Is All You Need — And All You Are
The transformer architecture offers a strange mirror. What does it mean to attend?