Language
On natural language processing, the strangeness of how meaning is encoded in text, and what it might mean for a machine — or a person — to understand.
Foundations
Tokenization
3 posts
- BPE Tokenization, Explained (1/2)
What tokenization is, why BPE became dominant, and how it compares to other approaches — from word-level to Unigram LM.
- BPE Tokenization, Under the Hood (2/2)
Pre-tokenization, the merge loop, and the frequency count trick that makes BPE training fast enough to actually use.
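In code, the merge loop is only a few lines. A minimal sketch in plain Python (the fast incremental pair-count bookkeeping the post describes is omitted; this version recounts from scratch on every merge):

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Learn BPE merge rules from a pre-tokenized corpus.

    `words` maps each word, as a tuple of symbols, to its frequency.
    """
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the winning pair with one merged symbol.
        merged = {}
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        words = merged
    return merges

corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
          ("n", "e", "w", "e", "s", "t"): 6, ("w", "i", "d", "e", "s", "t"): 3}
print(bpe_train(corpus, 3))  # [('e', 's'), ('es', 't'), ('l', 'o')]
```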
- PyTorch Basics: Building a Two-Layer MLP
Q&A from Day 1 of learning NLP — understanding the MLP pattern that lives at the heart of every transformer.
The Transformer
10 posts
- Transformer (0/9) — Complete Architecture Walkthrough with Code
The full decoder-only Transformer in working PyTorch — from BPE tokenization to text generation, with every piece explained.
- Transformer (1/9) — Token Embeddings & d_model
What is d_model, and how does a lookup table become the foundation of every LLM?
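The lookup table itself is one line of PyTorch. A sketch with illustrative sizes (for scale, GPT-2 uses a vocabulary of 50257 and d_model of 768):

```python
import torch
import torch.nn as nn

d_model, vocab_size = 512, 50000               # illustrative sizes
embedding = nn.Embedding(vocab_size, d_model)  # the lookup table: one row per token id

token_ids = torch.tensor([[15496, 995]])       # one sequence of 2 token ids
vectors = embedding(token_ids)                 # each id selects a row of the table
print(vectors.shape)                           # torch.Size([1, 2, 512])
```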
- Transformer (2/9) — BPE Tokenization
How raw text becomes integers — and why the merge rules matter as much as the vocabulary.
- Transformer (3/9) — Positional Encodings
Attention is permutation-invariant. Without a positional signal, 'dog bites man' and 'man bites dog' look identical.
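That claim is easy to verify. A toy single-head attention with Q = K = V = x (sizes illustrative): permuting the input tokens just permutes the output rows, so without a positional signal the model cannot tell the two sentences apart.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(5, 16)                     # 5 tokens, d_model = 16
perm = torch.randperm(5)

def self_attn(x):
    scores = x @ x.T / x.shape[-1] ** 0.5  # toy attention with Q = K = V = x
    return F.softmax(scores, dim=-1) @ x

# Shuffling the tokens shuffles the outputs identically: attention alone
# carries no information about where each token sits in the sequence.
print(torch.allclose(self_attn(x)[perm], self_attn(x[perm]), atol=1e-6))  # True
```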
- Transformer (4/9) — Scaled Dot-Product Attention
The full attention formula, where Q, K, V come from, and why dot products measure relevance.
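The formula in runnable form, a minimal sketch with illustrative shapes:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # query-key relevance scores
    weights = F.softmax(scores, dim=-1)            # each query's row becomes a distribution
    return weights @ V                             # weighted average of value vectors

Q = torch.randn(1, 4, 64)
K = torch.randn(1, 6, 64)
V = torch.randn(1, 6, 64)
print(scaled_dot_product_attention(Q, K, V).shape)  # torch.Size([1, 4, 64])
```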
- Transformer (5/9) — sqrt(d_k) Scaling & the @V Step
Why dividing by sqrt(d_k) isn't arbitrary — and what it means for a token to 'attend' to another.
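The scaling argument, checked numerically: with unit-variance entries, a d_k-dimensional dot product has variance d_k, and dividing by sqrt(d_k) restores variance 1 (sizes illustrative):

```python
import torch

torch.manual_seed(0)
d_k = 64
q = torch.randn(100_000, d_k)  # entries ~ N(0, 1)
k = torch.randn(100_000, d_k)

dots = (q * k).sum(-1)                   # one dot product per row
print(dots.var().item())                 # ~64: variance grows with d_k
print((dots / d_k ** 0.5).var().item())  # ~1: scaling restores unit variance
```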
- Transformer (6/9) — Multi-Head Attention & Causal Masking
One attention head can only look for one thing at a time. Multi-head attention lets different heads specialize — and causal masking keeps training honest.
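The masking half in code, a minimal sketch: future positions get a score of -inf, so softmax assigns them zero weight.

```python
import torch
import torch.nn.functional as F

T = 5
scores = torch.randn(T, T)  # raw attention scores for a 5-token sequence
mask = torch.triu(torch.ones(T, T), diagonal=1).bool()  # True above the diagonal

# -inf scores become zero attention weight after softmax.
weights = F.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)
print(weights)  # lower-triangular: token t attends only to tokens 0..t
```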
- Transformer (7/9) — Residual Connections, Layer Norm & FFN
The three components that wrap every attention block — and why without them, deep networks couldn't be trained at all.
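A sketch of one block in the pre-LN arrangement most modern decoder-only models use (the post may order the pieces differently; the 4x FFN expansion is the conventional choice, not quoted from it):

```python
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """One block in the pre-LN arrangement: x + Sublayer(LayerNorm(x))."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(              # expand, nonlinearity, contract
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual around attention
        x = x + self.ffn(self.ln2(x))                      # residual around the FFN
        return x

block = PreLNBlock(64, 4)
print(block(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```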
- Transformer (8/9) — Encoder vs Decoder Architecture
Why the industry converged on decoder-only — and what was lost when the encoder was left behind.
- Transformer (9/9) — Output Head & Training Essentials
How a hidden state becomes a probability distribution, and what cross-entropy loss, teacher forcing, and learning rate warmup actually do.
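The head itself is a single projection plus a shifted cross-entropy. A sketch with illustrative sizes and random tensors standing in for a trained model:

```python
import torch
import torch.nn.functional as F

B, T, d_model, vocab = 2, 8, 64, 1000
hidden = torch.randn(B, T, d_model)                  # final-layer hidden states
W_out = torch.randn(d_model, vocab) / d_model**0.5   # the output (unembedding) projection

logits = hidden @ W_out                              # (B, T, vocab): one score per vocab entry
tokens = torch.randint(0, vocab, (B, T + 1))
targets = tokens[:, 1:]                              # teacher forcing: position t predicts token t+1

loss = F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1))
print(loss.item())  # a bit above log(1000) ~ 6.9, the uniform-guessing baseline
```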
Efficient Attention
4 posts
- FlashAttention (1/4) — GPU Memory Hierarchy: HBM, SRAM, and Why It Matters
Understanding the two-level memory system in modern GPUs and why some operations are memory-bound while others are compute-bound.
- FlashAttention (2/4) — Why Attention Is Memory-Bound: The N×N Problem
The attention score matrix is always N×N per head regardless of d_k, and naive attention writes it to HBM three times. Here's why.
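The back-of-envelope version of the problem (sequence length, head count, and dtype here are illustrative, not the post's numbers):

```python
# Size of the attention score matrices for ONE layer at sequence length N.
N, heads, bytes_fp16 = 8192, 32, 2  # illustrative sizes

score_bytes = N * N * heads * bytes_fp16
print(f"{score_bytes / 2**30:.1f} GiB")  # 4.0 GiB per layer, per sequence
```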
- FlashAttention (3/4) — Tiling and Online Softmax Explained
A step-by-step walkthrough of how FlashAttention eliminates the N×N memory bottleneck using tiling and incremental softmax — with concrete numbers.
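A minimal sketch of the online-softmax recurrence for a single query row (FlashAttention applies the same update to whole tiles of queries at once):

```python
import torch

def online_softmax_weighted_sum(scores, values, block=4):
    """Streaming softmax(scores) @ values, one block at a time.

    Keeps only a running max `m`, denominator `l`, and accumulator `acc`,
    so the full score row never has to exist in memory at once.
    """
    m = torch.tensor(float("-inf"))
    l = torch.tensor(0.0)
    acc = torch.zeros(values.shape[-1])
    for i in range(0, scores.shape[0], block):
        s, v = scores[i:i + block], values[i:i + block]
        m_new = torch.maximum(m, s.max())
        correction = torch.exp(m - m_new)  # rescale old state to the new max
        p = torch.exp(s - m_new)
        l = l * correction + p.sum()
        acc = acc * correction + p @ v
        m = m_new
    return acc / l

torch.manual_seed(0)
scores, values = torch.randn(16), torch.randn(16, 8)
exact = torch.softmax(scores, dim=0) @ values
print(torch.allclose(online_softmax_weighted_sum(scores, values), exact, atol=1e-6))  # True
```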
- FlashAttention (4/4) — FlashAttention in Code
A minimal Python/PyTorch implementation of FlashAttention's online softmax and tiling algorithm, with comparison to naive attention.
Training Fundamentals
4 posts
- Training Fundamentals (1/4) — AdamW: Why Every LLM Uses It
From vanilla SGD to Adam to AdamW — how adaptive learning rates and decoupled weight decay became the default optimizer for transformer training.
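Using it is one line; the "decoupled" part is that weight decay is applied directly to the weights rather than folded into Adam's gradient moments. Hyperparameters below are in the neighborhood of common LLM configs, not prescriptions:

```python
import torch

model = torch.nn.Linear(10, 10)

# Decoupled weight decay: each step also does w -= lr * wd * w, applied
# straight to the weights instead of through the adaptive gradient statistics.
opt = torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1)

loss = model(torch.randn(4, 10)).pow(2).mean()
loss.backward()
opt.step()
opt.zero_grad()
```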
- Training Fundamentals (2/4) — Learning Rate Scheduling
Step decay, exponential decay, cosine annealing, and the warmup + cosine pattern that became the standard for transformer training.
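The warmup + cosine pattern, as a small self-contained function (step counts and peak LR illustrative):

```python
import math

def lr_at(step, max_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to max_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

schedule = [lr_at(s, 3e-4, 100, 1000) for s in range(1000)]
print(max(schedule), schedule[-1])  # peaks at 3e-4, decays toward 0
```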
- Training Fundamentals (3/4) — Mixed Precision Training & Gradient Checkpointing
Two essential memory optimization techniques: reducing the size of stored values (FP16/BF16) and reducing the number of stored values (checkpointing).
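A minimal mixed-precision loop using PyTorch's autocast and GradScaler (assumes a CUDA device; checkpointing, the other half of the post, lives in torch.utils.checkpoint):

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = torch.cuda.amp.GradScaler()  # loss scaling guards FP16 grads from underflow

for _ in range(10):
    x = torch.randn(32, 1024, device="cuda")
    with torch.autocast("cuda", dtype=torch.float16):  # matmuls run in FP16
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()  # scale the loss up before backward
    scaler.step(opt)               # unscales grads; skips the step on inf/nan
    scaler.update()
    opt.zero_grad()
```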
- Training Fundamentals (4/4) — Scaling Laws: Kaplan, Chinchilla, and Beyond
How model performance relates to parameters, data, and compute — and why the industry shifted from 'make it bigger' to 'train it longer.'
Evaluation
12 posts
- LLM Evaluation (1/12) — Overview and the Benchmark Treadmill
Goodhart's Law, five phases of benchmark saturation, and why no single evaluation method has proven durably reliable.
- LLM Evaluation (2/12) — Benchmarks Landscape (1998–2026)
A complete reference table of the benchmarks that shaped LLM evaluation, organized by category.
- LLM Evaluation (3/12) — Methods Reference Table
Twelve evaluation paradigms, contamination detection methods, and diagnostic studies — each with strengths, weaknesses, and current status.
- LLM Evaluation (4/12) — Paradigms: Fine-Tuning, Few-Shot, Logit vs. Generative
From BERT-on-SQuAD to GPT-3's few-shot prompting, and the subtle but crucial gap between logit-based and generative evaluation.
- LLM Evaluation (5/12) — Contamination, Gaming, and the Evaluation Crisis
Three diagnostic studies — C-BOD, GSM1K, option-reordering — and the v1→v2 Open LLM Leaderboard collapse.
- LLM Evaluation (6/12) — How Model Releases Broke Evaluation
GPT-3, Chinchilla, ChatGPT, GPT-4, LLaMA, o1, DeepSeek-R1 — each release exposed a specific evaluation failure and catalyzed new methods.
- LLM Evaluation (7/12) — Chatbot Arena: Mechanics, Economics, Controversies
How LMSYS's crowdsourced leaderboard works, where Elo comes from, and why the Llama 4 incident exposed a structural conflict.
- LLM Evaluation (8/12) — The Null Model Attack on LLM-as-Judge
A model that outputs a single constant string — no reasoning, no task comprehension — achieved 86.5% win rate on AlpacaEval 2.0. How, and what it means.
- LLM Evaluation (9/12) — Safety: Red-Teaming, Multi-Turn Attacks, Intent Laundering
81% of safety benchmarks test predefined risks with fixed prompts. Multi-turn attacks hit 75% failure rates. Intent laundering hits 90–98%.
- LLM Evaluation (10/12) — Process-Based Evaluation, PRMs, and Faithfulness
When reasoning matters as much as the final answer: process reward models, the faithfulness problem, and why labeling steps is fundamentally hard.
- LLM Evaluation (11/12) — Fairness, HELM's Seven Dimensions, and Multilingual Gaps
HELM's multi-dimensional framework, the ReDial dialect study on AAVE, and why multilingual ≠ multicultural.
- LLM Evaluation (12/12) — Infrastructure, Leaderboards, and the Industry-Academia Divide
EleutherAI, HELM, OpenAI Evals, the Hugging Face Open LLM Leaderboard's rise and retirement, and where evaluation goes from here.
Post-Training
8 posts
- Post-Training (1/8) — Supervised Fine-Tuning (SFT)
How SFT transforms a base model into an instruction-following model using curated conversation data — the first step in the post-training pipeline.
- Post-Training (2/8) — The Reward Model
Training a model to score response quality using human preference data and the Bradley-Terry framework.
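The Bradley-Terry training objective in code, a sketch with random scores standing in for the reward model's outputs: maximize the log-probability that the human-preferred response outscores the other.

```python
import torch
import torch.nn.functional as F

# Bradley-Terry: P(chosen beats rejected) = sigmoid(r_chosen - r_rejected).
r_chosen = torch.randn(8, requires_grad=True)    # scalar rewards for preferred responses
r_rejected = torch.randn(8, requires_grad=True)  # rewards for the dispreferred ones

loss = -F.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()
print(loss.item())
```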
- Post-Training (3/8) — KL Divergence in RLHF
How KL divergence acts as a leash to prevent the policy model from drifting too far from the original SFT model.
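A sketch of one common per-token estimate of that leash (names illustrative): the gap between the log-probabilities the policy and the frozen reference assign to the sampled tokens, scaled by a coefficient beta and subtracted from the reward.

```python
import torch

logp_policy = torch.randn(16)  # log pi(token | context) under the current policy
logp_ref = torch.randn(16)     # same tokens scored by the frozen SFT reference
beta = 0.1

# Single-sample estimate of KL(pi || ref) at each generated token; drifting
# away from the reference costs reward in proportion to beta.
kl_per_token = logp_policy - logp_ref
penalized_reward = -beta * kl_per_token
print(kl_per_token.mean().item())
```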
- Post-Training (4/8) — PPO (Proximal Policy Optimization)
How PPO uses policy gradients, value networks, and clipping to safely update the language model using reward signals.
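The clipping piece, as a minimal sketch: importance-weight the advantage, clip the weight to [1-eps, 1+eps], and take the pessimistic of the two.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """PPO's clipped surrogate: one update can't move the policy too far."""
    ratio = torch.exp(logp_new - logp_old)              # importance weight
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()        # pessimistic of the two

logp_old = torch.randn(32)
logp_new = logp_old + 0.1 * torch.randn(32)
adv = torch.randn(32)
print(ppo_clip_loss(logp_new, logp_old, adv).item())
```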
- Post-Training (5/8) — DPO (Direct Preference Optimization)
How DPO simplifies RLHF by eliminating the reward model and RL loop, directly fine-tuning on preference data.
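The DPO objective itself, a minimal sketch over precomputed sequence log-probabilities: push the policy's preference margin, measured relative to the frozen reference, toward the human choice.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w_policy, logp_l_policy, logp_w_ref, logp_l_ref, beta=0.1):
    """DPO loss: no reward model, no RL loop, just this margin."""
    margin = beta * ((logp_w_policy - logp_w_ref) - (logp_l_policy - logp_l_ref))
    return -F.logsigmoid(margin).mean()

# Sequence log-probs for chosen (w) and rejected (l) responses.
lw_p, ll_p = torch.randn(8), torch.randn(8)  # under the policy
lw_r, ll_r = torch.randn(8), torch.randn(8)  # under the reference
print(dpo_loss(lw_p, ll_p, lw_r, ll_r).item())
```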
- Post-Training (6/8) — GRPO (Group Relative Policy Optimization)
How GRPO replaces PPO's value network with a group-mean baseline, enabling efficient RL with verifiable rewards.
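The baseline swap in code, a sketch: normalize each group's verifiable rewards by the group mean and standard deviation, no value network needed.

```python
import torch

# One prompt, a group of 8 sampled responses, each scored by a
# verifiable reward (e.g. 1.0 if the answer checks out, else 0.0).
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])

# The group statistics play the role of PPO's learned value baseline.
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
print(advantages)  # positive for above-group-average responses
```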
- Post-Training (7/8) — Reasoning Training (o1 / DeepSeek-R1)
How RL with verifiable rewards enables language models to discover reasoning strategies from scratch — the R1-Zero experiment and the hybrid R1 pipeline.
- Post-Training (8/8) — Constitutional AI and RLAIF
How AI feedback grounded in written principles replaces human preference labels — scaling preference learning but widening the proxy chain.