Language
On natural language processing, the strangeness of how meaning is encoded in text, and what it might mean for a machine — or a person — to understand.
Foundations
Tokenization
3 posts
- BPE Tokenization, Explained (1/2)
What tokenization is, why BPE became dominant, and how it compares to other approaches — from word-level to Unigram LM.
- BPE Tokenization, Under the Hood (2/2)
Pre-tokenization, the merge loop, and the frequency count trick that makes BPE training fast enough to actually use.
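In code, the merge loop is only a few lines. A minimal sketch in plain Python (the fast incremental pair-count bookkeeping the post describes is omitted; this version recounts from scratch on every merge):

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Learn BPE merge rules from a pre-tokenized corpus.

    `words` maps each word, as a tuple of symbols, to its frequency.
    """
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the winning pair with one merged symbol.
        merged = {}
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        words = merged
    return merges

corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
          ("n", "e", "w", "e", "s", "t"): 6, ("w", "i", "d", "e", "s", "t"): 3}
print(bpe_train(corpus, 3))  # [('e', 's'), ('es', 't'), ('l', 'o')]
```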
- PyTorch Basics: Building a Two-Layer MLP
Q&A from Day 1 of learning NLP — understanding the MLP pattern that lives at the heart of every transformer.
The Transformer
10 posts
- Transformer (0/9) — Complete Architecture Walkthrough with Code
The full decoder-only Transformer in working PyTorch — from BPE tokenization to text generation, with every piece explained.
- Transformer (1/9) — Token Embeddings & d_model
What is d_model, and how does a lookup table become the foundation of every LLM?
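The lookup table itself is one line of PyTorch. A sketch with illustrative sizes (for scale, GPT-2 uses a vocabulary of 50257 and d_model of 768):

```python
import torch
import torch.nn as nn

d_model, vocab_size = 512, 50000               # illustrative sizes
embedding = nn.Embedding(vocab_size, d_model)  # the lookup table: one row per token id

token_ids = torch.tensor([[15496, 995]])       # one sequence of 2 token ids
vectors = embedding(token_ids)                 # each id selects a row of the table
print(vectors.shape)                           # torch.Size([1, 2, 512])
```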
- Transformer (2/9) — BPE Tokenization
How raw text becomes integers — and why the merge rules matter as much as the vocabulary.
- Transformer (3/9) — Positional Encodings
Attention is permutation-invariant. Without a positional signal, 'dog bites man' and 'man bites dog' look identical.
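That claim is easy to verify. A toy single-head attention with Q = K = V = x (sizes illustrative): permuting the input tokens just permutes the output rows, so without a positional signal the model cannot tell the two sentences apart.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(5, 16)                     # 5 tokens, d_model = 16
perm = torch.randperm(5)

def self_attn(x):
    scores = x @ x.T / x.shape[-1] ** 0.5  # toy attention with Q = K = V = x
    return F.softmax(scores, dim=-1) @ x

# Shuffling the tokens shuffles the outputs identically: attention alone
# carries no information about where each token sits in the sequence.
print(torch.allclose(self_attn(x)[perm], self_attn(x[perm]), atol=1e-6))  # True
```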
- Transformer (4/9) — Scaled Dot-Product Attention
The full attention formula, where Q, K, V come from, and why dot products measure relevance.
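The formula in runnable form, a minimal sketch with illustrative shapes:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # query-key relevance scores
    weights = F.softmax(scores, dim=-1)            # each query's row becomes a distribution
    return weights @ V                             # weighted average of value vectors

Q = torch.randn(1, 4, 64)
K = torch.randn(1, 6, 64)
V = torch.randn(1, 6, 64)
print(scaled_dot_product_attention(Q, K, V).shape)  # torch.Size([1, 4, 64])
```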
- Transformer (5/9) — sqrt(d_k) Scaling & the @V Step
Why dividing by sqrt(d_k) isn't arbitrary — and what it means for a token to 'attend' to another.
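The scaling argument, checked numerically: with unit-variance entries, a d_k-dimensional dot product has variance d_k, and dividing by sqrt(d_k) restores variance 1 (sizes illustrative):

```python
import torch

torch.manual_seed(0)
d_k = 64
q = torch.randn(100_000, d_k)  # entries ~ N(0, 1)
k = torch.randn(100_000, d_k)

dots = (q * k).sum(-1)                   # one dot product per row
print(dots.var().item())                 # ~64: variance grows with d_k
print((dots / d_k ** 0.5).var().item())  # ~1: scaling restores unit variance
```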
- Transformer (6/9) — Multi-Head Attention & Causal Masking
One attention head can only look for one thing at a time. Multi-head attention lets different heads specialize — and causal masking keeps training honest.
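The masking half in code, a minimal sketch: future positions get a score of -inf, so softmax assigns them zero weight.

```python
import torch
import torch.nn.functional as F

T = 5
scores = torch.randn(T, T)  # raw attention scores for a 5-token sequence
mask = torch.triu(torch.ones(T, T), diagonal=1).bool()  # True above the diagonal

# -inf scores become zero attention weight after softmax.
weights = F.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)
print(weights)  # lower-triangular: token t attends only to tokens 0..t
```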
- Transformer (7/9) — Residual Connections, Layer Norm & FFN
The three components that wrap every attention block — and why without them, deep networks couldn't be trained at all.
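A sketch of one block in the pre-LN arrangement most modern decoder-only models use (the post may order the pieces differently; the 4x FFN expansion is the conventional choice, not quoted from it):

```python
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """One block in the pre-LN arrangement: x + Sublayer(LayerNorm(x))."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(              # expand, nonlinearity, contract
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual around attention
        x = x + self.ffn(self.ln2(x))                      # residual around the FFN
        return x

block = PreLNBlock(64, 4)
print(block(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```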
- Transformer (8/9) — Encoder vs Decoder Architecture
Why the industry converged on decoder-only — and what was lost when the encoder was left behind.
- Transformer (9/9) — Output Head & Training Essentials
How a hidden state becomes a probability distribution, and what cross-entropy loss, teacher forcing, and learning rate warmup actually do.
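The head itself is a single projection plus a shifted cross-entropy. A sketch with illustrative sizes and random tensors standing in for a trained model:

```python
import torch
import torch.nn.functional as F

B, T, d_model, vocab = 2, 8, 64, 1000
hidden = torch.randn(B, T, d_model)                  # final-layer hidden states
W_out = torch.randn(d_model, vocab) / d_model**0.5   # the output (unembedding) projection

logits = hidden @ W_out                              # (B, T, vocab): one score per vocab entry
tokens = torch.randint(0, vocab, (B, T + 1))
targets = tokens[:, 1:]                              # teacher forcing: position t predicts token t+1

loss = F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1))
print(loss.item())  # a bit above log(1000) ~ 6.9, the uniform-guessing baseline
```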
Efficient Attention
4 posts
- FlashAttention (1/4) — GPU Memory Hierarchy: HBM, SRAM, and Why It Matters
Understanding the two-level memory system in modern GPUs and why some operations are memory-bound while others are compute-bound.
- FlashAttention (2/4) — Why Attention Is Memory-Bound: The N×N Problem
The attention score matrix is always N×N per head regardless of d_k, and naive attention writes it to HBM three times. Here's why.
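The back-of-envelope version of the problem (sequence length, head count, and dtype here are illustrative, not the post's numbers):

```python
# Size of the attention score matrices for ONE layer at sequence length N.
N, heads, bytes_fp16 = 8192, 32, 2  # illustrative sizes

score_bytes = N * N * heads * bytes_fp16
print(f"{score_bytes / 2**30:.1f} GiB")  # 4.0 GiB per layer, per sequence
```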
- FlashAttention (3/4) — Tiling and Online Softmax Explained
A step-by-step walkthrough of how FlashAttention eliminates the N×N memory bottleneck using tiling and incremental softmax — with concrete numbers.
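A minimal sketch of the online-softmax recurrence for a single query row (FlashAttention applies the same update to whole tiles of queries at once):

```python
import torch

def online_softmax_weighted_sum(scores, values, block=4):
    """Streaming softmax(scores) @ values, one block at a time.

    Keeps only a running max `m`, denominator `l`, and accumulator `acc`,
    so the full score row never has to exist in memory at once.
    """
    m = torch.tensor(float("-inf"))
    l = torch.tensor(0.0)
    acc = torch.zeros(values.shape[-1])
    for i in range(0, scores.shape[0], block):
        s, v = scores[i:i + block], values[i:i + block]
        m_new = torch.maximum(m, s.max())
        correction = torch.exp(m - m_new)  # rescale old state to the new max
        p = torch.exp(s - m_new)
        l = l * correction + p.sum()
        acc = acc * correction + p @ v
        m = m_new
    return acc / l

torch.manual_seed(0)
scores, values = torch.randn(16), torch.randn(16, 8)
exact = torch.softmax(scores, dim=0) @ values
print(torch.allclose(online_softmax_weighted_sum(scores, values), exact, atol=1e-6))  # True
```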
- FlashAttention (4/4) — FlashAttention in Code
A minimal Python/PyTorch implementation of FlashAttention's online softmax and tiling algorithm, with comparison to naive attention.
Training Fundamentals
4 posts
- Training Fundamentals (1/4) — AdamW: Why Every LLM Uses It
From vanilla SGD to Adam to AdamW — how adaptive learning rates and decoupled weight decay became the default optimizer for transformer training.
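Using it is one line; the "decoupled" part is that weight decay is applied directly to the weights rather than folded into Adam's gradient moments. Hyperparameters below are in the neighborhood of common LLM configs, not prescriptions:

```python
import torch

model = torch.nn.Linear(10, 10)

# Decoupled weight decay: each step also does w -= lr * wd * w, applied
# straight to the weights instead of through the adaptive gradient statistics.
opt = torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1)

loss = model(torch.randn(4, 10)).pow(2).mean()
loss.backward()
opt.step()
opt.zero_grad()
```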
- Training Fundamentals (2/4) — Learning Rate Scheduling
Step decay, exponential decay, cosine annealing, and the warmup + cosine pattern that became the standard for transformer training.
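The warmup + cosine pattern, as a small self-contained function (step counts and peak LR illustrative):

```python
import math

def lr_at(step, max_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to max_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

schedule = [lr_at(s, 3e-4, 100, 1000) for s in range(1000)]
print(max(schedule), schedule[-1])  # peaks at 3e-4, decays toward 0
```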
- Training Fundamentals (3/4) — Mixed Precision Training & Gradient Checkpointing
Two essential memory optimization techniques: reducing the size of stored values (FP16/BF16) and reducing the number of stored values (checkpointing).
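A minimal mixed-precision loop using PyTorch's autocast and GradScaler (assumes a CUDA device; checkpointing, the other half of the post, lives in torch.utils.checkpoint):

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = torch.cuda.amp.GradScaler()  # loss scaling guards FP16 grads from underflow

for _ in range(10):
    x = torch.randn(32, 1024, device="cuda")
    with torch.autocast("cuda", dtype=torch.float16):  # matmuls run in FP16
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()  # scale the loss up before backward
    scaler.step(opt)               # unscales grads; skips the step on inf/nan
    scaler.update()
    opt.zero_grad()
```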
- Training Fundamentals (4/4) — Scaling Laws: Kaplan, Chinchilla, and Beyond
How model performance relates to parameters, data, and compute — and why the industry shifted from 'make it bigger' to 'train it longer.'
Evaluation
12 posts
- LLM Evaluation (1/12) — Overview and the Benchmark Treadmill
Goodhart's Law, five phases of benchmark saturation, and why no single evaluation method has proven durably reliable.
- LLM Evaluation (2/12) — Benchmarks Landscape (1998–2026)
A complete reference table of the benchmarks that shaped LLM evaluation, organized by category.
- LLM Evaluation (3/12) — Methods Reference Table
Twelve evaluation paradigms, contamination detection methods, and diagnostic studies — each with strengths, weaknesses, and current status.
- LLM Evaluation (4/12) — Paradigms: Fine-Tuning, Few-Shot, Logit vs. Generative
From BERT-on-SQuAD to GPT-3's few-shot prompting, and the subtle but crucial gap between logit-based and generative evaluation.
- LLM Evaluation (5/12) — Contamination, Gaming, and the Evaluation Crisis
Three diagnostic studies — C-BOD, GSM1K, option-reordering — and the v1→v2 Open LLM Leaderboard collapse.
- LLM Evaluation (6/12) — How Model Releases Broke Evaluation
GPT-3, Chinchilla, ChatGPT, GPT-4, LLaMA, o1, DeepSeek-R1 — each release exposed a specific evaluation failure and catalyzed new methods.
- LLM Evaluation (7/12) — Chatbot Arena: Mechanics, Economics, Controversies
How LMSYS's crowdsourced leaderboard works, where Elo comes from, and why the Llama 4 incident exposed a structural conflict.
- LLM Evaluation (8/12) — The Null Model Attack on LLM-as-Judge
A model that outputs a single constant string — no reasoning, no task comprehension — achieved 86.5% win rate on AlpacaEval 2.0. How, and what it means.
- LLM Evaluation (9/12) — Safety: Red-Teaming, Multi-Turn Attacks, Intent Laundering
81% of safety benchmarks test predefined risks with fixed prompts. Multi-turn attacks hit 75% failure rates. Intent laundering hits 90–98%.
- LLM Evaluation (10/12) — Process-Based Evaluation, PRMs, and Faithfulness
When reasoning matters as much as the final answer: process reward models, the faithfulness problem, and why labeling steps is fundamentally hard.
- LLM Evaluation (11/12) — Fairness, HELM's Seven Dimensions, and Multilingual Gaps
HELM's multi-dimensional framework, the ReDial dialect study on AAVE, and why multilingual ≠ multicultural.
- LLM Evaluation (12/12) — Infrastructure, Leaderboards, and the Industry-Academia Divide
EleutherAI, HELM, OpenAI Evals, the Hugging Face Open LLM Leaderboard's rise and retirement, and where evaluation goes from here.
Post-Training
8 posts
- Post-Training (1/8) — Supervised Fine-Tuning (SFT)
How SFT transforms a base model into an instruction-following model using curated conversation data — the first step in the post-training pipeline.
- Post-Training (2/8) — The Reward Model
Training a model to score response quality using human preference data and the Bradley-Terry framework.
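The Bradley-Terry training objective in code, a sketch with random scores standing in for the reward model's outputs: maximize the log-probability that the human-preferred response outscores the other.

```python
import torch
import torch.nn.functional as F

# Bradley-Terry: P(chosen beats rejected) = sigmoid(r_chosen - r_rejected).
r_chosen = torch.randn(8, requires_grad=True)    # scalar rewards for preferred responses
r_rejected = torch.randn(8, requires_grad=True)  # rewards for the dispreferred ones

loss = -F.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()
print(loss.item())
```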
- Post-Training (3/8) — KL Divergence in RLHF
How KL divergence acts as a leash to prevent the policy model from drifting too far from the original SFT model.
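A sketch of one common per-token estimate of that leash (names illustrative): the gap between the log-probabilities the policy and the frozen reference assign to the sampled tokens, scaled by a coefficient beta and subtracted from the reward.

```python
import torch

logp_policy = torch.randn(16)  # log pi(token | context) under the current policy
logp_ref = torch.randn(16)     # same tokens scored by the frozen SFT reference
beta = 0.1

# Single-sample estimate of KL(pi || ref) at each generated token; drifting
# away from the reference costs reward in proportion to beta.
kl_per_token = logp_policy - logp_ref
penalized_reward = -beta * kl_per_token
print(kl_per_token.mean().item())
```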
- Post-Training (4/8) — PPO (Proximal Policy Optimization)
How PPO uses policy gradients, value networks, and clipping to safely update the language model using reward signals.
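The clipping piece, as a minimal sketch: importance-weight the advantage, clip the weight to [1-eps, 1+eps], and take the pessimistic of the two.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """PPO's clipped surrogate: one update can't move the policy too far."""
    ratio = torch.exp(logp_new - logp_old)              # importance weight
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()        # pessimistic of the two

logp_old = torch.randn(32)
logp_new = logp_old + 0.1 * torch.randn(32)
adv = torch.randn(32)
print(ppo_clip_loss(logp_new, logp_old, adv).item())
```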
- Post-Training (5/8) — DPO (Direct Preference Optimization)
How DPO simplifies RLHF by eliminating the reward model and RL loop, directly fine-tuning on preference data.
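The DPO objective itself, a minimal sketch over precomputed sequence log-probabilities: push the policy's preference margin, measured relative to the frozen reference, toward the human choice.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w_policy, logp_l_policy, logp_w_ref, logp_l_ref, beta=0.1):
    """DPO loss: no reward model, no RL loop, just this margin."""
    margin = beta * ((logp_w_policy - logp_w_ref) - (logp_l_policy - logp_l_ref))
    return -F.logsigmoid(margin).mean()

# Sequence log-probs for chosen (w) and rejected (l) responses.
lw_p, ll_p = torch.randn(8), torch.randn(8)  # under the policy
lw_r, ll_r = torch.randn(8), torch.randn(8)  # under the reference
print(dpo_loss(lw_p, ll_p, lw_r, ll_r).item())
```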
- Post-Training (6/8) — GRPO (Group Relative Policy Optimization)
How GRPO replaces PPO's value network with a group-mean baseline, enabling efficient RL with verifiable rewards.
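The baseline swap in code, a sketch: normalize each group's verifiable rewards by the group mean and standard deviation, no value network needed.

```python
import torch

# One prompt, a group of 8 sampled responses, each scored by a
# verifiable reward (e.g. 1.0 if the answer checks out, else 0.0).
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])

# The group statistics play the role of PPO's learned value baseline.
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
print(advantages)  # positive for above-group-average responses
```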
- Post-Training (7/8) — Reasoning Training (o1 / DeepSeek-R1)
How RL with verifiable rewards enables language models to discover reasoning strategies from scratch — the R1-Zero experiment and the hybrid R1 pipeline.
- Post-Training (8/8) — Constitutional AI and RLAIF
How AI feedback grounded in written principles replaces human preference labels — scaling preference learning but widening the proxy chain.