| Approach | How it works | Strengths | Weaknesses | Status |
|---|---|---|---|---|
| Logit-based | Compare the log-probabilities of each answer option with a single forward pass; no text generation (see the first sketch after the table) | Fast, deterministic; works well for base models | Fails for instruction-tuned models that generate formatted responses | Still used for base models |
| Generative | Model generates free text; the output is parsed to extract the answer (see the extraction sketch after the table) | Captures what the model actually does | Answer extraction can fail; slower | Default for instruction-tuned models since HF Leaderboard v2 |
| Multiple-choice | Select from predefined options A/B/C/D | Easy to score; standardized | Position bias of up to 15.2%; 25% random-guessing floor on four options | Widespread but increasingly questioned |
| Open-ended generation | Free-form text scored against ground truth or by a judge | Tests realistic usage; no guessing advantage | Harder to score; requires a judge or a verifiable answer | Growing, especially in math and code |
| Human preference (pairwise) | Users compare two outputs side by side | Captures real-world preferences; hard to contaminate | Verbosity and style bias; expensive; skews toward English and coding prompts | Gold standard via Chatbot Arena |
| LLM-as-judge | A strong model (e.g., GPT-4, Claude) evaluates outputs (see the judging sketch after the table) | Cheap (~$10/run); scalable | Position, verbosity, and self-preference bias; exploitable | Widely used but known to be flawed |
| Static benchmarks | Fixed test set, same questions every time | Reproducible; comparable across time | Saturate; leak into training data; get gamed | Being replaced by dynamic benchmarks |
| Dynamic benchmarks | Questions refreshed periodically | Contamination-resistant; stay discriminative | Scores hard to compare across refresh cycles | LiveBench, MathArena, LiveCodeBench |
| Private/held-out | Test problems kept secret | Immune to training-data contamination | Can’t be independently verified | FrontierMath (12 public samples only) |
| Process reward models | Score each reasoning step, not just the final answer (see the step-scoring sketch after the table) | Catches errors at specific steps | Step labeling is expensive; step boundaries are unclear | Active research frontier |
| Automated red-teaming | Adversarial models iteratively probe for safety failures | Finds vulnerabilities that fixed prompts miss | Hard to standardize; attacker quality varies | Growing rapidly (PAIR, Crescendo, GOAT) |
| Agent-based | Multi-step task completion in realistic environments | Tests real capability end-to-end | Scaffolding affects results; slow | SWE-bench, WebArena, OSWorld |
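
A few of these approaches are easier to grasp in code. First, logit-based scoring: the sketch below, assuming a Hugging Face causal LM (`gpt2` is only a stand-in) and a question format chosen for illustration, scores each option with one forward pass and never decodes a token.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in model; any causal LM with a compatible tokenizer works the same way.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def option_logprob(prompt: str, option: str) -> float:
    """Sum of log-probabilities the model assigns to `option` given `prompt`."""
    n_prompt = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits  # (1, seq_len, vocab_size)
    # Logits at position i predict token i + 1, so shift by one.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]
    # Keep only the continuation's positions (assumes the prompt prefix
    # tokenizes identically on its own; true for the space-prefixed options here).
    return sum(
        log_probs[i, targets[i]].item()
        for i in range(n_prompt - 1, full_ids.shape[1] - 1)
    )

prompt = "Q: What is the capital of France?\nA:"
options = [" Paris", " London", " Berlin", " Madrid"]
print(max(options, key=lambda o: option_logprob(prompt, o)))
```

Because nothing is sampled, the result is reproducible; the weakness noted in the table appears the moment an instruction-tuned model would rather reply "The answer is Paris." than continue the prompt with a bare option.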
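
The generative approach needs the opposite machinery: not scoring, but extraction. Below is a sketch of a lenient answer parser; the regexes and fallback order are assumptions, and real harnesses use more elaborate patterns, but the failure mode in the table is visible even at this scale.

```python
import re

def extract_choice(generation: str) -> str | None:
    """Pull the option letter a model commits to out of free-form text."""
    # Prefer an explicit "the answer is (X)" statement, taking the last one
    # so chain-of-thought revisions are respected.
    explicit = re.findall(r"answer\s*(?:is|:)\s*\(?([A-D])\)?",
                          generation, re.IGNORECASE)
    if explicit:
        return explicit[-1].upper()
    # Fall back to the last bare option letter anywhere in the output.
    bare = re.findall(r"\b([A-D])\b", generation)
    return bare[-1].upper() if bare else None

print(extract_choice("Let's check each option... so the answer is (C)."))  # C
print(extract_choice("Both B and D seem plausible."))  # D (fallback can misfire)
print(extract_choice("I am not sure."))  # None: extraction failed, scored as wrong
```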
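
For LLM-as-judge, the position bias in the table has a standard mitigation: judge each pair twice with the order swapped and only count agreement as a win. The sketch below is provider-agnostic; `call_judge` is a hypothetical stand-in for whatever chat-completion call returns the judge's raw verdict text, and the prompt template is illustrative.

```python
JUDGE_TEMPLATE = """Which response better answers the question? \
Reply with only "A" or "B".

Question: {question}

Response A: {a}

Response B: {b}"""

def parse_verdict(text: str) -> str | None:
    text = text.strip().upper()
    return text[0] if text[:1] in ("A", "B") else None

def judge_pair(call_judge, question: str, out1: str, out2: str) -> str:
    """Judge twice with candidates swapped; only agreement counts as a win."""
    first = parse_verdict(
        call_judge(JUDGE_TEMPLATE.format(question=question, a=out1, b=out2)))
    second = parse_verdict(
        call_judge(JUDGE_TEMPLATE.format(question=question, a=out2, b=out1)))
    if first == "A" and second == "B":
        return "model_1"
    if first == "B" and second == "A":
        return "model_2"
    # Disagreement across orderings is treated as position bias, not a win.
    return "tie"

# A dummy judge with pure position bias (always picks slot A) yields only ties:
print(judge_pair(lambda p: "A", "Why is the sky blue?", "out 1", "out 2"))  # tie
```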
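
Finally, process reward models change what gets scored rather than how. The sketch below shows only the aggregation side, assuming a trained step scorer exists (`score_step` is hypothetical); splitting on newlines is a crude stand-in for the unclear step boundaries the table mentions.

```python
from typing import Callable

def score_solution(solution: str, score_step: Callable[[str], float]) -> dict:
    """Score each reasoning step, then aggregate over the whole chain."""
    steps = [line.strip() for line in solution.splitlines() if line.strip()]
    step_scores = [score_step(step) for step in steps]
    return {
        "per_step": step_scores,
        # Weakest-link aggregation: one bad step sinks the chain, which is
        # exactly what final-answer scoring fails to catch.
        "chain_score": min(step_scores),
        "first_bad_step": next(
            (i for i, s in enumerate(step_scores) if s < 0.5), None),
    }

# Toy scorer, just to show the shape of the output:
result = score_solution(
    "2 + 2 = 4\n4 * 3 = 13\nSo the answer is 13.",
    score_step=lambda s: 0.1 if "13" in s else 0.9,
)
print(result)  # {'per_step': [0.9, 0.1, 0.1], 'chain_score': 0.1, 'first_bad_step': 1}
```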