LLM Evaluation (2/12) — Benchmarks Landscape (1998–2026)

A reference post. If the previous entry sketched the treadmill, this one lays out the individual sleepers underneath it — every benchmark that mattered, when it appeared, and whether it still does.


Complete benchmark reference

| Benchmark | Year | About | Format | Saturated? | Creator |
|---|---|---|---|---|---|
| MNIST | 1998 | Handwritten digit recognition | Image classification | ~2020s (20+ years) | Yann LeCun et al. |
| GLUE | 2018 | NLU: sentiment, entailment, etc. | Multi-task | ~2019 | NYU/UW/DeepMind |
| SQuAD 2.0 | 2018 | Reading comprehension with unanswerable questions | Extractive span | ~2020 | Stanford |
| ARC | 2018 | Grade-school science | 4-option MC | ~2023 (GPT-4, 96.3%) | Allen AI |
| SuperGLUE | 2019 | Harder NLU successor | Multi-task | ~2019 (~5 months) | Same consortium |
| HellaSwag | 2019 | Commonsense completion | 4-option MC | ~2023 (GPT-4, 95.3%) | UW (Zellers et al.) |
| ARC-AGI | 2019 | Abstract reasoning / fluid intelligence | Visual pattern completion | o3 reached 87.5% (late 2024) | François Chollet |
| MMLU | Sep 2020 | 57-subject knowledge test | 4-option MC, few-shot | ~2023 (GPT-4) | Hendrycks et al. |
| HumanEval | Jul 2021 | Function-level code generation | Unit tests (pass@k) | ~2024 (GPT-4o) | OpenAI |
| GSM8K | Oct 2021 | Grade-school math word problems | Open-ended, verify final number | ~2023 (GPT-4) | OpenAI |
| MATH | 2021 | Competition mathematics | Open-ended | Late 2024 (o1) | Hendrycks et al. |
| TruthfulQA | 2021 | Tendency to reproduce common falsehoods | Generation + MC | Not fully saturated | Oxford (Lin et al.) |
| BIG-Bench | 2022 | 204 diverse capability tasks | Mixed | ~2022 (~10 months, PaLM 540B) | Google + 132 institutions |
| HELM | Nov 2022 | Holistic multi-metric framework | Framework | N/A | Stanford CRFM |
| Chatbot Arena | May 2023 | Real-world human preference | Blind pairwise, Bradley-Terry/Elo (sketch below) | Unsaturable by design | LMSYS (UC Berkeley) |
| MT-Bench | Jun 2023 | Multi-turn conversation quality | LLM-as-judge (1–10) | Largely saturated | LMSYS |
| AlpacaEval | 2023 | Instruction-following quality | LLM-as-judge, pairwise | Gamed (null model) | Stanford |
| SWE-bench | Oct 2023 | Real-world software engineering | Patch + test suite | Still differentiating | Princeton |
| GPQA | Nov 2023 | PhD-level science | 4-option MC, “Google-proof” | Approaching ceiling | NYU (Rein et al.) |
| IFEval | Nov 2023 | Instruction following with verifiable constraints | Deterministic verification | ~2024 (~4 months) | Google |
| WebArena | 2023 | Web agent task completion | Multi-step on real websites | Not yet | CMU |
| GSM1K | 2024 | Fresh GSM8K-style problems (contamination control) | Same format as GSM8K | N/A (diagnostic) | Scale AI (Zhang et al.) |
| RULER | Apr 2024 | Long-context evaluation (13 tasks) | Retrieval/reasoning | Not yet | NVIDIA |
| Berkeley Function Calling Leaderboard | 2024 | Tool use / function calling | Serial, parallel, multi-turn | Not yet | UC Berkeley |
| LiveCodeBench | 2024 | Time-segmented coding problems | Code + tests | Refreshes periodically | UC Berkeley/MIT/Cornell (Jain et al.) |
| OSWorld | 2024 | OS-level computer-use agent | Multi-step desktop tasks | Not yet | HKU et al. (Xie et al.) |
| MMLU-Pro | Jun 2024 | Harder MMLU: 10-option, reasoning-heavy | 10-option MC | Not yet | TIGER-Lab (Waterloo) |
| LiveBench | Jun 2024 | Contamination-resistant, monthly refreshed | Objective ground truth | Refreshes monthly | White et al. (Abacus.AI/NYU) |
| SimpleQA | Nov 2024 | Short-form factuality (adversarially curated) | Open-ended vs ground truth | Not yet (~55% F1) | OpenAI |
| FrontierMath | Late 2024 | Unpublished research-level math | Open-ended | Mostly unsolved (~2%) | Epoch AI |
| Humanity’s Last Exam | Jan 2025 | 2,500 expert questions across all fields | Mixed | Not yet (~37% top) | CAIS & Scale AI |
| MathArena | 2025 | Fresh competition math as released | Open-ended | Dynamic, by design | ETH Zürich |
| SWE-bench Pro | 2025 | Extended SWE-bench: 1,865 problems, 41 repos, 123 languages | Patch + tests | Not yet | Scale AI |
| INCLUDE | 2025 | Multilingual with regional/cultural knowledge | MC from local exams, 44 languages | Not yet | EPFL et al. (ICLR 2025) |
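
The Chatbot Arena row deserves a quick unpacking: rankings come from blind pairwise battles aggregated under a Bradley-Terry model (the Arena started with online Elo updates and later moved to a full Bradley-Terry fit). A minimal sketch of the idea, using hypothetical model names and a plain gradient-ascent fit rather than the Arena's production pipeline:

```python
import math
from collections import defaultdict

def fit_bradley_terry(battles, iters=2000, lr=0.1):
    """Fit Bradley-Terry strengths by gradient ascent on the log-likelihood.

    battles: list of (winner, loser) pairs of model names.
    Returns log-strengths; leaderboards rescale these to an Elo-like scale.
    """
    strength = defaultdict(float)  # log-strength per model, initialized to 0
    for _ in range(iters):
        grad = defaultdict(float)
        for winner, loser in battles:
            # P(winner beats loser) = sigmoid(s_winner - s_loser)
            p = 1.0 / (1.0 + math.exp(strength[loser] - strength[winner]))
            grad[winner] += 1.0 - p  # gradient of log P w.r.t. winner strength
            grad[loser] -= 1.0 - p   # and w.r.t. loser strength
        for model, g in grad.items():
            strength[model] += lr * g / len(battles)
    return dict(strength)

# Hypothetical battle log: model_a wins 70 of 100 head-to-heads.
battles = [("model_a", "model_b")] * 70 + [("model_b", "model_a")] * 30
scores = fit_bradley_terry(battles)
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```

With a 70/30 win split the fitted gap converges toward logit(0.7) ≈ 0.85; the public leaderboard then rescales such strengths onto an Elo-like point scale.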

Categories

Knowledge & reasoning: MMLU, MMLU-Pro, GPQA, BIG-Bench, TruthfulQA, Humanity’s Last Exam.

Mathematics: GSM8K, MATH, FrontierMath, MathArena, GSM1K (diagnostic).
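
The "verify final number" format in the GSM8K row is worth making concrete: grading extracts the last number in the model's free-form answer and compares it to the gold value. A rough sketch of that convention, where the regex and tolerance are illustrative choices rather than the official grader:

```python
import re

def extract_final_number(text: str):
    """Pull the last number from a free-form answer, GSM8K-style."""
    matches = re.findall(r"-?\d[\d,]*\.?\d*", text)
    return float(matches[-1].replace(",", "")) if matches else None

def is_correct(model_output: str, gold: float) -> bool:
    pred = extract_final_number(model_output)
    return pred is not None and abs(pred - gold) < 1e-6

answer = "Four boxes of 12 eggs is 4 * 12 = 48 eggs. The answer is 48."
print(is_correct(answer, 48))  # True
```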

Code: HumanEval, SWE-bench, SWE-bench Pro, LiveCodeBench.
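
HumanEval popularized pass@k as the standard metric here: generate n samples per problem, count the c that pass the unit tests, and estimate the probability that at least one of k drawn samples passes. A sketch of the unbiased estimator from the HumanEval paper (Chen et al., 2021):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: samples generated per problem
    c: samples that pass all unit tests
    k: attempt budget
    """
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    # 1 - C(n-c, k) / C(n, k): chance that a random size-k subset
    # of the n samples contains at least one passing one
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(200, 42, 1))   # 0.21, equals c/n for k=1
print(pass_at_k(200, 42, 10))  # ~0.91
```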

Human preference: Chatbot Arena, MT-Bench, AlpacaEval.

Instruction following: IFEval.
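
What makes IFEval's constraints "verifiable" is that plain code can check them, with no judge model in the loop. A toy verifier in that spirit; the three checks are illustrative stand-ins, not items from IFEval's actual instruction registry:

```python
def verify_constraints(response: str) -> dict:
    """Deterministic checks in the spirit of IFEval's verifiable instructions."""
    return {
        "at_least_50_words": len(response.split()) >= 50,
        "mentions_keyword": "benchmark" in response.lower(),
        "exactly_3_bullets": sum(line.lstrip().startswith("- ")
                                 for line in response.splitlines()) == 3,
    }

response = "- one\n- two\n- three\n" + "benchmark " * 50
print(verify_constraints(response))  # all three checks pass
```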

Long context: RULER.
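
RULER's 13 tasks parameterize and extend the classic needle-in-a-haystack test: plant a fact in a long synthetic context and ask for it back. A minimal generator for the simplest such item; the filler sentence and "magic number" template are illustrative, and RULER's actual task suite goes well beyond retrieval:

```python
import random

def make_niah_example(n_filler: int, key: str, value: str):
    """Build one synthetic needle-in-a-haystack item: a fact buried
    at a random position inside repetitive filler text."""
    sentences = ["The grass is green."] * n_filler
    needle = f"The special magic number for {key} is {value}."
    sentences.insert(random.randrange(len(sentences) + 1), needle)
    context = " ".join(sentences)
    question = f"What is the special magic number for {key}?"
    return context, question, value

ctx, question, gold = make_niah_example(5000, "the lighthouse", "7481")
print(question, "->", gold)  # graded by checking the value appears in the answer
```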

Agent / tool use: WebArena, OSWorld, Berkeley Function Calling Leaderboard.
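
Function-calling benchmarks like BFCL grade whether the emitted call names the right function with the right arguments. A deliberately simplified exact-match check; the function name and arguments here are hypothetical, and BFCL's real grader does AST-level comparison with tolerance for type and optional-parameter variation:

```python
import json

def call_matches(emitted_json: str, expected_name: str, expected_args: dict) -> bool:
    """Check one emitted function call against the expected call."""
    try:
        call = json.loads(emitted_json)
    except json.JSONDecodeError:
        return False  # malformed output counts as a miss
    return call.get("name") == expected_name and call.get("arguments") == expected_args

emitted = '{"name": "get_weather", "arguments": {"city": "Zurich", "unit": "celsius"}}'
print(call_matches(emitted, "get_weather", {"city": "Zurich", "unit": "celsius"}))  # True
```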

Safety & factuality: TruthfulQA, SimpleQA.

Multilingual: INCLUDE.

Dynamic / anti-contamination: LiveBench, LiveCodeBench, MathArena.
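
These three share one mechanism: tag every problem with its release date and score a model only on problems that postdate its training cutoff, so memorization cannot masquerade as capability. A minimal sketch with hypothetical records:

```python
from datetime import date

# Hypothetical records: each problem carries its public release date.
problems = [
    {"id": "p1", "released": date(2023, 11, 5)},
    {"id": "p2", "released": date(2024, 3, 12)},
    {"id": "p3", "released": date(2024, 8, 30)},
]

def post_cutoff(problems, cutoff: date):
    """Keep only problems released after the model's training cutoff,
    so memorized solutions cannot inflate the score."""
    return [p for p in problems if p["released"] > cutoff]

print([p["id"] for p in post_cutoff(problems, date(2024, 1, 1))])  # ['p2', 'p3']
```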

Frameworks: HELM (multi-metric), Open LLM Leaderboard (aggregator).