LLM Evaluation (3/12) — Methods Reference Table

A companion reference to the benchmarks post. If you want to know how a benchmark is scored, the method matters as much as the questions.


Evaluation paradigms

| Method | How it works | Strengths | Weaknesses | Status (2026) |
|---|---|---|---|---|
| Logit-based | Compare log-probabilities of each answer option via a forward pass; no text generation (sketch below) | Fast, deterministic; works well for base models | Fails for instruction-tuned models that generate formatted responses | Still used for base models |
| Generative | Model generates text; the output is parsed to extract an answer (sketch below) | Captures what the model actually does | Answer extraction can fail; slower | Default for instruction-tuned models since HF Leaderboard v2 |
| Multiple-choice | Select from predefined options A/B/C/D | Easy to score; standardized | Position bias up to 15.2%; random-guessing baseline | Widespread but increasingly questioned |
| Open-ended generation | Free-form text scored against ground truth or by a judge | Tests realistic usage; no guessing advantage | Harder to score; needs a judge or a verifiable answer | Growing, especially in math/code |
| Human preference (pairwise) | Users compare two outputs side by side | Captures real-world preferences; hard to contaminate | Verbosity/style bias; expensive; English/coding heavy | Gold standard via Chatbot Arena |
| LLM-as-judge | A strong model (GPT-4, Claude) evaluates outputs | Cheap (~$10/run); scalable | Position, verbosity, and self-preference biases; exploitable | Widely used but known to be flawed |
| Static benchmarks | Fixed test set, same questions every time | Reproducible; comparable across time | Saturate, contaminate, and get gamed | Being replaced by dynamic benchmarks |
| Dynamic benchmarks | Questions refreshed periodically | Contamination-resistant; stay discriminative | Hard to compare across refresh periods | LiveBench, MathArena, LiveCodeBench |
| Private/held-out | Test problems kept secret | Immune to training-data contamination | Can’t be independently verified | FrontierMath (12 public samples only) |
| Process reward models | Evaluate each reasoning step, not just the final answer | Catch errors at specific steps | Labeling is expensive; step boundaries are unclear | Active research frontier |
| Automated red-teaming | Adversarial models iteratively probe safety | Finds vulnerabilities that fixed prompts miss | Hard to standardize; attacker quality varies | Growing rapidly (PAIR, Crescendo, GOAT) |
| Agent-based | Multi-step task completion in realistic environments | Tests real capability end to end | Scaffolding affects results; slow | SWE-bench, WebArena, OSWorld |
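
To make the logit-based row concrete, here is a minimal sketch using the Hugging Face `transformers` API. The `gpt2` checkpoint and the toy arithmetic question are placeholders, and real harnesses add refinements such as length-normalizing scores across options:

```python
# Minimal sketch of logit-based multiple-choice scoring with the Hugging
# Face transformers API. "gpt2" and the toy question are placeholders;
# real harnesses add details such as length normalization across options.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def option_logprob(prompt: str, option: str) -> float:
    """Sum the log-probabilities of the option's tokens given the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    option_ids = tokenizer(option, return_tensors="pt",
                           add_special_tokens=False).input_ids
    input_ids = torch.cat([prompt_ids, option_ids], dim=-1)
    with torch.no_grad():
        logprobs = torch.log_softmax(model(input_ids).logits, dim=-1)
    total, offset = 0.0, prompt_ids.shape[-1]
    for i in range(option_ids.shape[-1]):
        # Logits at position t predict the token at position t + 1.
        total += logprobs[0, offset + i - 1, option_ids[0, i]].item()
    return total

prompt = "Question: What is 2 + 2?\nAnswer:"
scores = {opt: option_logprob(prompt, opt) for opt in [" 3", " 4", " 5"]}
print(max(scores, key=scores.get))  # highest-scoring option wins
```

Note that the model never generates a token; this is why the method is fast and deterministic, and also why it says nothing about how an instruction-tuned model would actually format a reply.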
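
The generative row's weak point is answer extraction, so here is a hedged sketch of a regex-based extractor. The patterns are illustrative, not any particular harness's canonical parser:

```python
# Sketch of the answer-extraction step in generative evaluation. These
# regex patterns are illustrative, not any harness's canonical parser;
# extraction failures are typically scored as incorrect.
import re

def extract_choice(response: str) -> str | None:
    """Try to pull a single option letter (A-D) out of free-form text."""
    # Preferred: an explicit "Answer: X" / "answer is X" statement.
    m = re.search(r"[Aa]nswer\s*(?:is|:)?\s*\(?([A-D])\)?", response)
    if m:
        return m.group(1)
    # Fallback: the first standalone option letter, e.g. "(B)" or "B.".
    m = re.search(r"\(?\b([A-D])\b\)?", response)
    if m:
        return m.group(1)
    return None  # extraction failed

print(extract_choice("The answer is (C), because..."))  # -> C
print(extract_choice("I'd go with B."))                 # -> B
```

Responses such as "Both B and C seem plausible" still defeat a parser like this, which is exactly the failure mode the Weaknesses column refers to.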

Contamination detection

| Method | How it works | Effectiveness |
|---|---|---|
| N-gram overlap | Check for exact text overlap between benchmark and training data (sketch below) | Simple; easily evaded by paraphrasing |
| Membership inference | Test whether the model “recognizes” specific benchmark items | Moderate; high false-positive rate |
| Perplexity-based | Check for unusually low perplexity on benchmark items | Noisy, but detects memorization |
| TS-Guessing | Ask the model to guess a missing answer option | GPT-4 achieved 57% exact match on MMLU |
| Kernel Divergence Score | Fine-tune and observe differential effects on seen vs. unseen data | Principled but computationally expensive |
| Inference-time decontamination | Rephrase benchmark questions at test time | Reduced inflated accuracy by 22.9% on GSM8K and 19.0% on MMLU |
| Fresh parallel benchmarks (GSM1K) | Create new questions of the same type and difficulty | Most definitive; some models dropped ~10% |
| C-BOD perturbation | Systematically distort prompts while preserving semantics | 20/26 models showed significant drops |
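
As a concrete anchor for the first row, below is a minimal n-gram overlap check in the spirit of GPT-3-style 13-gram decontamination. The item and corpus are toy stand-ins for real benchmark and training data:

```python
# Minimal n-gram overlap check (first row above). The 13-gram default
# mirrors GPT-3-style decontamination; the toy corpus is a stand-in
# for real training data.
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(item: str, training_docs: list[str], n: int = 13) -> bool:
    """Flag a benchmark item if any n-gram also appears in training data."""
    item_grams = ngrams(item, n)
    return any(item_grams & ngrams(doc, n) for doc in training_docs)

item = "What is the capital of France? A) Paris B) Rome C) Berlin D) Madrid"
docs = ["trivia dump: what is the capital of france? a) paris b) rome ..."]
print(is_contaminated(item, docs, n=8))  # True: verbatim 8-gram overlap
```

Because the match is exact, paraphrasing a single word breaks every overlapping n-gram, which is precisely the evasion the table flags.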

Diagnostic studies

| Study | What it tests | Key finding |
|---|---|---|
| C-BOD | Prompt-pattern dependence | 20/26 models drop on semantically identical rephrasings |
| GSM1K | Q&A memorization vs. math reasoning | Phi and Mistral drop ~10% on fresh problems |
| Option reordering | Positional/selection bias in multiple choice (probe sketch below) | Swings of up to 15.2%; can invert rankings |
| Null model attack | LLM-as-judge exploitability | 86.5% win rate on AlpacaEval 2.0 from a constant non-response |
| Intent laundering | Safety reliance on trigger words | ASR jumps from 5.38% to 86.79% when triggers are removed |
| CoT faithfulness | Whether reasoning traces reflect actual computation | 25–39% faithfulness; unfaithful chains are longer |
| ReDial dialect study | Fairness across English dialects | Almost all models degrade significantly on AAVE |
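
To illustrate the option-reordering row, here is a sketch of a consistency probe. `ask_model` is a hypothetical stand-in for whatever inference call you use; everything else is self-contained:

```python
# Sketch of an option-reordering probe for positional bias. `ask_model`
# is a placeholder for your inference call and must return the letter
# the model picked ("A".."D").
from itertools import permutations

def positional_consistency(question: str, options: list[str], ask_model) -> float:
    """Share of orderings on which the model picks its modal answer text."""
    picks = []
    for perm in permutations(options):
        letters = "ABCD"[: len(perm)]
        prompt = question + "\n" + "\n".join(
            f"{letter}) {opt}" for letter, opt in zip(letters, perm)
        )
        letter = ask_model(prompt)
        picks.append(perm[letters.index(letter)])  # map letter back to content
    # 1.0 = position-insensitive; lower values indicate selection bias.
    return picks.count(max(set(picks), key=picks.count)) / len(picks)
```

A leaderboard built on one fixed ordering can invert once consistency drops, which is the ranking-inversion effect the row above describes.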