The Commonplace

Leaderboard accuracy for frontier LLMs is materially inflated: about 13.8% of MMLU items show evidence of training-data exposure, boosting reported scores (particularly in STEM/professional subjects) and creating memorization fingerprints that can mislead buyers, investors and benchmark-driven procurement.

Are Large Language Models Truly Smarter Than Humans?
Eshwar Reddy M, Sourav Karmakar · March 17, 2026
arXiv · descriptive · medium evidence · relevance 8/10
Public leaderboard scores overstate LLM capability because substantial portions of MMLU questions appear in or are memorized from training data, inflating reported accuracy—especially in STEM and professional domains—and producing model-specific memorization signatures that skew comparisons and valuation.

Public leaderboards increasingly suggest that large language models (LLMs) surpass human experts on benchmarks spanning academic knowledge, law, and programming. Yet most benchmarks are fully public, their questions widely mirrored across the internet, creating systematic risk that models were trained on the very data used to evaluate them. This paper presents three complementary experiments forming a rigorous multi-method contamination audit of six frontier LLMs: GPT-4o, GPT-4o-mini, DeepSeek-R1, DeepSeek-V3, Llama-3.3-70B, and Qwen3-235B. Experiment 1 applies a lexical contamination detection pipeline to 513 MMLU questions across all 57 subjects, finding an overall contamination rate of 13.8% (18.1% in STEM, up to 66.7% in Philosophy) and estimated performance gains of +0.030 to +0.054 accuracy points by category. Experiment 2 applies a paraphrase and indirect-reference diagnostic to 100 MMLU questions, finding accuracy drops by an average of 7.0 percentage points under indirect reference, rising to 19.8 pp in both Law and Ethics. Experiment 3 applies TS-Guessing behavioral probes to all 513 questions and all six models, finding that 72.5% trigger memorization signals far above chance, with DeepSeek-R1 displaying a distributed memorization signature (76.6% partial reconstruction, 0% verbatim recall) that explains its anomalous Experiment 2 profile. All three experiments converge on the same contamination ranking: STEM > Professional > Social Sciences > Humanities.

Summary

Main Finding

Leaderboards overstate frontier LLM “intelligence” because a large and structurally non‑uniform fraction of benchmark items have been seen (exactly or in paraphrase) during model training. Using three complementary, reproducible methods on six frontier models (GPT-4o, GPT-4o‑mini, DeepSeek‑R1, DeepSeek‑V3, Llama‑3.3‑70B, Qwen3‑235B) and MMLU items, the authors show (1) lexical/web contamination is widespread (Experiment 1), (2) model accuracy drops when surface wording is changed (Experiment 2), and (3) behavioral probes reveal internal memorization signals in most items (Experiment 3). The methods converge on the same subject ranking (STEM most contaminated), and the aggregate evidence implies that much of reported benchmark performance reflects memorization/pattern completion rather than robust generalization.
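
As a rough illustration of method (2), the sketch below computes a per-subject accuracy drop, in percentage points, between original and reworded items. The record layout `(subject, correct_original, correct_variant)` and the function name are assumptions made for this example; the paper's own scoring code is not reproduced here.

```python
from collections import defaultdict

def accuracy_drop_pp(results):
    """Per-subject accuracy drop (original minus variant) in percentage points.

    `results` is an iterable of (subject, correct_original, correct_variant)
    tuples; this layout is a hypothetical one chosen for the sketch.
    """
    totals = defaultdict(lambda: [0, 0, 0])  # [n, correct original, correct variant]
    for subject, ok_orig, ok_var in results:
        t = totals[subject]
        t[0] += 1
        t[1] += int(ok_orig)
        t[2] += int(ok_var)
    return {
        subject: 100.0 * (c_orig - c_var) / n
        for subject, (n, c_orig, c_var) in totals.items()
    }

# Toy records, not data from the paper.
toy = [
    ("law", True, False),
    ("law", True, True),
    ("physics", True, True),
    ("physics", False, False),
]
drops = accuracy_drop_pp(toy)
print(drops)
```

A positive value means the model did worse on the reworded variant than on the original wording.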

Key Points

  • Experiment 1 (lexical/web scan): 513 MMLU questions sampled (9 per 57 subjects). Using an 8‑gram overlap ≥30% AND the correct answer present in web snippets, overall contamination = 13.8%. STEM highest (18.1%); some subjects much higher (Philosophy 66.7%, Anatomy 55.6%). Estimated Performance Gain (EPG) from contamination ≈ +0.040 overall (per ConTAM), up to +0.054 in STEM.
  • Experiment 2 (paraphrase / indirect‑reference diagnostic): 100 MMLU items across six subjects; for each item authors generated paraphrased and indirect‑reference variants and tested all six models at temperature 0. Average accuracy drop when moving to indirect references = 7.0 percentage points; Law and Ethics each drop ≈ 19.8 pp (largest declines), matching high contamination in Experiment 1. Some models (e.g., GPT‑4o) are robust to surface change; others (Qwen3, Llama) show big drops.
  • Experiment 3 (TS‑Guessing behavioral probe): TS‑Guessing applied to all 513 sampled MMLU items × 6 models with two tasks:
    • Option Mask (OM): ask model to reconstruct a masked wrong choice; primary metric is partial match (≥50% token overlap).
    • Word Mask (WM): blank a content word; measure exact reconstruction.
    • Flagging rule: OM partial ≥ 0.50 or WM exact = 1 → question flagged. Results: average combined flagging = 72.5% (far above random baselines). Average OM partial ≈ 45.8%; OM exact ≈ 14.2%; WM exact ≈ 47.4%. STEM again ranks highest under behavioral probe (≈55.9%).
  • DeepSeek‑R1 anomaly: very low exact recall but very high OM partial (76.6%) and zero WM exact → “distributed memorization” signature (semantic/conceptual storage without verbatim surface retention). This explains its low original accuracy combined with little sensitivity to surface change.
  • Multi‑method convergence: web detection, paraphrase sensitivity, and internal behavioral reconstruction produce consistent subject rankings and mutually reinforcing evidence that contamination is pervasive and heterogeneous across domains.
  • Practical interpretation: reported high benchmark scores can be materially inflated by memorization/exposure; removing contaminated items reduces reported accuracy and narrows claimed gaps between models and humans.
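
The Experiment 3 flagging rule described above (OM partial ≥ 0.50 or WM exact = 1) can be sketched as follows. The summary does not say how token overlap is computed, so this sketch assumes lowercased whitespace tokens and measures the share of the masked option's tokens that reappear in the model's guess. The case-insensitive word comparison and all function names are likewise assumptions.

```python
def token_overlap(reference: str, guess: str) -> float:
    """Share of reference tokens that also occur in the guess (assumed definition)."""
    ref_tokens = reference.lower().split()
    if not ref_tokens:
        return 0.0
    guess_tokens = set(guess.lower().split())
    hits = sum(1 for tok in ref_tokens if tok in guess_tokens)
    return hits / len(ref_tokens)

def is_flagged(masked_option: str, om_guess: str,
               masked_word: str, wm_guess: str,
               threshold: float = 0.50) -> bool:
    """Flag a question if OM partial >= threshold OR the masked word is recovered."""
    om_partial = token_overlap(masked_option, om_guess) >= threshold
    # Case-insensitive comparison is an assumption of this sketch.
    wm_exact = masked_word.strip().lower() == wm_guess.strip().lower()
    return om_partial or wm_exact

# Toy probes, not items from the paper.
print(is_flagged("the mitochondrial inner membrane", "inner membrane of the cell",
                 "photosynthesis", "respiration"))   # OM partial match (3 of 4 tokens)
print(is_flagged("increase in entropy", "a decrease of pressure",
                 "entropy", "Entropy"))              # WM exact match
print(is_flagged("increase in entropy", "a decrease of pressure",
                 "entropy", "energy"))               # neither signal
```

Because the rule is a logical OR, either signal alone is enough to flag an item. This is consistent with the combined flag rate (72.5%) sitting well above each individual metric.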

Data & Methods

  • Datasets and sampling
    • MMLU test split sampled: 513 questions (9 per 57 subjects) for Experiments 1 & 3; separate 100‑item sample (≈17 per subject across 6 subjects) for Experiment 2. Random seed 42; overlap between E1 and E2 avoided.
  • Experiment 1 — Lexical contamination detection
    • Query method: first 150 characters of question issued to Tavily web search; top‑5 result snippets collected.
    • Contamination flag conditions: (a) fraction of question 8‑grams found in combined snippets > 0.30; AND (b) correct answer text present verbatim in snippets.
    • EPG computed: acc × contamination_rate × 0.4 per ConTAM (Singh et al. 2024).
    • Conservative lower‑bound estimate (web index not identical to pretraining corpora; excludes paraphrases).
  • Experiment 2 — Paraphrase and indirect‑reference diagnostic
    • For each sampled question authors generated two variants using GPT‑4o:
      • Paraphrase: different wording, identical knowledge requirement and correct answer.
      • Indirect reference: key entities replaced by descriptions/associations.
    • All six models evaluated on original, paraphrase, indirect at temperature 0.
    • Main metric: accuracy change (original → paraphrase/indirect); subject‑level aggregation.
  • Experiment 3 — TS‑Guessing behavioral probe
    • Two tasks for each question/model pair (513 × 6 = 3,078 probes per task; 6,156 probes in total):
      • Option Mask (OM): mask one wrong answer choice; model told which option is correct and asked to reconstruct masked wrong option. Report OM exact and OM partial (≥50% token overlap); partial is primary.
      • Word Mask (WM): blank a long content word in stem; model fills in missing word; report exact matches.
    • Flagging rule: OM partial ≥0.50 OR WM exact = 1 → question flagged as memorized.
    • Random baselines: OM exact ≈ 0%, OM partial ≈ 5% (empirical), WM exact ≈ 0% given the size of the vocabulary.
  • Models tested: GPT‑4o, GPT‑4o‑mini, DeepSeek‑R1, DeepSeek‑V3, Llama‑3.3‑70B, Qwen3‑235B. All probes run via public APIs; procedures reproducible from open benchmarks and API calls.
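
The two Experiment 1 flag conditions listed above can be illustrated with a minimal sketch. It assumes word-level 8-grams over lowercased, whitespace-normalized text, and takes the search snippets as a plain list of strings. The web search step itself is left out, and the helper names are hypothetical.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (normalization is an assumption here)."""
    return " ".join(text.lower().split())

def word_ngrams(tokens, n=8):
    """All contiguous word n-grams, each joined back into a string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def is_contaminated(question, answer, snippets, n=8, threshold=0.30):
    """Flag if (a) 8-gram overlap > threshold AND (b) the answer text is in the snippets."""
    combined = f" {normalize(' '.join(snippets))} "
    grams = word_ngrams(normalize(question).split(), n)
    if not grams:
        return False  # question has fewer than n words, so no n-grams to match
    # Pad with spaces so an n-gram only matches on whole-word boundaries.
    found = sum(1 for g in grams if f" {g} " in combined)
    overlap = found / len(grams)
    answer_present = normalize(answer) in combined
    return overlap > threshold and answer_present

# Toy example, not an item from the paper.
question = ("Which organelle is primarily responsible for producing ATP "
            "in eukaryotic cells during aerobic respiration?")
answer = "the mitochondria"
mirrored = ["Biology quiz. " + question + " Answer: the mitochondria."]
no_answer = [question + " See the answer key."]
unrelated = ["The mitochondria is often called the powerhouse of the cell."]

print(is_contaminated(question, answer, mirrored))
print(is_contaminated(question, answer, no_answer))
print(is_contaminated(question, answer, unrelated))
```

Requiring both conditions keeps the flag conservative: a page that quotes the question without the answer, or mentions the answer without the question, is not counted.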

Implications for AI Economics

  • Measurement and valuation of capabilities
    • Benchmarks are a primary signal used by investors, customers, and markets to value model capabilities. Systematic contamination and memorization mean benchmark scores can materially overstate true generalization. Economic valuations, commercial contracts, and go‑to‑market claims that rely on unadjusted benchmark numbers risk mispricing and misallocation of capital.
    • Recommendation: discount headline benchmark performance by contamination‑adjusted EPGs or require reporting of decontaminated benchmarks and robustness measures when valuing models.
  • Incentives and market structure
    • Leaderboard incentives encourage “benchmaxxing” (including evaluation items in training/fine‑tuning or overfitting to test distributions). This creates a moral‑hazard problem: providers can boost scores cheaply without delivering real capability improvements, distorting competition and reducing welfare from R&D spend.
    • Policy or procurement that rewards clean, demonstrable generalization (e.g., requiring contamination audits) would shift incentives toward genuine capability improvements.
  • Deployment, liability, and sectoral risk
    • Sectors relying on high‑stakes performance (law, medicine, finance) are particularly affected: the paper shows Law and Ethics were most surface‑sensitive and contaminated. Claims that models “match experts” in these domains should be treated skeptically unless validated on contamination‑free or adversarially paraphrased testbeds. Misplaced trust can generate economic losses, liability exposure, and reputational damage.
  • Competition and data transparency
    • Closed‑model releases limit the use of retrieval‑based contamination detection; behavioral probes (like TS‑Guessing) are necessary and can be standardized. Market participants and regulators should require or incentivize disclosure of training data provenance, contamination audits, and independent robustness testing as part of product certification.
  • Labor markets and substitution claims
    • If benchmark outperformance is driven by memorization/exposure rather than transferable reasoning, claims about substitutability of human experts are overstated. Economic models forecasting labor displacement or productivity gains should incorporate a contamination‑adjusted estimate of real generalization capacity and the costs of building and validating contamination‑free evaluation.
  • Regulatory and procurement implications
    • Governments and large buyers should require contamination audits (combining web‑index scans, paraphrase diagnostics, and behavioral probes), mandate reporting of EPG or similar impact metrics, and favor models with demonstrated generalization on decontaminated benchmarks.
  • Practical steps for economic actors
    • Investors, procurement officers, and corporate buyers: demand contamination‑adjusted metrics, require adversarial/paraphrase robustness tests, and treat high leaderboard scores as preliminary signals, not sole evidence of capability.
    • Model developers: publish contamination audits, maintain decontaminated benchmarks (or variants), and invest in robustness testing to build credible claims—this can become a competitive differentiator.
    • Regulators: standardize contamination detection reporting, require independent audits for regulated applications, and include benchmark‑cleanliness as a factor in certification.
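
One way to read the recommendation to discount headline scores by EPG is sketched below, using the EPG formula given under Data & Methods (acc × contamination_rate × 0.4). Subtracting EPG from the raw score is an assumption of this sketch rather than a procedure stated in the summary, and the 0.75 accuracy is an illustrative input, not a reported result.

```python
def estimated_performance_gain(accuracy: float, contamination_rate: float,
                               factor: float = 0.4) -> float:
    """EPG = acc * contamination_rate * 0.4, as stated under Data & Methods."""
    return accuracy * contamination_rate * factor

def contamination_adjusted(accuracy: float, contamination_rate: float,
                           factor: float = 0.4) -> float:
    """Raw accuracy minus EPG (the subtraction is this sketch's assumption)."""
    return accuracy - estimated_performance_gain(accuracy, contamination_rate, factor)

raw_accuracy = 0.75   # hypothetical headline accuracy
stem_rate = 0.181     # STEM contamination rate reported in Experiment 1

epg = estimated_performance_gain(raw_accuracy, stem_rate)
adjusted = contamination_adjusted(raw_accuracy, stem_rate)
print(f"EPG = {epg:.4f}, adjusted accuracy = {adjusted:.4f}")
```

Reporting the raw and adjusted figures side by side lets a buyer see how much of a headline score depends on the contamination estimate.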

Short takeaway: public benchmark wins are necessary but not sufficient evidence of economically valuable AI capability. Contamination is widespread, uneven, and can materially inflate perceived competence—economic stakeholders should factor contamination adjustments and robustness testing into valuation, procurement, and policy.

Assessment

Paper Type: descriptive

Evidence Strength: medium. Findings are supported by convergent signals from three complementary methods applied across six models, with the lexical and behavioral methods covering the full 513-item MMLU sample, which strengthens internal validity. However, incomplete provenance of proprietary training data, limits of web/search coverage, focus on a single benchmark (MMLU), and some small-sample diagnostics (100-question paraphrase subset) mean contamination estimates and uplift calculations remain approximate and could be biased or incomplete.

Methods Rigor: high. The study triangulates lexical, semantic (paraphrase), and behavioral tests, applies systematic statistical thresholds (TS-Guessing) across all items and models, and quantifies category-level uplift, demonstrating a careful, multi-pronged methodology. Rigor is nevertheless constrained by external factors the authors acknowledge (undisclosed proprietary corpora, evolving model versions, and imperfect web/search coverage), which affect the interpretability of absolute contamination rates.

Sample: Evaluation of six frontier LLMs (GPT-4o, GPT-4o-mini, DeepSeek-R1, DeepSeek-V3, Llama-3.3-70B, Qwen3-235B) on the MMLU benchmark (513 questions across 57 subjects), with a 100-question paraphrase/indirect-reference subset and exhaustive behavioral TS-Guessing probes applied to all 513 items × six models; lexical contamination searches used training-era public corpora and open-web indexes where available.

Themes: governance, adoption

Identification: No causal identification per se; uses a multi-method contamination audit combining (1) lexical-matching searches across training-era public corpora and the open web to flag literal/near-literal exposures, (2) paraphrase/indirect-reference diagnostics (100-question subset) to detect reliance on surface matches, and (3) behavioral memorization probes (TS-Guessing) to detect verbatim and distributed recall; estimated accuracy uplift is derived by simulating model behavior on items flagged as exposed.

Generalizability:
  • Results are specific to the MMLU benchmark and may not generalize to other benchmarks or real-world tasks.
  • Only six (contemporary) models were tested; findings may differ for other or newer models and training regimens.
  • Training-data provenance is often incomplete or proprietary, so lexical searches may miss exposures (false negatives) or misattribute coincidences (false positives).
  • Paraphrase diagnostics used a 100-question subset, limiting precision of some category estimates.
  • Language, regional content, and non-public data sources (e.g., private datasets) were not fully covered, reducing applicability across languages/domains.
  • Contamination is time-varying; model updates and changing web corpora mean results may not hold over time.

Claims (17)

  • Public leaderboards overstate modern LLM capabilities because substantial portions of benchmark QA items appear in (or are memorized from) training data, inflating measured accuracy.
    • Direction: negative · Confidence: high
    • Outcome (Output Quality): inflation of measured benchmark accuracy / overstatement of model capability
    • Details: n=513; 0.18
  • Overall lexical contamination: 13.8% of MMLU items show evidence of exposure in training data.
    • Direction: negative · Confidence: high
    • Outcome (Output Quality): contamination prevalence (fraction of benchmark items with lexical matches)
    • Details: n=513; 13.8%; 0.18
  • STEM items show higher lexical contamination (18.1%) relative to the overall rate.
    • Direction: negative · Confidence: high
    • Outcome (Output Quality): category-level contamination prevalence (STEM)
    • Details: 18.1%; 0.18
  • Philosophy category exhibited the maximum observed lexical contamination (up to 66.7%).
    • Direction: negative · Confidence: medium
    • Outcome (Output Quality): category-level contamination prevalence (Philosophy)
    • Details: 66.7%; 0.11
  • Estimated performance uplift from identified contamination ranges from +0.030 to +0.054 absolute accuracy points by category.
    • Direction: positive · Confidence: medium
    • Outcome (Output Quality): estimated accuracy uplift (absolute accuracy points) attributable to contamination
    • Details: +0.030 to +0.054 absolute accuracy points (by category); 0.11
  • Paraphrase / indirect-reference diagnostic: on a 100-question subset, average accuracy dropped by 7.0 percentage points under indirect referencing.
    • Direction: negative · Confidence: high
    • Outcome (Output Quality): mean accuracy drop (percentage points) under paraphrase/indirect prompts
    • Details: n=100; 7.0 percentage points (average accuracy drop); 0.18
  • Law and Ethics questions showed the largest paraphrase-induced accuracy drops (19.8 percentage points).
    • Direction: negative · Confidence: medium
    • Outcome (Output Quality): category-specific accuracy drop (percentage points) under paraphrase
    • Details: 19.8 percentage points (Law & Ethics category drop); 0.11
  • Behavioral memorization probe (TS‑Guessing) signaled memorization above chance for 72.5% of prompts across all models and items.
    • Direction: negative · Confidence: high
    • Outcome (Output Quality): fraction of prompt-model pairs with statistically significant memorization signals
    • Details: n=3078; 72.5% of prompt-model pairs flagged for memorization above chance; 0.18
  • DeepSeek-R1 exhibits a distributed memorization signature: 76.6% partial reconstruction rate but 0% verbatim recall on the TS‑Guessing probe.
    • Direction: mixed · Confidence: medium
    • Outcome (Output Quality): partial reconstruction rate and verbatim recall rate (per-model)
    • Details: n=513; 76.6% partial reconstruction rate; 0% verbatim recall (TS-Guessing); 0.11
  • Contamination ranking is consistent across methods: STEM > Professional domains > Social Sciences > Humanities.
    • Direction: negative · Confidence: medium
    • Outcome (Output Quality): relative contamination ordering across subject domains
    • Details: n=513; relative ordering: STEM > Professional domains > Social Sciences > Humanities; 0.11
  • Convergence of the three complementary methods (lexical, paraphrase, behavioral) strengthens confidence that contamination is real and systematically inflates scores.
    • Direction: positive · Confidence: high
    • Outcome (Output Quality): robustness/confidence in contamination detection (methodological convergence)
    • Details: 0.18
  • Triangulation across methods reduces false positives and false negatives inherent to any single contamination-detection approach.
    • Direction: positive · Confidence: medium
    • Outcome (Error Rate): expected reduction in detection error (false positives/negatives) via multi-method approach
    • Details: 0.11
  • Results are specific to MMLU; contamination levels and effects may differ on other benchmarks or newer models.
    • Direction: null_result · Confidence: high
    • Outcome (Other): generalizability of contamination findings to other benchmarks/models
    • Details: n=513; 0.18
  • Complete provenance of training data is often unavailable, so contamination detection is imperfect and some leakage may be undetectable (or overestimated in some categories).
    • Direction: null_result · Confidence: high
    • Outcome (Other): uncertainty in contamination detection accuracy due to incomplete provenance
    • Details: 0.18
  • Leaderboard-based performance is a noisy signal of true capability; contamination can bias model comparisons and distort economic valuation, procurement, and investment decisions.
    • Direction: negative · Confidence: medium
    • Outcome (Decision Quality): reliability of leaderboard-based signals for valuation and procurement decisions
    • Details: 0.11
  • Practical recommendation: buyers and evaluators should demand contamination audits (triangulating lexical, paraphrase, and behavioral probes) and report both raw and contamination-adjusted scores, especially for high-stakes use.
    • Direction: positive · Confidence: medium
    • Outcome (Governance And Regulation): improvement in evaluation reliability when contamination audits and adjusted reporting are adopted (recommended practice)
    • Details: 0.11
  • Models trained on publicly mirrored benchmark content provide limited marginal value compared to genuinely novel, high-quality data; high memorization tendency correlates with brittleness and lower generalization value.
    • Direction: negative · Confidence: medium
    • Outcome (Output Quality): relative marginal value of contaminated/benchmark-mirrored training data versus novel data; model robustness/generalization
    • Details: 0.11
