Leaderboard accuracy for frontier LLMs is materially inflated: about 13.8% of MMLU items show evidence of training-data exposure, boosting reported scores (particularly in STEM/professional subjects) and creating memorization fingerprints that can mislead buyers, investors and benchmark-driven procurement.
Public leaderboards increasingly suggest that large language models (LLMs) surpass human experts on benchmarks spanning academic knowledge, law, and programming. Yet most benchmarks are fully public and their questions are widely mirrored across the internet, creating a systematic risk that models were trained on the very data used to evaluate them. This paper presents three complementary experiments that together form a multi-method contamination audit of six frontier LLMs: GPT-4o, GPT-4o-mini, DeepSeek-R1, DeepSeek-V3, Llama-3.3-70B, and Qwen3-235B. Experiment 1 applies a lexical contamination detection pipeline to 513 MMLU questions across all 57 subjects, finding an overall contamination rate of 13.8% (18.1% in STEM, up to 66.7% in Philosophy) and estimated performance gains of +0.030 to +0.054 accuracy points by category. Experiment 2 applies a paraphrase and indirect-reference diagnostic to 100 MMLU questions, finding that accuracy drops by an average of 7.0 percentage points under indirect reference, rising to 19.8 pp in both Law and Ethics. Experiment 3 applies TS-Guessing behavioral probes to all 513 questions and all six models, finding that 72.5% trigger memorization signals far above chance, with DeepSeek-R1 displaying a distributed memorization signature (76.6% partial reconstruction, 0% verbatim recall) that explains its anomalous Experiment 2 profile. All three experiments converge on the same contamination ranking: STEM > Professional > Social Sciences > Humanities.
Summary
Main Finding
Leaderboards overstate frontier LLM “intelligence” because a large and structurally non‑uniform fraction of benchmark items have been seen (exactly or in paraphrase) during model training. Using three complementary, reproducible methods on six frontier models (GPT-4o, GPT-4o‑mini, DeepSeek‑R1, DeepSeek‑V3, Llama‑3.3‑70B, Qwen3‑235B) and MMLU items, the authors show (1) lexical/web contamination is widespread (Experiment 1), (2) model accuracy drops when surface wording is changed (Experiment 2), and (3) behavioral probes reveal internal memorization signals in most items (Experiment 3). The methods converge on the same subject ranking (STEM most contaminated), and the aggregate evidence implies that much of reported benchmark performance reflects memorization/pattern completion rather than robust generalization.
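The surface-sensitivity measurement in point (2) reduces to comparing accuracy on original items with accuracy on reworded variants. The following is a minimal sketch of that comparison, not the authors' code; the record layout `(subject, variant, correct)` and the function names are assumptions made for illustration.

```python
from collections import defaultdict


def accuracy_by_variant(records):
    """Return accuracy for each (subject, variant) pair.

    `records` is an iterable of (subject, variant, correct) tuples, where
    variant is e.g. "original", "paraphrase" or "indirect" (assumed labels).
    """
    totals = defaultdict(lambda: [0, 0])  # (subject, variant) -> [correct, seen]
    for subject, variant, correct in records:
        cell = totals[(subject, variant)]
        cell[0] += int(correct)
        cell[1] += 1
    return {key: c / n for key, (c, n) in totals.items()}


def drop_pp(records, subject, variant="indirect"):
    """Accuracy drop in percentage points from original to `variant`."""
    acc = accuracy_by_variant(records)
    return 100.0 * (acc[(subject, "original")] - acc[(subject, variant)])
```

A positive value means the model did worse once the wording changed, which is the pattern the paper reads as evidence of reliance on memorized surface forms.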
Key Points
- Experiment 1 (lexical/web scan): 513 MMLU questions sampled (9 per subject across 57 subjects). An item is flagged as contaminated when its 8-gram overlap with web snippets is ≥30% AND the correct answer is present in those snippets; the overall contamination rate is 13.8%. STEM is the highest category (18.1%); some individual subjects are much higher (Philosophy 66.7%, Anatomy 55.6%). Estimated Performance Gain (EPG) from contamination ≈ +0.040 overall (per ConTAM), up to +0.054 in STEM.
- Experiment 2 (paraphrase / indirect-reference diagnostic): 100 MMLU items across six subjects; for each item the authors generated a paraphrased and an indirect-reference variant and tested all six models at temperature 0. The average accuracy drop when moving to indirect references is 7.0 percentage points; Law and Ethics each drop ≈ 19.8 pp (the largest declines), matching high contamination in Experiment 1. Some models (e.g., GPT-4o) are robust to surface change; others (Qwen3, Llama) show large drops.
- Experiment 3 (TS‑Guessing behavioral probe): TS‑Guessing applied to all 513 sampled MMLU items × 6 models with two tasks:
- Option Mask (OM): ask model to reconstruct a masked wrong choice; primary metric is partial match (≥50% token overlap).
- Word Mask (WM): blank a content word; measure exact reconstruction.
- Flagging rule: OM partial ≥ 0.50 or WM exact = 1 → question flagged. Results: average combined flagging = 72.5% (far above random baselines). Average OM partial ≈ 45.8%; OM exact ≈ 14.2%; WM exact ≈ 47.4%. STEM again ranks highest under behavioral probe (≈55.9%).
- DeepSeek‑R1 anomaly: very low exact recall but very high OM partial (76.6%) and zero WM exact → “distributed memorization” signature (semantic/conceptual storage without verbatim surface retention). This explains its low original accuracy combined with little sensitivity to surface change.
- Multi‑method convergence: web detection, paraphrase sensitivity, and internal behavioral reconstruction produce consistent subject rankings and mutually reinforcing evidence that contamination is pervasive and heterogeneous across domains.
- Practical interpretation: reported high benchmark scores can be materially inflated by memorization/exposure; removing contaminated items reduces reported accuracy and narrows claimed gaps between models and humans.
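The TS-Guessing flagging rule described above (OM partial ≥ 0.50 or WM exact = 1) can be sketched as follows. This is an illustrative sketch only: the summary does not specify how token overlap is computed, so the tokenizer and the choice to measure the fraction of reference tokens recovered by the model are assumptions, as are all function names.

```python
import re


def tokens(text):
    """Lowercased alphanumeric tokens (assumed tokenization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def om_partial_overlap(predicted, reference):
    """Fraction of the masked option's tokens that the model reproduced."""
    ref = tokens(reference)
    if not ref:
        return 0.0
    pred = set(tokens(predicted))
    return sum(1 for t in ref if t in pred) / len(ref)


def wm_exact(predicted, reference):
    """True when the model fills in exactly the blanked word."""
    return predicted.strip().lower() == reference.strip().lower()


def is_flagged(om_pred, om_ref, wm_pred, wm_ref, threshold=0.50):
    """Flag a question if either probe signals memorization."""
    return (om_partial_overlap(om_pred, om_ref) >= threshold
            or wm_exact(wm_pred, wm_ref))
```

Because the rule is a logical OR, a model with no exact word recall can still be flagged through partial option reconstruction alone, which is the pattern reported for DeepSeek-R1.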
Data & Methods
- Datasets and sampling
- MMLU test split sampled: 513 questions (9 per subject across 57 subjects) for Experiments 1 & 3; a separate 100-item sample (≈17 per subject across 6 subjects) for Experiment 2. Random seed 42; overlap between the Experiment 1 and Experiment 2 samples was avoided.
- Experiment 1 — Lexical contamination detection
- Query method: first 150 characters of question issued to Tavily web search; top‑5 result snippets collected.
- Contamination flag conditions: (a) fraction of question 8‑grams found in combined snippets > 0.30; AND (b) correct answer text present verbatim in snippets.
- EPG computed: acc × contamination_rate × 0.4 per ConTAM (Singh et al. 2024).
- Conservative lower‑bound estimate (web index not identical to pretraining corpora; excludes paraphrases).
- Experiment 2 — Paraphrase and indirect‑reference diagnostic
- For each sampled question, the authors generated two variants using GPT-4o:
- Paraphrase: different wording, identical knowledge requirement and correct answer.
- Indirect reference: key entities replaced by descriptions/associations.
- All six models evaluated on original, paraphrase, indirect at temperature 0.
- Main metric: accuracy change (original → paraphrase/indirect); subject‑level aggregation.
- Experiment 3 — TS‑Guessing behavioral probe
- Two tasks for each question/model pair (513 questions × 6 models = 3,078 pairs per task, 6,156 probes across both tasks):
- Option Mask (OM): mask one wrong answer choice; model told which option is correct and asked to reconstruct masked wrong option. Report OM exact and OM partial (≥50% token overlap); partial is primary.
- Word Mask (WM): blank a long content word in stem; model fills in missing word; report exact matches.
- Flagging rule: OM partial ≥0.50 OR WM exact = 1 → question flagged as memorized.
- Random baselines: OM exact ≈ 0%, OM partial ≈ 5% (empirical), WM exact ≈ 0% given the vocabulary size.
- Models tested: GPT‑4o, GPT‑4o‑mini, DeepSeek‑R1, DeepSeek‑V3, Llama‑3.3‑70B, Qwen3‑235B. All probes run via public APIs; procedures reproducible from open benchmarks and API calls.
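The Experiment 1 flagging conditions listed above can be illustrated with a short sketch. It is not the authors' pipeline: the web search step is left out (snippets are passed in as plain strings), and the tokenizer, the per-snippet n-gram handling, and all function names are assumptions. The strict `> 0.30` comparison follows the condition as stated in this section.

```python
import re


def tokens(text):
    """Lowercased alphanumeric tokens (assumed tokenization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(toks, n=8):
    """Set of contiguous n-token windows."""
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}


def ngram_overlap(question, snippets, n=8):
    """Fraction of the question's n-grams found in any snippet."""
    q_grams = ngrams(tokens(question), n)
    if not q_grams:  # question shorter than n tokens
        return 0.0
    seen = set()
    for snippet in snippets:
        # n-grams are built per snippet so none span two snippets
        seen |= ngrams(tokens(snippet), n)
    return len(q_grams & seen) / len(q_grams)


def is_contaminated(question, answer, snippets, n=8, threshold=0.30):
    """Condition (a) AND condition (b) from the list above."""
    answer_present = any(answer.lower() in s.lower() for s in snippets)
    return ngram_overlap(question, snippets, n) > threshold and answer_present
```

In the setup described above, `snippets` would hold the top-5 results returned for a query built from the first 150 characters of the question, for example `question[:150]`.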
Implications for AI Economics
- Measurement and valuation of capabilities
- Benchmarks are a primary signal used by investors, customers, and markets to value model capabilities. Systematic contamination and memorization mean benchmark scores can materially overstate true generalization. Economic valuations, commercial contracts, and go‑to‑market claims that rely on unadjusted benchmark numbers risk mispricing and misallocation of capital.
- Recommendation: discount headline benchmark performance by contamination‑adjusted EPGs or require reporting of decontaminated benchmarks and robustness measures when valuing models.
- Incentives and market structure
- Leaderboard incentives encourage “benchmaxxing” (including evaluation items in training/fine‑tuning or overfitting to test distributions). This creates a moral‑hazard problem: providers can boost scores cheaply without delivering real capability improvements, distorting competition and reducing welfare from R&D spend.
- Policy or procurement that rewards clean, demonstrable generalization (e.g., requiring contamination audits) would shift incentives toward genuine capability improvements.
- Deployment, liability, and sectoral risk
- Sectors relying on high‑stakes performance (law, medicine, finance) are particularly affected: the paper shows Law and Ethics were most surface‑sensitive and contaminated. Claims that models “match experts” in these domains should be treated skeptically unless validated on contamination‑free or adversarially paraphrased testbeds. Misplaced trust can generate economic losses, liability exposure, and reputational damage.
- Competition and data transparency
- Closed‑model releases limit the use of retrieval‑based contamination detection; behavioral probes (like TS‑Guessing) are necessary and can be standardized. Market participants and regulators should require or incentivize disclosure of training data provenance, contamination audits, and independent robustness testing as part of product certification.
- Labor markets and substitution claims
- If benchmark outperformance is driven by memorization/exposure rather than transferable reasoning, claims about substitutability of human experts are overstated. Economic models forecasting labor displacement or productivity gains should incorporate a contamination‑adjusted estimate of real generalization capacity and the costs of building and validating contamination‑free evaluation.
- Regulatory and procurement implications
- Governments and large buyers should require contamination audits (combining web‑index scans, paraphrase diagnostics, and behavioral probes), mandate reporting of EPG or similar impact metrics, and favor models with demonstrated generalization on decontaminated benchmarks.
- Practical steps for economic actors
- Investors, procurement officers, and corporate buyers: demand contamination‑adjusted metrics, require adversarial/paraphrase robustness tests, and treat high leaderboard scores as preliminary signals, not sole evidence of capability.
- Model developers: publish contamination audits, maintain decontaminated benchmarks (or variants), and invest in robustness testing to build credible claims—this can become a competitive differentiator.
- Regulators: standardize contamination detection reporting, require independent audits for regulated applications, and include benchmark‑cleanliness as a factor in certification.
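To make the discounting recommendation concrete, the sketch below applies the EPG formula given in Data & Methods (accuracy × contamination rate × 0.4) and subtracts it from a headline score. Treating the adjusted score as accuracy minus EPG is an assumption about how the discount would be applied, and the 0.75 accuracy in the usage note is a hypothetical input, not a figure from the paper.

```python
def estimated_performance_gain(accuracy, contamination_rate, factor=0.4):
    """EPG = accuracy x contamination_rate x factor (factor 0.4 per the summary)."""
    return accuracy * contamination_rate * factor


def adjusted_accuracy(accuracy, contamination_rate, factor=0.4):
    """Headline accuracy discounted by the estimated gain (assumed adjustment)."""
    return accuracy - estimated_performance_gain(accuracy, contamination_rate, factor)
```

For example, a hypothetical headline accuracy of 0.75 combined with the reported STEM contamination rate of 0.181 gives an EPG of about 0.054 and an adjusted accuracy of about 0.696.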
Short takeaway: public benchmark wins are necessary but not sufficient evidence of economically valuable AI capability. Contamination is widespread, uneven, and can materially inflate perceived competence—economic stakeholders should factor contamination adjustments and robustness testing into valuation, procurement, and policy.
Assessment
Claims (17)
| Claim | Category | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|---|
| Public leaderboards overstate modern LLM capabilities because substantial portions of benchmark QA items appear in (or are memorized from) training data, inflating measured accuracy. | Output Quality | negative | high | inflation of measured benchmark accuracy / overstatement of model capability | n=513; 0.18 |
| Overall lexical contamination: 13.8% of MMLU items show evidence of exposure in training data. | Output Quality | negative | high | contamination prevalence (fraction of benchmark items with lexical matches) | n=513; 13.8%; 0.18 |
| STEM items show higher lexical contamination (18.1%) relative to the overall rate. | Output Quality | negative | high | category-level contamination prevalence (STEM) | 18.1%; 0.18 |
| Philosophy category exhibited the maximum observed lexical contamination (up to 66.7%). | Output Quality | negative | medium | category-level contamination prevalence (Philosophy) | 66.7%; 0.11 |
| Estimated performance uplift from identified contamination ranges from +0.030 to +0.054 absolute accuracy points by category. | Output Quality | positive | medium | estimated accuracy uplift (absolute accuracy points) attributable to contamination | +0.030 to +0.054 absolute accuracy points (by category); 0.11 |
| Paraphrase / indirect-reference diagnostic: on a 100-question subset, average accuracy dropped by 7.0 percentage points under indirect referencing. | Output Quality | negative | high | mean accuracy drop (percentage points) under paraphrase/indirect prompts | n=100; 7.0 percentage points (average accuracy drop); 0.18 |
| Law and Ethics questions showed the largest paraphrase-induced accuracy drops (19.8 percentage points). | Output Quality | negative | medium | category-specific accuracy drop (percentage points) under paraphrase | 19.8 percentage points (Law & Ethics category drop); 0.11 |
| Behavioral memorization probe (TS-Guessing) signaled memorization above chance for 72.5% of prompts across all models and items. | Output Quality | negative | high | fraction of prompt-model pairs with statistically significant memorization signals | n=3078; 72.5% of prompt-model pairs flagged for memorization above chance; 0.18 |
| DeepSeek-R1 exhibits a distributed memorization signature: 76.6% partial reconstruction rate but 0% verbatim recall on the TS-Guessing probe. | Output Quality | mixed | medium | partial reconstruction rate and verbatim recall rate (per-model) | n=513; 76.6% partial reconstruction rate; 0% verbatim recall (TS-Guessing); 0.11 |
| Contamination ranking is consistent across methods: STEM > Professional domains > Social Sciences > Humanities. | Output Quality | negative | medium | relative contamination ordering across subject domains | n=513; Relative ordering: STEM > Professional domains > Social Sciences > Humanities; 0.11 |
| Convergence of the three complementary methods (lexical, paraphrase, behavioral) strengthens confidence that contamination is real and systematically inflates scores. | Output Quality | positive | high | robustness/confidence in contamination detection (methodological convergence) | 0.18 |
| Triangulation across methods reduces false positives and false negatives inherent to any single contamination-detection approach. | Error Rate | positive | medium | expected reduction in detection error (false positives/negatives) via multi-method approach | 0.11 |
| Results are specific to MMLU; contamination levels and effects may differ on other benchmarks or newer models. | Other | null_result | high | generalizability of contamination findings to other benchmarks/models | n=513; 0.18 |
| Complete provenance of training data is often unavailable, so contamination detection is imperfect and some leakage may be undetectable (or overestimated in some categories). | Other | null_result | high | uncertainty in contamination detection accuracy due to incomplete provenance | 0.18 |
| Leaderboard-based performance is a noisy signal of true capability; contamination can bias model comparisons and distort economic valuation, procurement, and investment decisions. | Decision Quality | negative | medium | reliability of leaderboard-based signals for valuation and procurement decisions | 0.11 |
| Practical recommendation: buyers and evaluators should demand contamination audits (triangulating lexical, paraphrase, and behavioral probes) and report both raw and contamination-adjusted scores, especially for high-stakes use. | Governance And Regulation | positive | medium | improvement in evaluation reliability when contamination audits and adjusted reporting are adopted (recommended practice) | 0.11 |
| Models trained on publicly mirrored benchmark content provide limited marginal value compared to genuinely novel, high-quality data; high memorization tendency correlates with brittleness and lower generalization value. | Output Quality | negative | medium | relative marginal value of contaminated/benchmark-mirrored training data versus novel data; model robustness/generalization | 0.11 |