Frontier AI can help spot and fix errors in economic theory when paired with an expert, but it cannot yet autonomously refute published proofs; in tests on four flawed papers, ChatGPT Pro performed best with substantial human guidance while other models often failed, and potential training-data contamination clouds interpretation.

Can AI Refute Economic Theory? Evidence from Beyond the Knowledge Cutoff

Alexis Akira Toda · June 03, 2026

arXiv · descriptive · low evidence · relevance 7/10

Frontier LLMs can sometimes construct counterexamples or corrected proofs when heavily guided by a competent human, but they did not autonomously detect true theoretical errors and results are limited by small sample size and possible data contamination.

Can artificial intelligence (AI) refute economic theory? I document experiments in which I asked several AI models (Gemini, Refine, Claude, and ChatGPT) to check the correctness of four published papers in economic theory, each containing an error that I helped identify or correct. ChatGPT Pro performed best, occasionally constructing counterexamples and corrected proofs, while other models fared worse. However, no model located a true error without substantial human guidance, and data contamination complicates interpretation. I argue that a competent human paired with a frontier model can outperform current peer review, but AI cannot yet refute economic theory on its own.

Summary

Main Finding

Frontier large language models (LLMs) can substantially aid the checking of mathematical arguments in economic theory, sometimes producing counterexamples and corrected proofs, but they do not yet reliably discover genuine errors on their own. A competent human working with a top model (ChatGPT 5.5 Pro in these experiments) can outperform standard refereeing, yet AI cannot refute economic theory without substantial human guidance, and training-data contamination must be handled carefully when interpreting any apparent success.

Key Points

Experiment scope: Toda tested several LLMs (Gemini 3.5 Flash and 3.1 Pro, Refine, Claude Opus 4.8, ChatGPT 5.5 Pro) on four published economic-theory papers known to contain errors that he helped identify or correct (Tirole 1985; Kocherlakota 1992; Miao & Wang 2018; Stachurski & Toda 2019).
Best performer: ChatGPT 5.5 Pro gave the strongest results—flagging issues from a single prompt in some cases, constructing explicit counterexamples, and producing corrected proofs.
Mixed performance: Claude was relatively stronger on economic interpretation and computation but weaker on formal reasoning; Gemini often produced plausible-sounding but flawed arguments; Refine’s chat was useful but its paid reports could be nitpicky or incomplete.
Human guidance required: In the crucial Tirole (1985) case — whose correction had only just been published — no model located the true error without iterative human steering. Initial model responses typically endorsed the (flawed) proofs until the author directed attention to the problematic steps.
Knowledge-contamination caveat: Because many target papers and corrections had circulated, the models’ apparent performance may partially reflect retrieval of training-data content rather than genuine reasoning. Toda took care in the Tirole experiment to disable memory and web search and ran it soon after the correction’s publication; he argues the counterexample produced by ChatGPT Pro there was genuinely constructed and different from the published correction.
Costs and tooling: Tools varied in price and capability (Refine ≈ $50 per report, Claude Opus ≈ $20/month, ChatGPT Pro ≈ $100/month). Computational and numerical checks were useful (e.g., Claude), but formal reasoning gaps and hallucinations remained common.

Data & Methods

Papers tested (deliberately chosen because the author helped identify/correct known errors):
- Tirole (1985), Proposition 1(c) — error corrected in Pham & Toda (2026)
- Kocherlakota (1992), Proposition 4 — corrigendum in Kocherlakota & Toda (2023)
- Miao & Wang (2018) — definitional/interpretation issues addressed in subsequent exchanges
- Stachurski & Toda (2019), Proposition 5 — corrected in Stachurski & Toda (2020)
Models evaluated: Google Gemini (3.5 Flash, 3.1 Pro), Refine, Anthropic Claude Opus 4.8, OpenAI ChatGPT 5.5 Pro.
Protocol:
- Upload PDF of paper to model interfaces.
- Initial prompt: “Check the mathematical correctness of the key result / specified proposition.”
- Iterative challenge: If the model endorsed the result, the experimenter probed specific logical steps, pointed out suspected gaps, steered the model toward the problematic argument, and asked for counterexamples, numerical checks, or corrected proofs.
- Contamination control: For the Tirole (1985) run with ChatGPT Pro, both memory and web-search were disabled to reduce the chance of retrieval from post-cutoff material.
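The protocol above can be sketched as a simple loop. This is a minimal illustration, not the author's tooling: `run_protocol`, `ask_model`, `endorses`, and the probe strings are hypothetical names, and the model is represented by any callable that maps a prompt to a reply. In the experiments the "endorses" judgment was made by the human expert, so here it is passed in as a function.

```python
from typing import Callable, List, Tuple


def run_protocol(
    ask_model: Callable[[str], str],  # hypothetical: sends a prompt, returns the reply
    endorses: Callable[[str], bool],  # hypothetical: does the reply accept the proof?
    initial_prompt: str,
    probes: List[str],                # human follow-ups, ordered from vague to specific
) -> Tuple[bool, int, List[str]]:
    """Return (flagged, probes_used, transcript).

    flagged is True if the final reply no longer endorses the result.
    probes_used == 0 means the model objected from the initial prompt alone.
    """
    transcript = [ask_model(initial_prompt)]
    probes_used = 0
    for probe in probes:
        if not endorses(transcript[-1]):
            break  # the model already questions the result; stop steering
        transcript.append(ask_model(probe))
        probes_used += 1
    flagged = not endorses(transcript[-1])
    return flagged, probes_used, transcript
```

Recording `probes_used` alongside the outcome keeps the amount of human steering visible, which matters because the setup measures the human+AI workflow rather than model autonomy.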
Evaluation criteria: ability to (a) detect the logical flaw, (b) construct or suggest counterexamples, (c) produce corrected, rigorous proofs, and (d) correctly interpret economic definitions (e.g., what counts as a Santos–Woodford rational bubble).
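One way to record a run against criteria (a) to (d) is a small record type. This is a sketch under assumptions: the class `RunResult`, its field names, and the `human_probes` counter are hypothetical and do not come from the paper, which reports its assessments qualitatively.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    model: str
    paper: str
    detected_flaw: bool            # criterion (a)
    gave_counterexample: bool      # criterion (b)
    gave_corrected_proof: bool     # criterion (c)
    interpreted_definitions: bool  # criterion (d)
    human_probes: int              # steering prompts issued before (a) was met

    def criteria_met(self) -> int:
        """Count how many of the four criteria the run satisfied."""
        return sum([
            self.detected_flaw,
            self.gave_counterexample,
            self.gave_corrected_proof,
            self.interpreted_definitions,
        ])

    def autonomous(self) -> bool:
        """True only if the flaw was detected with no human steering."""
        return self.detected_flaw and self.human_probes == 0


# Placeholder values for illustration only; not results from the paper.
example = RunResult("model-X", "paper-Y", True, True, False, True, human_probes=3)
```

Separating `autonomous()` from `criteria_met()` reflects the distinction the study draws between what a model can produce once steered and what it finds on its own.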
Limitations of the experimental design:
- The tests were not blind: the author knew the errors and actively steered the models, so the setup measures the human+AI workflow more than model autonomy.
- Training-data overlap makes it difficult to attribute correct model outputs to reasoning vs. memorization except where contamination was controlled.

Implications for AI Economics

Practical role today: LLMs are effective assistants for checking and repairing proofs once a human identifies a promising lead. They can reduce the labor and cost of verification and may speed up discovery/correction cycles when paired with expert guidance.
Peer review: A competent referee using a frontier model can arguably outperform typical refereeing outcomes for formal correctness, suggesting journals and referees should incorporate AI-assisted checks into review workflows.
Limits & risks:
- Autonomous error discovery is not yet reliable: models often produce plausible—but incorrect—arguments or require iterative human probing to locate deep flaws.
- Hallucination and overconfidence: models may assert incorrect technical claims (e.g., unjustified subsequence arguments, mistaken monotonicity implications) and should not be trusted without independent verification.
- Training-data contamination: apparent reasoning successes may reflect retrieval of post-cutoff corrections or online discussions; claims that an AI “found” an error must be conditioned on contamination checks (disable web search, check for identical published corrections).
- Cost and access: the best-performing configurations are behind paid tiers; equitable access and reproducibility are concerns.
Recommendations for practice and research:
- Use LLMs as part of a human-in-the-loop workflow: have domain experts steer prompts, request formalizations, and validate outputs.
- Combine LLMs with formal proof assistants and numerical solvers when possible to reduce ambiguity and increase rigor.
- For evaluation of model reasoning, design contamination-robust benchmarks and disclose steps taken to avoid retrieval (e.g., disabling web search, noting knowledge cutoffs).
- Journals should consider deploying AI screening for obvious technical gaps and encourage authors to submit machine-checkable appendices or formal proofs where feasible.
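The contamination disclosures recommended above could be captured as a checklist attached to each run. The sketch below is an assumption-laden illustration: `RunSetup`, `contamination_risks`, and the dates in the usage example are hypothetical, and a real check would also need to account for working-paper versions that circulated before formal publication.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class RunSetup:
    knowledge_cutoff: date      # the model's stated training cutoff (assumed known)
    correction_public: date     # earliest date the correction circulated publicly
    web_search_enabled: bool
    memory_enabled: bool


def contamination_risks(setup: RunSetup) -> List[str]:
    """List the reasons a correct answer might reflect retrieval, not reasoning."""
    risks = []
    if setup.correction_public <= setup.knowledge_cutoff:
        risks.append("correction may be in the training data")
    if setup.web_search_enabled:
        risks.append("web search could retrieve the correction")
    if setup.memory_enabled:
        risks.append("chat memory could carry over earlier hints")
    return risks


# Hypothetical dates for illustration; not taken from the paper.
controlled = RunSetup(date(2025, 1, 1), date(2025, 6, 1), False, False)
print(contamination_risks(controlled))  # an empty list means no flagged risk
```

An empty list does not prove the output was genuinely reasoned; it only records that the known retrieval channels were closed for that run.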
Outlook: Rapid advances in model capabilities (and integration with formal tools) make it plausible that future systems will autonomously detect and sometimes correct errors in economic theory. For now, the most productive deployment is expert+model collaboration rather than model-alone refutation.

Brief concluding takeaway: LLMs are already powerful aids that can materially improve error detection and proof repair in economics when wielded by knowledgeable researchers, but they are not yet independent refuters of published theory.

Assessment

Paper Type: descriptive
Evidence Strength: low. Very small, non-random sample (four papers) selected by the author (selection bias), no counterfactual or control condition, heavy human-in-the-loop prompting, and potential training-data contamination make it difficult to draw general causal or robust empirical conclusions.
Methods Rigor: low. Informal, small-N prompt experiments without pre-registration or standardized scoring; subjective judgments about whether models located errors; lack of systematic variation in prompts, model versions, or blinded evaluation increases risk of researcher degrees of freedom and biases.
Sample: Prompted evaluations of several large language models (Gemini, Refine, Claude, and ChatGPT Pro) on four published economic-theory papers that contained known errors (errors identified or corrected by the author); experiments involved iterative human guidance and attempted model discovery of counterexamples or proof corrections; possible overlap between some papers and models' training data.
Themes: human_ai_collab, productivity
Generalizability:
- Very small sample size (four papers) limits external validity
- Selection bias: papers were chosen because the author had identified errors
- Results apply to the specific models and versions tested and may not generalize to future or other models
- Human-in-the-loop prompting was central, so findings do not generalize to fully autonomous model use
- Potential training-data contamination (models may have seen the papers) undermines claims about independent model discovery
- Domain-specific to formal economic theory proofs; not necessarily applicable to empirical economics or other disciplines

Claims (8)

Claim	Category	Direction	Confidence	Outcome	Details
I conducted experiments in which I asked several AI models (Gemini, Refine, Claude, and ChatGPT) to check the correctness of four published papers in economic theory.	Other	null_result	high	existence of experiments using specified models on 4 papers	n=4 0.18
Each of the four published papers used in the experiments contained an error that I helped identify or correct.	Other	null_result	high	presence of errors in the 4 target papers	n=4 0.18
ChatGPT Pro performed best among the tested models, occasionally constructing counterexamples and corrected proofs.	Output Quality	positive	high	output_quality (ability to construct counterexamples and corrected proofs)	n=4 0.18
Other models (Gemini, Refine, Claude) fared worse than ChatGPT Pro at these tasks.	Output Quality	negative	high	output_quality (relative performance across models)	n=4 0.18
No model located a true error without substantial human guidance.	Error Rate	negative	high	error_detection_without_human_guidance	n=4 0.18
Data contamination (training-data overlap) complicates interpretation of the models' performance.	Other	mixed	high	validity_of_experimental_interpretation_due_to_data_contamination	0.18
A competent human paired with a frontier model can outperform current peer review.	Research Productivity	positive	medium	effectiveness_of_error_detection_relative_to_peer_review	n=4 0.02
AI cannot yet refute economic theory on its own.	Other	negative	high	autonomous_theory_refutation_capability	n=4 0.18