AI writing assistants are seeding academic papers with fake citations: an audit of 111 million references finds a surge in non-existent bibliographic entries after widespread LLM adoption, concentrated in fields with rapid AI uptake and among small or early-career teams. These hallucinated citations disproportionately credit already-prominent, male scholars and often survive moderation and peer review, risking biased and unreliable knowledge accumulation.
Large language models (LLMs) are known to generate plausible but false information across a wide range of contexts, yet the real-world magnitude and consequences of this hallucination problem remain poorly understood. Here we leverage a uniquely verifiable object - scientific citations - to audit 111 million references across 2.5 million papers in arXiv, bioRxiv, SSRN, and PubMed Central. We find a sharp rise in non-existent references following widespread LLM adoption, with a conservative estimate of 146,932 hallucinated citations in 2025 alone. These errors are diffusely embedded across many papers but especially pronounced in fields with rapid AI uptake, in manuscripts with linguistic signatures of AI-assisted writing, and among small and early-career author teams. At the same time, hallucinated references disproportionately assign credit to already prominent and male scholars, suggesting that LLM-generated errors may reinforce existing inequities in scientific recognition. Preprint moderation and journal publication processes capture only a fraction of these errors, suggesting that the spread of hallucinated content has outpaced existing safeguards. Together, these findings demonstrate that LLM hallucinations are infiltrating knowledge production at scale, threatening both the reliability and equity of future scientific discovery as human and AI systems draw on the existing literature.
Summary
Main Finding
LLM-generated “hallucinations” have materially infiltrated the scientific literature: using citation existence as a verifiable signal, the authors document a sharp rise in non-existent (unverifiable) references after widespread LLM adoption. Conservatively, the four corpora studied would contain ~146,932 hallucinated citations in 2025 alone. Hallucinations are diffusely embedded across many papers (often as a few bad references per paper), are concentrated in fields with rapid AI uptake, correlate with linguistic signatures of LLM use, are disproportionately produced by smaller or early‑career teams, and tend to allocate credit toward already-prominent and male scholars. Existing moderation and peer‑review processes catch only a minority of these errors, and most hallucinations persist into the published record.
Key Points
- Scale and timing
  - A sudden increase in unmatched (unverifiable) citations begins in 2023 and accelerates in mid‑2024, with steep growth through Aug 2025.
  - Estimated excess unmatched rate (interpreted as hallucinations) as of Aug 2025: arXiv 0.39%, bioRxiv 0.21%, SSRN 1.91%, PMC 0.27%.
  - Monthly hallucinated-reference estimates in Aug 2025: arXiv ~3,353; bioRxiv ~478; SSRN ~767; PMC ~8,140. Extrapolated to 2025 (four corpora): ~146,932 hallucinated citations.
- Distributional pattern
  - Hallucinations are not concentrated in a few manuscripts; many papers contain a modest number of hallucinated citations (the rise is mainly in papers with 0–10% unmatched refs).
  - Rates are higher in social sciences and computer science; estimated LLM use and hallucination rates are positively correlated at the subfield level (r = 0.441).
- Who produces them
  - “Hallucination citers” tend to be less-experienced authors, with far fewer prior publications (e.g., ~62% fewer prior papers in arXiv/bioRxiv; up to ~73% fewer in SSRN).
  - Hallucination citers exhibit rapid productivity increases by 2025 (relative increases of ~1.3–3.1× depending on corpus).
  - Hallucination rates fall sharply with team size.
- Who benefits
  - Hallucinated references often invent authors but, when mapped to real profiles, disproportionately credit high‑productivity and high‑impact authors (e.g., ~68.8% more publications; ~58.3% more citations) and skew toward male names (~6.4 percentage point increase).
  - Valid citations appearing in papers that also contain hallucinations also skew toward prominent scholars, suggesting LLMs reallocate credit toward incumbents even when not fabricating.
- Failure of safeguards
  - arXiv moderation rejects manuscripts with higher hallucination rates, but an estimated 78.8% of non-existent citations still pass moderation and appear on the platform.
  - Tracing bioRxiv → published versions shows ~85.3% of preprint hallucinations persist into the published record.
  - Hallucinations appear across journal impact deciles; lower‑decile journals show higher rates and the highest‑decile journals lower rates, but most journals remain susceptible.
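The monthly figures under "Scale and timing" can be cross-checked with simple arithmetic. The sketch below sums the four Aug 2025 estimates and compares a naive annualization with the reported 2025 total. The summary does not state how the annual figure was extrapolated, so the "every month equals August" assumption is illustrative only and is not the authors' method.

```python
# Monthly hallucinated-reference estimates for Aug 2025, as reported above.
monthly_aug_2025 = {"arXiv": 3353, "bioRxiv": 478, "SSRN": 767, "PMC": 8140}

# Annual estimate reported for the four corpora combined.
reported_2025_total = 146_932

# Combined monthly estimate across the four corpora.
monthly_total = sum(monthly_aug_2025.values())

# Naive annualization: ASSUMES every month of 2025 matched the August level.
naive_annual = monthly_total * 12

# Average monthly count implied by the reported annual figure.
implied_monthly_avg = reported_2025_total / 12

print(f"Aug 2025 total across corpora: {monthly_total:,}")
print(f"Naive annualization (x12):     {naive_annual:,}")
print(f"Implied monthly average:       {implied_monthly_avg:,.0f}")
```

The reported annual total sits somewhat below the naive figure, which is what one would expect if counts earlier in 2025 were lower than in August, in line with the steep growth described above.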
Data & Methods
- Datasets (Jan 2020–Aug 2025, unless noted)
  - arXiv: 1,465,145 preprints; ~44,107,529 citations (60.9% LaTeX source; remainder via GROBID from PDFs).
  - bioRxiv: 261,928 preprints; 21,183,111 XML citations.
  - SSRN: 421,698 preprints; 26,815,043 citations (Crossref metadata for SSRN DOIs).
  - PubMed Central (PMC): 10% random sample of recent papers, comprising 374,807 manuscripts and 19,245,787 references.
- Hallucination detection pipeline
  - Built a local Elasticsearch index from Semantic Scholar and OpenAlex for title matching.
  - Initial match rate ~95.1%. Unmatched cases manually inspected: many were non-academic items or parsing errors.
  - Applied LLM-based cleaning (GPT-4o-mini) to remove non-academic or malformed references; refined title extraction to reduce unmatched rate to ~1.54%.
  - Final verification step: Google Scholar API lookup to capture items outside the indexes; remaining unverifiable references labeled “unmatched.”
  - Conservative focus: only non-existent titles (i.e., factual invalidity of the reference title) are counted; measurement errs on being a lower bound.
- Identification strategy
  - Use pre‑LLM unmatched citation rate as baseline for routine bibliographic/matching errors; interpret post‑LLM excess as population‑level signature of hallucination.
  - Regression framework and mixture model to separate baseline match failures (q) from LLM‑hallucinated probability (p). Controls for field, time, and other confounders; robustness and manual validations reported in SI.
  - LLM‑use intensity inferred via linguistic signatures in abstracts (standard methods) and correlated with hallucination incidence.
  - Author linking/disambiguation: name extraction + text‑based paper embeddings to link hallucinated cited names to real author profiles conservatively.
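The identification idea, separating baseline match failures (q) from hallucination probability (p), can be illustrated with a two-component mixture. The sketch assumes that a hallucinated reference is always unmatched and that a real reference fails matching with probability q, so the observed unmatched rate is u = p + (1 - p) * q. This functional form, the symbol u, and the function name `hallucination_share` are assumptions made for illustration. The paper's actual model includes regression controls and is documented in its SI.

```python
def hallucination_share(unmatched_rate: float, baseline_rate: float) -> float:
    """Solve u = p + (1 - p) * q for p.

    unmatched_rate (u): observed post-LLM share of unmatched references.
    baseline_rate (q):  pre-LLM share of unmatched references, treated as
                        routine bibliographic or matching error.
    Returns p, the implied share of hallucinated references, floored at 0.
    This mixture form is an illustrative assumption, not the paper's model.
    """
    if not 0.0 <= baseline_rate < 1.0:
        raise ValueError("baseline_rate must be in [0, 1)")
    if not 0.0 <= unmatched_rate <= 1.0:
        raise ValueError("unmatched_rate must be in [0, 1]")
    p = (unmatched_rate - baseline_rate) / (1.0 - baseline_rate)
    # An observed rate below baseline gives no evidence of hallucination.
    return max(0.0, p)


# Hypothetical inputs, not taken from the paper.
q_example = 0.015  # pre-LLM unmatched rate
u_example = 0.019  # post-LLM unmatched rate
p_example = hallucination_share(u_example, q_example)
print(f"Implied hallucination share: {p_example:.4%}")
```

When q is small, p is close to the simple difference u - q, which is why the post-LLM excess over the baseline can be read directly as a population-level hallucination rate.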
Implications for AI Economics
- Knowledge‑production externalities and cumulative risk
  - Even modest rates of hallucinated citations create systemic negative externalities because science is cumulative: fabricated references can mislead subsequent synthesis, meta‑analysis, automated summarizers, and downstream policy/industry decisions.
  - The persistence of hallucinations into published literature means market actors (funders, firms, policy bodies) and AI systems that scrape literature will internalize and amplify false signals.
- Redistribution of scientific credit and incentives
  - LLMs appear to skew citation patterns toward established, high‑status authors (and toward male names), potentially reinforcing incumbency advantages in reputation, hiring, funding, and bibliometric metrics (h‑index, citation counts).
  - This can alter incentives in academic labor markets and funding allocation: more citations to incumbents feed a feedback loop, concentrating attention and resources.
  - At the same time, LLMs lower production costs for less‑experienced researchers, increasing output but with higher contamination, which produces a tradeoff between quantity and reliability.
- Market for verification and detection services
  - There is likely to be growing demand (and economic opportunity) for scalable verification tools, provenance trackers, and citation‑validation services (for journals, preprint servers, institutional repositories, and AI tool vendors).
  - Platforms and publishers face potential liability or reputational costs if they fail to contain hallucinations; this creates market incentives for third‑party validators and certification services.
- Platform and regulatory implications
  - Existing editorial and moderation systems are inadequate at scale; platform-level interventions (automated citation verification, provenance metadata for AI‑assisted writing, mandatory disclosure of tool use) could be economically justified.
  - Policymakers and funders may consider standards for AI‑assisted scholarly work, reporting requirements, or audits, creating compliance costs and new governance markets.
- Measurement and research evaluation distortions
  - Widespread LLM influence on citation behavior undermines the reliability of citation‑based metrics used in hiring, promotion, and funding decisions; measurement models and evaluation algorithms need to be adjusted to account for AI‑driven distortions.
- Heterogeneous effects across fields and actors
  - Fields with rapid LLM uptake (CS, some social sciences) and smaller teams are especially exposed, implying uneven economic impacts across disciplines and institutions.
  - Early‑career researchers are both disproportionately producing hallucinations and benefiting from increased productivity; policies should weigh short‑term productivity gains against longer‑term reputational and epistemic costs.
- Policy and market recommendations (high‑level)
  - Build and deploy automated, open, standardized citation‑verification pipelines integrated into authoring tools, preprint servers, and journal submission systems.
  - Require provenance metadata and disclosure for AI assistance in manuscript preparation (enforceable by journals/funders).
  - Invest in public, comprehensive bibliographic indexes and cross‑platform APIs to reduce false negatives/positives and lower verification costs.
  - Fund research into incentive‑compatible verification mechanisms and into methods that reduce biased outputs from LLMs (e.g., debiasing, calibrated retrieval).
  - Reassess evaluation metrics and hiring/funding criteria to guard against AI‑amplified incumbency biases.
  - Create marketplaces for third‑party verification and reputation services and consider liability frameworks for AI tool providers and platforms.
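As a rough illustration of what an automated citation-verification step might involve, the sketch below normalizes reference titles and fuzzy-matches them against a known-title index. It is a toy stand-in, not the authors' pipeline: the in-memory list replaces a real bibliographic index, `difflib.SequenceMatcher` replaces a search engine, and the function names and the 0.9 threshold are arbitrary assumptions.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize_title(title: str) -> str:
    """Lowercase, strip accents, and collapse punctuation to single spaces."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def best_match(title, index_titles, threshold=0.9):
    """Return (matched_title, score); matched_title is None below threshold."""
    query = normalize_title(title)
    best_title, best_score = None, 0.0
    for candidate in index_titles:
        score = SequenceMatcher(None, query, normalize_title(candidate)).ratio()
        if score > best_score:
            best_title, best_score = candidate, score
    if best_score >= threshold:
        return best_title, best_score
    return None, best_score


def unmatched_references(reference_titles, index_titles, threshold=0.9):
    """List the reference titles that have no sufficiently close index entry."""
    return [
        title
        for title in reference_titles
        if best_match(title, index_titles, threshold)[0] is None
    ]


# Toy index of real titles; the second reference is deliberately made up.
index = [
    "Attention Is All You Need",
    "Deep Residual Learning for Image Recognition",
]
references = [
    "Attention is all you need.",
    "A Unified Theory of Citation Hallucination in Transformers",
]
flagged = unmatched_references(references, index)
print(flagged)
```

A flagged title is only a candidate for manual review: as the methods section notes, unmatched references can also be non-academic items or parsing errors, so a production system would need further cleaning and a fallback lookup before any reference is labeled non-existent.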
Limitations to keep in mind
- Conservative focus on non‑existent titles means the estimates are likely lower bounds; other hallucination types (misattributed claims, incorrect page/volume, mischaracterized findings) are not captured.
- LLM use is inferred indirectly (linguistic signatures); causality is supported by timing and correlations but direct causal attribution to specific tools is not established at the individual manuscript level.
- Matching and disambiguation use conservative linking criteria; some real citations outside indexed sources could be misclassified despite multi‑step verification.
Summary takeaway
LLM hallucinations have begun to penetrate scientific citation networks at scale, creating measurable epistemic and distributional distortions. This poses economic externalities across knowledge markets, incentives, evaluation systems, and platform governance, and creates immediate demand for verification infrastructure, updated norms, and policy interventions to preserve the reliability and equity of scientific production.
Assessment
Claims (9)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| We audited 111 million references across 2.5 million papers in arXiv, bioRxiv, SSRN, and PubMed Central. (Category: Other) | null_result | high | number of references audited / dataset coverage | n=111000000; "111 million references across 2.5 million papers"; 0.8 |
| We find a sharp rise in non-existent references following widespread LLM adoption. (Category: Output Quality) | positive | high | prevalence of non-existent (hallucinated) references over time | n=111000000; 0.48 |
| We provide a conservative estimate of 146,932 hallucinated citations in 2025 alone. (Category: Output Quality) | positive | high | count of hallucinated citations in 2025 | n=111000000; "146,932 hallucinated citations in 2025 alone"; 0.8 |
| These errors are diffusely embedded across many papers but especially pronounced in fields with rapid AI uptake. (Category: Output Quality) | positive | high | rate/prevalence of hallucinated references by research field | n=2500000; 0.48 |
| Hallucinated references are especially pronounced in manuscripts with linguistic signatures of AI-assisted writing. (Category: Output Quality) | positive | high | association between AI-writing linguistic signatures and presence of hallucinated references | n=2500000; 0.48 |
| Hallucinated references are especially pronounced among small and early-career author teams. (Category: Research Productivity) | positive | high | rate of hallucinated references by team size and author career stage | n=2500000; 0.48 |
| Hallucinated references disproportionately assign credit to already prominent and male scholars, suggesting LLM-generated errors may reinforce existing inequities in scientific recognition. (Category: Inequality) | positive | high | distribution of (hallucinated) citation credit by cited-author prominence and gender | n=111000000; 0.48 |
| Preprint moderation and journal publication processes capture only a fraction of these errors. (Category: Governance And Regulation) | negative | high | fraction of hallucinated references detected/removed by moderation and publication processes | n=2500000; 0.48 |
| LLM hallucinations are infiltrating knowledge production at scale, threatening both the reliability and equity of future scientific discovery as human and AI systems draw on the existing literature. (Category: Output Quality) | negative | high | risk to reliability and equity of scientific discovery (qualitative assessment) | 0.08 |