AI writing assistants are seeding academic papers with fake citations: an audit of 111 million references finds a surge in non-existent bibliographic entries after widespread LLM adoption, concentrated in fast-AI fields and among small or early-career teams. These hallucinated citations disproportionately credit already-prominent, male scholars and often survive moderation and peer review, risking biased and unreliable knowledge accumulation.

LLM hallucinations in the wild: Large-scale evidence from non-existent citations

Zhenyue Zhao, Yihe Wang, Toby Stuart, Mathijs De Vaan, Paul Ginsparg, Yian Yin · May 08, 2026

arXiv · quasi-experimental · medium evidence · 7/10 relevance

An automated audit of 111 million references in 2.5 million papers documents a sharp rise in non-existent (hallucinated) citations coinciding with LLM uptake. The rise is concentrated in AI-active fields, in manuscripts with linguistic signs of AI assistance, and among small or early-career teams, and the fabricated references skew toward prominent and male scholars.

Large language models (LLMs) are known to generate plausible but false information across a wide range of contexts, yet the real-world magnitude and consequences of this hallucination problem remain poorly understood. Here we leverage a uniquely verifiable object - scientific citations - to audit 111 million references across 2.5 million papers in arXiv, bioRxiv, SSRN, and PubMed Central. We find a sharp rise in non-existent references following widespread LLM adoption, with a conservative estimate of 146,932 hallucinated citations in 2025 alone. These errors are diffusely embedded across many papers but especially pronounced in fields with rapid AI uptake, in manuscripts with linguistic signatures of AI-assisted writing, and among small and early-career author teams. At the same time, hallucinated references disproportionately assign credit to already prominent and male scholars, suggesting that LLM-generated errors may reinforce existing inequities in scientific recognition. Preprint moderation and journal publication processes capture only a fraction of these errors, suggesting that the spread of hallucinated content has outpaced existing safeguards. Together, these findings demonstrate that LLM hallucinations are infiltrating knowledge production at scale, threatening both the reliability and equity of future scientific discovery as human and AI systems draw on the existing literature.

Summary

Main Finding

LLM-generated “hallucinations” have materially infiltrated the scientific literature: using citation existence as a verifiable signal, the authors document a sharp rise in non-existent (unverifiable) references after widespread LLM adoption. Conservatively, the four corpora studied would contain ~146,932 hallucinated citations in 2025 alone. Hallucinations are diffusely embedded across many papers (often as a few bad references per paper), are concentrated in fields with rapid AI uptake, correlate with linguistic signatures of LLM use, are disproportionately produced by smaller or early‑career teams, and tend to allocate credit toward already-prominent and male scholars. Existing moderation and peer‑review processes catch only a minority of these errors, and most hallucinations persist into the published record.

Key Points

Scale and timing
- Sudden increase in unmatched (unverifiable) citations begins in 2023 and accelerates mid‑2024; steep growth through Aug 2025.
- Estimated excess unmatched-citation rate (interpreted as hallucinations), as a share of references, as of Aug 2025: arXiv 0.39%, bioRxiv 0.21%, SSRN 1.91%, PMC 0.27%.
- Monthly hallucinated-reference estimates in Aug 2025: arXiv ~3,353; bioRxiv ~478; SSRN ~767; PMC ~8,140. Extrapolated to 2025 (four corpora): ~146,932 hallucinated citations.
Distributional pattern
- Not concentrated in a few manuscripts; many papers contain a modest number of hallucinated citations (rise mainly in papers with 0–10% unmatched refs).
- Higher rates in social sciences and computer science; positive correlation (r = 0.441) between estimated LLM use and hallucination rates at the subfield level.
Who produces them
- “Hallucination citers” tend to be less-experienced authors, with far fewer prior publications (e.g., ~62% fewer prior papers in arXiv/bioRxiv; up to ~73% fewer in SSRN).
- Hallucination citers exhibit rapid productivity increases by 2025 (relative increases of ~1.3–3.1× depending on corpus).
- Hallucination rates fall sharply with team size.
Who benefits
- Hallucinated references often invent authors but, when mapped to real profiles, disproportionately credit high‑productivity and high‑impact authors (e.g., ~68.8% more publications; ~58.3% more citations) and skew toward male names (~6.4 percentage point increase).
- Valid citations appearing in papers that also contain hallucinations also skew toward prominent scholars—suggesting LLMs reallocate credit toward incumbents even when not fabricating.
Failure of safeguards
- arXiv moderation rejects manuscripts with higher hallucination rates but an estimated 78.8% of non-existent citations still pass moderation and appear on the platform.
- Tracing bioRxiv → published versions shows ~85.3% of preprint hallucinations persist into the published record.
- Hallucinations appear across all journal impact deciles: lower‑decile journals show higher rates and highest‑decile journals show lower rates, but most journals remain susceptible.

Data & Methods

Datasets (Jan 2020–Aug 2025, unless noted)
- arXiv: 1,465,145 preprints; ~44,107,529 citations (60.9% LaTeX source; remainder via GROBID from PDFs).
- bioRxiv: 261,928 preprints; 21,183,111 XML citations.
- SSRN: 421,698 preprints; 26,815,043 citations (Crossref metadata for SSRN DOIs).
- PubMed Central (PMC): 10% random sample of recent papers — 374,807 manuscripts; 19,245,787 references.
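As a quick consistency check, the per-corpus counts listed above can be summed and compared with the headline figures of 111 million references and 2.5 million papers. The sketch below uses only the numbers from this list; the variable names are illustrative.

```python
# Per-corpus counts as listed above: (papers, references).
corpora = {
    "arXiv": (1_465_145, 44_107_529),
    "bioRxiv": (261_928, 21_183_111),
    "SSRN": (421_698, 26_815_043),
    "PMC": (374_807, 19_245_787),  # 10% random sample of recent papers
}

total_papers = sum(papers for papers, _ in corpora.values())
total_refs = sum(refs for _, refs in corpora.values())

print(f"papers: {total_papers:,}")    # papers: 2,523,578
print(f"references: {total_refs:,}")  # references: 111,351,470
```

Both totals round to the headline figures (about 2.5 million papers and about 111 million references).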
Hallucination detection pipeline
- Built a local Elasticsearch index from Semantic Scholar and OpenAlex for title matching.
- Initial match rate ~95.1%. Unmatched cases manually inspected: many were non-academic items or parsing errors.
- Applied LLM-based cleaning (GPT-4o-mini) to remove non-academic or malformed references; refined title extraction to reduce unmatched rate to ~1.54%.
- Final verification step: Google Scholar API lookup to capture items outside the indexes; remaining unverifiable references labeled “unmatched.”
- Conservative focus: only non-existent titles (i.e., factual invalidity of the reference title) are counted, so the estimates err toward a lower bound.
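The title-matching idea behind the steps above can be illustrated with a minimal sketch. It is not the authors' pipeline: an in-memory list stands in for the Elasticsearch index, Python's standard-library `difflib` stands in for the search engine's scoring, and the 0.9 similarity threshold, the function names, and all titles are hypothetical placeholders chosen for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize_title(title: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", title.lower())
    return " ".join(cleaned.split())


def is_matched(title: str, index_titles: list[str], threshold: float = 0.9) -> bool:
    """Return True if any indexed title is similar enough to `title`.

    `threshold` is an illustrative cutoff, not a value taken from the paper.
    """
    query = normalize_title(title)
    return any(
        SequenceMatcher(None, query, normalize_title(candidate)).ratio() >= threshold
        for candidate in index_titles
    )


# Hypothetical toy index standing in for Semantic Scholar / OpenAlex records.
toy_index = [
    "Placeholder Title One: A Toy Record",
    "Placeholder title two (another toy record)",
]

# Hypothetical reference titles extracted from a manuscript.
references = [
    "placeholder title one - a toy record",   # same title, different punctuation
    "A title that is absent from the index",  # no counterpart in the toy index
]

unmatched = [ref for ref in references if not is_matched(ref, toy_index)]
unmatched_rate = len(unmatched) / len(references)
print(unmatched, unmatched_rate)
```

In this toy setup only the second reference is left unmatched. At the scale described above, a real index would need a search backend rather than a linear scan, and unmatched items would still go through the cleaning and secondary lookup steps before being counted.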
Identification strategy
- Use pre‑LLM unmatched citation rate as baseline for routine bibliographic/matching errors; interpret post‑LLM excess as population‑level signature of hallucination.
- A regression framework and a mixture model separate the baseline match-failure probability (q) from the LLM‑hallucination probability (p). Both control for field, time, and other confounders; robustness checks and manual validations are reported in the SI.
- LLM‑use intensity inferred via linguistic signatures in abstracts (standard methods) and correlated with hallucination incidence.
- Author linking/disambiguation: name extraction + text‑based paper embeddings to link hallucinated cited names to real author profiles conservatively.
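The baseline-versus-excess logic above can be made concrete with a small sketch. The exact model is not reproduced in this summary, so the form used here is an assumption: a reference is unmatched either because it is hallucinated (probability p) or, if it is genuine, because matching fails at the baseline rate (probability q), which gives an observed unmatched rate u = p + (1 - p) * q. The function name and the input rates are hypothetical.

```python
def hallucination_share(unmatched_rate: float, baseline_rate: float) -> float:
    """Solve u = p + (1 - p) * q for p, clipping negative values to zero.

    Assumed two-component mixture, not the paper's full specification:
    u is the observed unmatched rate, q the pre-LLM baseline failure rate.
    """
    if not 0.0 <= baseline_rate < 1.0:
        raise ValueError("baseline_rate must be in [0, 1)")
    if not 0.0 <= unmatched_rate <= 1.0:
        raise ValueError("unmatched_rate must be in [0, 1]")
    p = (unmatched_rate - baseline_rate) / (1.0 - baseline_rate)
    return max(p, 0.0)


# Hypothetical rates for illustration only (not figures from the paper).
q = 0.015  # pre-LLM baseline unmatched rate
u = 0.019  # post-LLM observed unmatched rate
p = hallucination_share(u, q)
print(f"implied hallucination probability: {p:.4%}")
```

When q is small, p is close to the simple difference u - q, which is why the post-LLM excess over the pre-LLM baseline can be read as a population-level hallucination rate.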

Implications for AI Economics

Knowledge‑production externalities and cumulative risk
- Even modest rates of hallucinated citations create systemic negative externalities because science is cumulative: fabricated references can mislead subsequent synthesis, meta‑analysis, automated summarizers, and downstream policy/industry decisions.
- The persistence of hallucinations into published literature means market actors (funders, firms, policy bodies) and AI systems that scrape literature will internalize and amplify false signals.
Redistribution of scientific credit and incentives
- LLMs appear to skew citation patterns toward established, high‑status authors (and toward male names), potentially reinforcing incumbency advantages in reputation, hiring, funding, and bibliometric metrics (h‑index, citation counts).
- This can alter incentives in academic labor markets and funding allocation: more citations to incumbents feed a feedback loop, concentrating attention and resources.
- At the same time, LLMs lower production costs for less‑experienced researchers, increasing output but with higher contamination—producing a tradeoff between quantity and reliability.
Market for verification and detection services
- There is likely to be growing demand (and economic opportunity) for scalable verification tools, provenance trackers, and citation‑validation services (for journals, preprint servers, institutional repositories, and AI tool vendors).
- Platforms and publishers face potential liability or reputational costs if they fail to contain hallucinations; this creates market incentives for third‑party validators and certification services.
Platform and regulatory implications
- Existing editorial and moderation systems are inadequate at scale; platform-level interventions (automated citation verification, provenance metadata for AI‑assisted writing, mandatory disclosure of tool use) could be economically justified.
- Policymakers and funders may consider standards for AI‑assisted scholarly work, reporting requirements, or audits—creating compliance costs and new governance markets.
Measurement and research evaluation distortions
- Widespread LLM influence on citation behavior undermines the reliability of citation‑based metrics used in hiring, promotion, and funding decisions; measurement models and evaluation algorithms need to be adjusted to account for AI‑driven distortions.
Heterogeneous effects across fields and actors
- Fields with rapid LLM uptake (CS, some social sciences) and smaller teams are especially exposed — implying uneven economic impacts across disciplines and institutions.
- Early‑career researchers are both disproportionately producing hallucinations and benefiting from increased productivity; policies should weigh short‑term productivity gains against longer‑term reputational and epistemic costs.
Policy and market recommendations (high‑level)
- Build and deploy automated, open, standardized citation‑verification pipelines integrated into authoring tools, preprint servers, and journal submission systems.
- Require provenance metadata and disclosure for AI assistance in manuscript preparation (enforceable by journals/funders).
- Invest in public, comprehensive bibliographic indexes and cross‑platform APIs to reduce false negatives/positives and lower verification costs.
- Fund research into incentive‑compatible verification mechanisms and into methods that reduce biased outputs from LLMs (e.g., debiasing, calibrated retrieval).
- Reassess evaluation metrics and hiring/funding criteria to guard against AI‑amplified incumbency biases.
- Create marketplaces for third‑party verification and reputation services and consider liability frameworks for AI tool providers and platforms.

Limitations to keep in mind
- Conservative focus on non‑existent titles means the estimates are likely lower bounds; other hallucination types (misattributed claims, incorrect page/volume, mischaracterized findings) are not captured.
- LLM use is inferred indirectly (linguistic signatures); causality is supported by timing and correlations but direct causal attribution to specific tools is not established at the individual manuscript level.
- Matching and disambiguation use conservative linking criteria; some real citations outside indexed sources could be misclassified despite multi‑step verification.

Summary takeaway

LLM hallucinations have begun to penetrate scientific citation networks at scale, creating measurable epistemic and distributional distortions. This imposes economic externalities across knowledge markets, incentives, evaluation systems, and platform governance, and it creates immediate demand for verification infrastructure, updated norms, and policy interventions to preserve the reliability and equity of scientific production.

Assessment

Paper Type: quasi_experimental

Evidence Strength: medium. Strengths include an unusually large, verifiable outcome (existence of cited references) across 111 million references and 2.5 million papers, plus consistent patterns across fields, author types, and AI-writing linguistic markers; weaknesses are that the design is observational (risk of confounding by concurrent trends), reliance on proxy measures for LLM assistance and for citation hallucination (automated matching may misclassify), and limited ability to definitively attribute the rise to LLMs rather than other contemporaneous changes in authorship or citation practices.

Methods Rigor: medium. The study uses large-scale automated matching and classification, validation against verifiable metadata, and multiple robustness checks and subgroups, indicating careful empirical work; however, results depend on the accuracy of the citation-matching pipeline and AI-writing classifiers, potential measurement error, and the absence of quasi-experimental instruments or exogenous shocks that would more cleanly identify causality.

Sample: Automated audit of 111 million references drawn from 2.5 million documents across arXiv, bioRxiv, SSRN and PubMed Central (preprints and open-access journal content), covering papers up to and including 2025; authors classify references as existent or non-existent via metadata matching and analyze associations with field, linguistic signals of AI assistance, team size, career stage, author gender, and publication/moderation outcomes.

Themes: innovation, inequality

Identification: Observational time-series and cross-sectional variation: the authors compare citation validity rates before and after the rapid uptake of LLMs, exploit field-level differences in AI adoption, and use linguistic classifiers (signatures of AI-assisted writing) and author/team attributes (team size, career stage) to link elevated rates of non-existent citations to likely LLM use; they report robustness checks across repositories and moderation/publication outcomes. There is no randomized assignment, so causal claims rest on temporal co-movement and multiple associational patterns rather than exogenous variation.

Generalizability
- Restricted to papers in arXiv, bioRxiv, SSRN, and PubMed Central (preprints and open-access journals) and may not generalize to closed-access journals or books.
- Likely biased toward English-language and STEM/biomedical/social-science domains represented in these repositories.
- Findings reflect early post-LLM adoption years (through 2025) and may change as tools, prompts, and safeguards evolve.
- Relies on automated citation-matching and AI-writing classifiers that may misclassify some references or authorship practices.
- Cannot necessarily be generalized to other forms of hallucination (e.g., factual claims, figures) beyond bibliographic references.

Claims (9)

Claim	Category	Direction	Confidence	Outcome	Details
We audited 111 million references across 2.5 million papers in arXiv, bioRxiv, SSRN, and PubMed Central.	Other	null_result	high	number of references audited / dataset coverage	n=111000000 111 million references across 2.5 million papers 0.8
We find a sharp rise in non-existent references following widespread LLM adoption.	Output Quality	positive	high	prevalence of non-existent (hallucinated) references over time	n=111000000 0.48
We provide a conservative estimate of 146,932 hallucinated citations in 2025 alone.	Output Quality	positive	high	count of hallucinated citations in 2025	n=111000000 146,932 hallucinated citations in 2025 alone 0.8
These errors are diffusely embedded across many papers but especially pronounced in fields with rapid AI uptake.	Output Quality	positive	high	rate/prevalence of hallucinated references by research field	n=2500000 0.48
Hallucinated references are especially pronounced in manuscripts with linguistic signatures of AI-assisted writing.	Output Quality	positive	high	association between AI-writing linguistic signatures and presence of hallucinated references	n=2500000 0.48
Hallucinated references are especially pronounced among small and early-career author teams.	Research Productivity	positive	high	rate of hallucinated references by team size and author career stage	n=2500000 0.48
Hallucinated references disproportionately assign credit to already prominent and male scholars, suggesting LLM-generated errors may reinforce existing inequities in scientific recognition.	Inequality	positive	high	distribution of (hallucinated) citation credit by cited-author prominence and gender	n=111000000 0.48
Preprint moderation and journal publication processes capture only a fraction of these errors.	Governance And Regulation	negative	high	fraction of hallucinated references detected/removed by moderation and publication processes	n=2500000 0.48
LLM hallucinations are infiltrating knowledge production at scale, threatening both the reliability and equity of future scientific discovery as human and AI systems draw on the existing literature.	Output Quality	negative	high	risk to reliability and equity of scientific discovery (qualitative assessment)	0.08