AI peer feedback nudges scientists to improve their papers: in a randomized experiment on over 31,000 arXiv preprints, LLM-generated critiques raised revision rates by about 12.6% relative to baseline and increased later use of LLM tools. The benefits were concentrated among authors in non-English-dominant regions, earlier-career teams, and manuscripts less embedded in the scholarly literature, suggesting AI can widen access to timely scientific critique.

Human-AI Collaboration in Science at Scale: A Global Large-scale Randomized Field Experiment

Binglu Wang, Weixin Liang, Jiahui Xue, Yuhui Zhang, Hancheng Cao, Dashun Wang, Yian Yin · May 22, 2026

Tags: arxiv · rct · high evidence · 9/10 relevance

A randomized field experiment sending LLM-generated, customized feedback to over 31,000 arXiv preprints increased the probability of authors revising their manuscripts by 12.55% (relative) and raised subsequent LLM use, with largest gains for non-English-region authors, earlier-career teams, and less-embedded manuscripts.

Collaboration is the defining mode of modern science, yet its core mechanism -- feedback -- remains hard to observe, difficult to scale, and unequally distributed. Here we test whether large language models (LLMs) can contribute to this hidden but vital practice and reallocate scientific feedback, an essential yet scarce resource for knowledge production. In a global large-scale randomized field experiment, we delivered customized LLM-generated feedback for over 31,000 arXiv preprints across 150 fields and more than 45,000 researchers from 133 geographic regions. Relative to controls, authors who received feedback had a significantly higher likelihood of revising their manuscripts, corresponding to a 12.55% relative increase over the baseline revision rate. Exposure to AI feedback also increased authors' subsequent use of LLM tools in their future papers, suggesting longer-run shifts in scientific practice. These effects were strongest among authors from non-English-dominant research regions, manuscripts less embedded in the scholarly literature, and teams with lower h-indexes and earlier career stages, consistent with the idea that AI feedback may provide the greatest benefit where access to timely critique is otherwise limited. Together, these findings provide causal evidence that structured AI-based interventions can transform access to scientific feedback from a largely private advantage into a more widely distributed resource, with broader implications for productivity, equity, and capacity across the global research system.

Summary

Main Finding

A global, large-scale randomized field experiment shows that delivering customized LLM-generated feedback to authors of arXiv preprints causally increases short-term revision activity and raises subsequent adoption of LLM tools. Effects are modest in absolute terms but meaningful at scale and concentrated among authors and manuscripts with less access to traditional feedback (non–English-dominant regions, low scholarly embeddedness, lower h-index, earlier career stage).

Key Points

Experiment scope and scale
- Sample: preprints first posted on arXiv Jan–Jun 2024. Initial cohort: 34,340 manuscripts; analysis sample (papers that had not already been revised before feedback delivery): 31,020 manuscripts (treatment N = 15,542; control N = 15,478).
- Contacted authors: 45,466 across 133 geographic regions and 150 fields.
- Treatment: email with private link to a webpage containing customized LLM-generated feedback produced under a fixed multi-agent protocol (review analysis, alternative titles, grammar checks, in-context tailoring). Controls received no contact.
- Feedback delivery: June and August 2024 (two cohorts).
Primary outcomes
- Short-term revisions: number of updated arXiv versions within one month after feedback delivery.
- Long-term adoption: subsequent use of LLMs in authors’ papers over 12 months, measured via a published AI-detection model that assigns an “α” score estimating LLM-generated fraction.
Main quantitative results
- Receiving AI feedback produced an average increase of 0.005 revisions within one month (intent-to-treat), a 12.55% relative increase over baseline revision rates (p < 0.05).
- Treated authors with minimal prior LLM use showed higher subsequent LLM adoption intensity (a ~5.3% relative increase over controls within 12 months; p < 0.05).
Heterogeneity (effects concentrated among disadvantaged/less-embedded groups)
- Non–English-dominant institutional affiliation: revisions +0.007 (≈19.9% relative increase); LLM adoption +8.20% (p < 0.01).
- Lower scholarly embeddedness (fewer references): revisions +0.007 (≈26.4% relative increase); LLM adoption +6.34% (p < 0.05).
- Lower h-index: revisions ≈20.2% relative increase; LLM adoption ≈6.98%.
- Earlier career age: revisions ≈18.9% increase; adoption ≈6.16%.
- For higher-category/advantaged groups (English-dominant, high embeddedness, high h-index, later-career) effects were small and not statistically distinguishable from zero.
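The absolute and relative effects reported above jointly imply a control-group mean for the revision outcome, since relative effect = absolute effect / control mean. The sketch below back-computes that mean for the three cases where the summary gives both numbers. The function name `implied_baseline` and the dictionary keys are illustrative, not from the paper, and because the absolute effects are rounded to three decimals the results are only approximate.

```python
def implied_baseline(abs_effect, rel_effect):
    """Control-group mean implied by an absolute and a relative effect.

    relative effect = absolute effect / control mean, so
    control mean    = absolute effect / relative effect.
    """
    return abs_effect / rel_effect


# (absolute effect in revisions, relative effect as a fraction),
# copied from the summary above; keys are illustrative labels.
reported = {
    "overall": (0.005, 0.1255),
    "non_english_dominant": (0.007, 0.199),
    "low_embeddedness": (0.007, 0.264),
}

baselines = {name: implied_baseline(a, r) for name, (a, r) in reported.items()}

for name, value in baselines.items():
    # Inputs are rounded, so treat these as rough magnitudes only.
    print(f"{name}: implied control mean ~ {value:.4f}")
```

All three implied means fall in the range of a few revisions per hundred papers per month, which is why the summary describes the effects as modest in absolute terms but sizeable in relative terms.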
Nature of revisions
- Computational comparison of original vs revised manuscripts indicates increases in substantive conceptual edits (e.g., ethics, novelty) rather than only surface-level or formatting changes.
Robustness checks & caveats
- Effects persist across alternative post-treatment windows (1–2 months for revisions; 6–12 months for adoption) and alternative detection thresholds.
- Results are intent-to-treat (ITT) and thus conservative; they include authors who may not have opened or engaged with the feedback.
- No placebo-email arm (control is no-contact), so attentional vs content mechanisms cannot be fully disentangled; content analyses and sustained adoption patterns argue that content mattered.
- LLM-detection models are noisy and may have differential accuracy (e.g., for non-native English), though randomization should balance such biases.

Data & Methods

Data
- Source: arXiv preprints posted Jan–Jun 2024; version histories and metadata; authors’ email contacts and affiliations.
- Outcome data: arXiv version updates (timestamped), full-text diffs for content-classification, subsequent arXiv submissions within 12 months for LLM-detection on abstracts.
Randomization and design
- Paper-level randomization stratified by 150 fields; risk-set restricted to papers that remained as first versions at time of feedback delivery (to avoid post-treatment contamination).
- Treatment delivery: single light-touch intervention (email → private webpage with tailored LLM feedback).
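A minimal sketch of the design described above, assuming a simple 50/50 split within each field: first restrict the risk set to papers still at their first version, then shuffle and split within each stratum. The paper does not describe its randomization code, so `stratified_assign`, the tuple layout, and the example records are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_assign(papers, seed=0):
    """Assign papers to arms within each field (stratum).

    papers: iterable of (paper_id, field, n_versions) tuples (hypothetical layout).
    Returns a dict mapping paper_id -> "treatment" or "control".
    """
    rng = random.Random(seed)

    # Risk-set restriction: keep only papers still at their first version.
    by_field = defaultdict(list)
    for paper_id, field, n_versions in papers:
        if n_versions == 1:
            by_field[field].append(paper_id)

    assignment = {}
    for field in sorted(by_field):  # sorted for reproducibility
        ids = by_field[field]
        rng.shuffle(ids)
        n_treat = len(ids) // 2
        # In odd-sized strata, the extra paper goes to a random arm.
        if len(ids) % 2 == 1 and rng.random() < 0.5:
            n_treat += 1
        for i, paper_id in enumerate(ids):
            assignment[paper_id] = "treatment" if i < n_treat else "control"
    return assignment


# Illustrative records only; not data from the study.
papers = [
    ("p1", "cs.LG", 1), ("p2", "cs.LG", 1), ("p3", "cs.LG", 1),
    ("p4", "hep-th", 1), ("p5", "hep-th", 2), ("p6", "hep-th", 1),
]
assignment = stratified_assign(papers, seed=7)
```

Stratifying by field keeps the arms balanced within each discipline, so field-level differences in revision habits cannot be mistaken for a treatment effect.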
Feedback generation
- A state-of-the-art LLM framework with a multi-agent pipeline: review analysis, alternative titles, grammar checks, and in-context learning to match paper content; feedback examples and prompts detailed in supplemental materials.
Analysis
- Primary specification: OLS intent-to-treat regressions (Revision_i = β0 + β1 × Treatment_i + ε_i), and analogous author-level models for adoption.
- Heterogeneity assessed across four moderators: institutional language environment (English-dominant vs non–English-dominant), scholarly embeddedness (number of references cited in the initial version), career age, and h-index.
- Supplementary pipelines: automated diffing to classify edits (substantive vs surface-level); use of a published AI detection model for adoption measurement.
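With a single binary regressor, the OLS slope β1 in the primary specification equals the difference in mean outcomes between treated and control papers, and β0 equals the control mean. The sketch below checks this on simulated data. The sample size, the 0.04 baseline, and the 0.005 effect are illustrative assumptions, not the study's data, and `ols_simple` is a hypothetical helper rather than the authors' estimation code.

```python
import random


def ols_simple(x, y):
    """Closed-form OLS for y = b0 + b1 * x + e with one regressor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Simulated data under assumed parameters (not from the paper).
rng = random.Random(42)
n = 20000
treatment = [i % 2 for i in range(n)]  # alternating 0/1 assignment
revision = [1 if rng.random() < 0.04 + 0.005 * t else 0 for t in treatment]

b0, b1 = ols_simple(treatment, revision)

treated = [y for t, y in zip(treatment, revision) if t == 1]
control = [y for t, y in zip(treatment, revision) if t == 0]
mean_treated = sum(treated) / len(treated)
mean_control = sum(control) / len(control)

# The ITT estimate is the raw difference in means between arms.
print(f"b1 = {b1:.5f}, difference in means = {mean_treated - mean_control:.5f}")
```

Because assignment is random, this difference in means is an unbiased estimate of the effect of being offered feedback, regardless of whether authors opened the email.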
Ethics & approvals
- IRB approval from Northwestern University (STU00220102); feedback links were encrypted and access-restricted for privacy.

Implications for AI Economics

Reallocating scarce feedback as an economic resource
- Feedback is an essential, unevenly distributed input in knowledge production. Structured LLM interventions can act as a scalable mechanism to reallocate this scarce resource, increasing productive activity especially where market or network provision of critique is weak.
Productivity and complementarities
- The intervention produced modest absolute gains but meaningful relative increases at scale. This suggests LLMs can complement human effort by raising the marginal productivity of authors who previously lacked timely critique, potentially increasing research throughput and quality (conditional on quality of revisions).
Diffusion and adoption dynamics
- A single exposure both changed immediate behavior and increased later tool adoption—evidence that supply-side interventions (providing AI feedback) can accelerate demand/adoption, with multiplier effects on technology diffusion in research communities.
Distributional effects and equity
- Benefits concentrated among less advantaged researchers (non–English regions, early-career, low h-index, low embeddedness) imply LLM deployments could reduce access inequalities in scientific feedback, affecting the distribution of future scientific output and human-capital accumulation.
Market structure and service provision
- The results point to potential markets for centralized AI-mediated review/feedback services (institutions, funders, publishing platforms) and to strategic complementarities between such services and human peer-review/training. Pricing, bundling, and regulation of such services will matter for equitable access.
Policy and investment considerations
- Public or philanthropic investment in AI feedback infrastructure could yield high social returns by amplifying under-served researchers’ productivity. Regulators and institutions should consider standards for transparency, quality control, and validation to ensure credibility of AI feedback in scientific contexts.
Risks and research gaps
- Credibility, error-risk, and potential dependency: LLM feedback could propagate errors or foster over-reliance if not paired with human oversight. Economic analyses should study longer-run quality outcomes (publication success, citation impact), potential displacement or upskilling of intermediary human reviewers, and incentive effects on collaborative networks.
Priority directions for economic research
- Cost-effectiveness studies: unit costs of automated feedback vs gains in output/quality.
- Long-run impact on career trajectories, institution-level productivity, and global inequality in science.
- Market design for AI-assisted scientific services, including reputation mechanisms, certification, and integration with peer review.
- Welfare analysis weighing gains in access and productivity against risks of misinformation, gaming, or erosion of norms.

Short summary: A randomized field deployment of tailored LLM feedback to >31,000 arXiv preprints causally increased revision activity (12.6% relative) and modestly raised later LLM adoption (≈5% relative), with effects concentrated among authors lacking conventional access to critique—indicating LLMs can reallocate feedback and potentially reduce barriers in the global research system, with important economic implications for productivity, equity, and markets for AI-mediated services.

Assessment

Paper Type: rct
Evidence Strength: high. Large-scale, real-world randomized field experiment (31,000+ preprints, >45,000 authors) with objective behavioral outcomes (manuscript revision recorded on arXiv) and heterogeneous treatment effects that align with theory; randomization provides credible causal identification. Remaining caveats (see methods/generalizability) are secondary to the strong design.
Methods Rigor: high. Well-powered RCT with preprint-level outcomes, broad coverage across fields and geographies, and exploration of heterogeneous effects; measurement uses observable platform behavior rather than self-report. Potential methodological weaknesses include reliance on revision occurrence rather than independent quality assessment, possible spillovers/contamination, and sensitivity to the particular LLM version and delivery protocol.
Sample: 31,000+ arXiv preprints spanning ~150 fields and authors from 133 geographic regions (over 45,000 researchers); treatment delivered at the manuscript level as customized LLM-generated feedback; outcomes include probability of revising the manuscript and subsequent use of LLM tools in later papers.
Themes: productivity, human_ai_collab, adoption, inequality
Identification: Randomized controlled field experiment: arXiv preprints (unit of randomization) were randomly assigned to receive customized LLM-generated feedback or to a control condition; causal effects are identified by random assignment and measured via objective downstream behaviors (manuscript revisions and later LLM use).
Generalizability
- Sample limited to arXiv preprints and fields that use arXiv heavily (e.g., physics, computer science); may not generalize to disciplines that do not use preprints.
- arXiv authors may differ from broader researcher populations (more open, technical, or English-proficient), limiting external validity to industry researchers or closed-submission contexts.
- Measured outcome is revision incidence on arXiv, not an independent assessment of revision quality, publication success, or long-term scientific impact.
- Effect sizes may depend on the specific LLM model/version and feedback design used; results may change as LLMs evolve.
- Potential cultural or institutional differences in receptiveness to unsolicited feedback could alter effects in other settings.

Claims (9)

Claim	Category	Direction	Confidence	Outcome	Details
We conducted a global large-scale randomized field experiment, delivering customized LLM-generated feedback for over 31,000 arXiv preprints across 150 fields and more than 45,000 researchers from 133 geographic regions.	Other	null_result	high	n/a (description of experimental sample and coverage)	n=31000 1.0
Authors who received LLM-generated feedback had a significantly higher likelihood of revising their manuscripts, corresponding to a 12.55% relative increase over the baseline revision rate.	Research Productivity	positive	high	likelihood (probability) of revising manuscripts	n=31000 12.55% relative increase over the baseline revision rate 1.0
Exposure to AI feedback increased authors' subsequent use of LLM tools in their future papers, suggesting longer-run shifts in scientific practice.	Adoption Rate	positive	high	subsequent use of LLM tools in future papers	n=45000 1.0
Effects of AI feedback were strongest among authors from non-English-dominant research regions.	Research Productivity	positive	high	treatment effect on revision likelihood (or other measured outcomes) by region	0.6
Effects were strongest for manuscripts less embedded in the scholarly literature.	Research Productivity	positive	high	treatment effect (e.g., revision likelihood) by degree of manuscript embeddedness in literature	0.6
Effects were strongest among teams with lower h-indexes and earlier career stages.	Research Productivity	positive	high	treatment effect (e.g., revision likelihood) by team h-index and author career stage	0.6
Structured AI-based interventions provide causal evidence that they can transform access to scientific feedback from a largely private advantage into a more widely distributed resource.	Research Productivity	positive	high	access and distribution of scientific feedback (measured via treated authors' behaviors and uptake)	n=31000 0.6
AI feedback may provide the greatest benefit where access to timely critique is otherwise limited (implied by stronger effects in non-English regions, less-embedded manuscripts, lower-h-index teams, and earlier career stages).	Inequality	positive	medium	relative benefit of AI feedback across contexts (inferred from heterogeneous effects)	0.06
These findings have broader implications for productivity, equity, and capacity across the global research system.	Governance And Regulation	mixed	high	productivity, equity, and system capacity (broad policy/interpretive outcome)	n=31000 0.1