The Commonplace

Large language models generate plausible but cautious research ideas that echo authors' thinking and rarely suggest null hypotheses, scientists find; automated judges misalign with experts, but a reward model trained on human ratings meaningfully narrows the gap.

Contemporary AI lacks the imagination to diverge or negate in science
Honglin Bao, Siyang Wu, Xiao Liu, Sida Li, Shiyun Cao, James A. Evans · June 06, 2026
Source: arXiv · Paper type: descriptive · Evidence strength: medium · Relevance: 7/10
In the largest scientist-in-the-loop test to date, LLMs produce plausible but conservative research ideas that often converge on similar suggestions and fail to propose null hypotheses, are judged less useful in pluralistic fields like social sciences, and are weakly evaluated by automated judges—though a human-trained reward model significantly improves alignment with expert taste.

Bold projections that artificial intelligence will accelerate scientific discovery have raced ahead of evidence from working scientists, and the field still lacks large-scale, scientist-in-the-loop tests of these claims. Here we mount the largest such evaluation to date and map what AI cannot yet do for science. We invited authors of 121,640 recent preprints across biology, medicine, chemistry, and the social sciences to judge follow-up ideas that large language models (LLMs) generated from the context and puzzles of their own papers. 6,749 scientists returned 25,139 sets of ratings on novelty, empirical feasibility, probability of being true, and favorability of adoption. Three patterns emerge. First, non-reasoning LLMs collapse into a narrow "hivemind" of similar ideas; reasoning models roam a wider hypothesis space, yet no model class spontaneously proposes null hypotheses -- a move humans make more freely. Second, scientists reward ideas that resemble their own and prize probability over novelty, though social scientists tolerate risk more readily than life scientists. Senior social scientists are the harshest critics, and their skepticism is well-earned: LLMs falter most in pluralistic fields like the social sciences that demand context-aware interpretation and evolving theories. Third, automated evaluators on which the community currently relies -- LLM-as-a-judge, artificial metrics, and even state-of-the-art (SOTA) models -- agree weakly with expert judgment, and retrieval augmentation and scientist persona prompting yield only marginal gains. A Qwen3-14B reward model we post-trained on human ratings captures field taste nuances, beats SOTA models by up to 27%, and closes the gap to the inter-rater consistency of independent peer reviewers. For all the hype, today's scientific AI still represents a collaborator whose imagination, outputs and judgment benefit from human grounding.

Summary

Main Finding

Large-scale, scientist-in-the-loop evaluation shows contemporary LLMs expand ideation but lack scientific imagination in key ways: non-reasoning models collapse into a narrow “hivemind,” reasoning models generate more diverse hypotheses but rarely propose null hypotheses, and automated evaluators (LLM-as-judge and standard metrics) correlate weakly with expert judgment. Carefully post-trained reward models (Qwen3-14B) can recover much of expert taste and approach peer-review agreement, but the results overall imply AI is a useful collaborator rather than an autonomous discovery engine.

Key Points

  • Scope and scale
    • 121,640 non-arXiv preprints (post-2023) across biology, medicine, chemistry, and social sciences assembled.
    • 6,749 authors responded, returning 25,139 sets of ratings on AI-generated hypotheses.
  • What models do
    • 26 LLMs tested: 19 non-reasoning chat models, 5 reasoning models, 2 agentic systems.
    • Non-reasoning LLMs produce highly similar ideas (high within-group cosine similarity → “hivemind”).
    • Reasoning models produce more diverse hypotheses (larger spread in idea space) but still rarely generate null hypotheses (explicit claims of no effect).
  • Expert judgments and biases
    • Authors rated each AI idea on novelty, empirical feasibility, probability of being true, and adoption favorability.
    • Scientists prefer ideas similar to their own work; similarity increases perceived feasibility and probability but reduces perceived novelty.
    • Probability (likelihood of being true) dominates adoption decisions; novelty and feasibility matter less — scientists are risk-averse, especially in life sciences.
    • Senior researchers are more skeptical and less willing to adopt AI ideas; social sciences show lower AI idea quality ratings (novelty/feasibility/probability) than other domains.
  • Automated evaluators
    • Off-the-shelf LLM evaluators and existing automated metrics correlate poorly with expert ratings (no model exceeded r = 0.35 on dimensions tested).
    • Reward models from public benchmarks performed near chance on pairwise preference tests.
  • Reward model success
    • A Qwen3-14B reward model was post-trained on human labels (Bradley–Terry objective) and substantially outperformed SOTA baselines.
    • Held-out test pairwise accuracies (examples): biology novelty 69%, feasibility 62%, probability 67%; social-science novelty 64%, feasibility 62%, probability 67%.
    • The best external baselines scored near chance (≈50–55% pairwise accuracy); the Qwen3 reward models improved on them by up to 27% (about +14 percentage points on average).
    • The Qwen3 reward models exceeded an empirical peer-review agreement baseline (≈61% agreement among reviewers).
  • Structural limitations
    • LLMs reflect training-data selection bias: published literature underreports nulls and failed experiments; models therefore undergenerate null hypotheses.
    • Reasoning helps but cannot recover absent negative knowledge that never entered corpora.
    • Models are optimized to predict (reproduce distribution), not to exhibit curiosity (seek anomalies or nulls).
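
The "hivemind" point above rests on within-group cosine similarity between idea embeddings. The following is a minimal sketch of that kind of measure, not the authors' pipeline: the function names and the toy 2-D vectors are hypothetical, and a real analysis would use sentence embeddings of the generated hypotheses.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all distinct pairs in a group."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Toy "idea embeddings" (hypothetical values, not data from the study).
clustered = [(1.0, 0.0), (1.0, 0.1), (1.0, -0.1)]  # ideas pointing the same way
spread = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]     # ideas pointing apart

hivemind_score = mean_pairwise_cosine(clustered)
diverse_score = mean_pairwise_cosine(spread)
print(hivemind_score > diverse_score)
```

A group whose score sits close to 1 is producing near-interchangeable ideas; a lower score indicates a wider spread in idea space.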

Data & Methods

  • Corpus
    • 121,640 empirical preprints from BioRxiv (68%), MedRxiv (20%), SocArXiv/PsyArXiv/EdArXiv (9%), ChemRxiv (3%); arXiv intentionally excluded to avoid training-corpus leakage.
  • Hypothesis extraction & leakage control
    • Used reasoning model o3-mini to extract paper context, puzzles, and author hypotheses.
    • Paraphrase-based detector used: removed any context/puzzle sentences whose MPNet embedding similarity to the human hypothesis exceeded 0.82 (calibrated on 20k controlled paraphrases). Author confirmation rates: context/puzzle 99.7%, hypotheses 98.6%.
  • Model panel & prompting
    • 26 LLMs from eight providers; each paper’s authors received the five most semantically distinguishable AI-generated hypotheses (balancing reasoning and non-reasoning outputs).
  • Human evaluation
    • Authors rated each hypothesis on novelty, empirical feasibility, probability of being true, and adoption favorability after comprehension checks. Total valid four-dimension evaluations: 25,139.
  • Analyses & classifiers
    • Idea-space analysis: embeddings, pairwise cosine similarity, t-SNE/UMAP visualizations; reasoning models occupy broader idea space.
    • Null-hypothesis classifier: ensemble (SVM, logistic regression, random forest, gradient boosting), trained and cross-validated (99.5% accuracy on null detection).
    • Adoption modeled via Mundlak correlated random-effects to separate within- and between-author effects (controls for seniority, field, prior AI use).
  • Automated evaluator benchmarking
    • Evaluated LLM-as-judge, n-gram novelty, cross-entropy/perplexity, semantic distance, entailment, GraphEval-GNN, and public reward models (RewardBench) on within-scientist pairwise preference tests (5k held-out pairs).
    • Qwen3-14B post-trained under Bradley–Terry objective with margin; field-specific and general models evaluated.
  • Peer-review baseline
    • Derived from 26,731 OpenReview submissions across 46 conferences (2017–2025), yielding a non-overlapping reviewer agreement baseline ≈61.0%.
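
The reward-model bullets above mention a Bradley–Terry objective with margin and pairwise preference accuracy. As a rough illustration only, the sketch below assumes the common form loss = -log sigmoid(r_preferred - r_other - margin); the summary does not give the paper's exact loss or margin schedule, and the function names and scores here are hypothetical.

```python
import math

def bt_margin_loss(r_preferred, r_other, margin=0.0):
    """Assumed Bradley-Terry loss: -log sigmoid(r_preferred - r_other - margin)."""
    z = r_preferred - r_other - margin
    # -log(sigmoid(z)) equals log(1 + exp(-z)); branch so exp() cannot overflow.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

def pairwise_accuracy(scored_pairs):
    """Share of pairs where the expert-preferred idea received the higher score."""
    correct = sum(1 for preferred, other in scored_pairs if preferred > other)
    return correct / len(scored_pairs)

# Hypothetical reward scores as (preferred, other); not data from the study.
pairs = [(2.0, 0.5), (1.0, 1.4), (0.3, -0.2), (0.9, 0.1)]

accuracy = pairwise_accuracy(pairs)
loss_equal = bt_margin_loss(0.0, 0.0)
print(accuracy, loss_equal)
```

Under this assumed form, a positive margin raises the loss unless the preferred idea outscores the other by more than the margin, which pushes the model to separate pairs that the experts rated far apart.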

Implications for AI Economics

  • Impact on productivity estimates and ROI
    • Macro and microeconomic models that forecast AI-driven acceleration in scientific output should adjust expectations: current LLMs expand search and ideation but do not autonomously generate high-impact, field-defining breakthroughs.
    • Productivity gains will likely be realized through augmentation (time savings, ideation support), not full automation of hypothesis generation and validation. Valuation of AI tools should account for required human grounding and curation costs.
  • Heterogeneous adoption and labor effects
    • Adoption is field- and seniority-dependent. Life sciences and medicine may adopt AI ideas more readily, especially safe ideas with a high probability of being true, while pluralistic fields such as the social sciences find the models less useful.
    • Senior researchers’ skepticism and gatekeeping matter: organizational adoption of AI workflows depends on norms among established scientists; diffusion models must incorporate social-network and status effects.
  • Value of human feedback & specialized training data
    • The big performance gap between off-the-shelf evaluators and reward models trained on expert labels shows large returns to investing in human-in-the-loop labeling and domain-specific preference data.
    • Economically, firms and labs that internalize or purchase high-quality, domain-specific evaluation models (trained on expert judgments) gain a comparative advantage.
  • Importance of data that economics often ignores
    • LLMs underrepresent nulls and failed experiments because those outcomes are seldom published. The economic value of infrastructure that captures negative results (registered reports, replication archives, lab notebooks, failed experiment databases) is high: incorporating such data into training corpora can materially change model outputs and raise social returns to R&D.
    • Funding and policies that subsidize the recording and sharing of negative results increase the informativeness of science-focused AI and reduce systemic bias towards positive findings.
  • Product and policy design
    • AI tools for research should be marketed and priced as augmentation platforms requiring expert oversight — business models should include expert-in-the-loop services (curation, validation).
    • Regulators and funders should be cautious about claims that AI alone will accelerate discovery; evaluation frameworks must require scientist-in-the-loop validation and domain-specific benchmarks.
    • Incentives to build “curiosity-driven” AI systems (architectures or objectives that prize novelty, falsification, and active data acquisition) represent a promising R&D direction with potentially high social value.
  • Market differentiation and public goods
    • There is room for niche, field-specific reward models and evaluation services; market segmentation is likely (general-purpose LLMs vs. domain-specific provably aligned evaluators).
    • Public investment in shared evaluation datasets of expert judgments could reduce duplication, lower barriers for smaller labs, and create spillovers across sectors.
  • Caution for economic modeling of AI externalities
    • Automated evaluators currently misestimate scientific quality; relying on them in funding allocation, hiring, or performance metrics risks reinforcing biases and rewarding surface fluency over substantive contributions. Economic models of AI externalities should incorporate evaluator reliability and misalignment risks.

Overall takeaway for AI economics: current LLMs materially change the inputs to scientific production (scale and variety of candidate ideas) but do not substitute for the expert social and epistemic processes that select, refine, and adopt valuable scientific contributions. Economic analyses, investment strategies, and policy should prioritize human-AI complementarities, domain-specific evaluation capacity, and infrastructure for negative-result data to unlock fuller AI-driven gains.

Assessment

Paper Type: descriptive

Evidence Strength: medium. Large-scale, domain-expert judgments (6,749 scientists, 25,139 rating sets) provide strong descriptive evidence about current LLM outputs and how experts perceive them, but results are observational, rely on subjective ratings, and are subject to selection and contextual limits that reduce causal interpretability and external validity.

Methods Rigor: high. The study uses a very large sampling frame (authors of 121,640 preprints were invited), multi-dimensional expert ratings, systematic comparisons across model classes and augmentations, and development/validation of a post-trained reward model; however, potential response bias, subjective metrics, and limited randomization temper claims.

Sample: Authors of 121,640 recent preprints in biology, medicine, chemistry, and the social sciences were invited; 6,749 scientists returned 25,139 sets of ratings evaluating LLM-generated follow-up ideas on novelty, empirical feasibility, probability of being true, and favorability of adoption; multiple LLM families (non-reasoning and reasoning), retrieval-augmented versions, persona prompts, and a Qwen3-14B reward model trained on human ratings were compared.

Themes: human_ai_collab, innovation

Identification: Comparative evaluation. Model-generated follow-up ideas were judged by domain experts and compared across model classes and augmentation strategies; no exogenous variation or randomized assignment is reported to support causal inference.

Generalizability

  • Respondent self-selection: respondents are authors who chose to participate and may not represent all active scientists.
  • Preprint-based sampling: ideas generated from preprints may differ from ideation in grant proposals, lab meetings, or long-term research programs.
  • Field heterogeneity: findings differ across disciplines (life sciences vs social sciences), limiting cross-field generalization.
  • Temporal/model drift: results reflect models available at study time and may not hold as LLMs evolve.
  • Subjective ratings: reliance on individual expert judgments may not map to actual downstream research outcomes or productivity.

Claims (12)

  • We invited authors of 121,640 recent preprints across biology, medicine, chemistry, and the social sciences to judge follow-up ideas that large language models (LLMs) generated from the context and puzzles of their own papers.
    • Other · Direction: null_result · Confidence: high
    • Outcome: number of invited authors (study recruitment)
    • Details: n=121640; 0.3
  • 6,749 scientists returned 25,139 sets of ratings on novelty, empirical feasibility, probability of being true, and favorability of adoption.
    • Research Productivity · Direction: null_result · Confidence: high
    • Outcome: number of respondents and rating sets
    • Details: n=6749; 25,139 sets of ratings; 0.3
  • Non-reasoning LLMs collapse into a narrow 'hivemind' of similar ideas.
    • Creativity · Direction: negative · Confidence: high
    • Outcome: diversity / similarity of generated ideas (creativity)
    • Details: 0.18
  • Reasoning models roam a wider hypothesis space, yet no model class spontaneously proposes null hypotheses -- a move humans make more freely.
    • Creativity · Direction: mixed · Confidence: high
    • Outcome: breadth of hypothesis space and frequency of null-hypothesis proposals
    • Details: 0.18
  • Scientists reward ideas that resemble their own and prize probability over novelty.
    • Research Productivity · Direction: positive · Confidence: high
    • Outcome: rating scores for similarity, probability, and novelty
    • Details: 0.18
  • Social scientists tolerate risk more readily than life scientists.
    • Research Productivity · Direction: positive · Confidence: high
    • Outcome: tolerance for risk in idea ratings (preference for novelty vs probability)
    • Details: 0.18
  • Senior social scientists are the harshest critics, and their skepticism is well-earned.
    • Research Productivity · Direction: negative · Confidence: high
    • Outcome: stringency of ratings by seniority and field
    • Details: 0.18
  • LLMs falter most in pluralistic fields like the social sciences that demand context-aware interpretation and evolving theories.
    • Research Productivity · Direction: negative · Confidence: high
    • Outcome: LLM performance by field (agreement with expert judgment / idea quality)
    • Details: 0.18
  • Automated evaluators on which the community currently relies -- LLM-as-a-judge, artificial metrics, and even state-of-the-art (SOTA) models -- agree weakly with expert judgment.
    • Decision Quality · Direction: negative · Confidence: high
    • Outcome: agreement/correlation between automated evaluators and expert human judgment
    • Details: 0.18
  • Retrieval augmentation and scientist persona prompting yield only marginal gains.
    • Research Productivity · Direction: null_result · Confidence: high
    • Outcome: change in judged quality due to retrieval augmentation or persona prompting
    • Details: 0.18
  • A Qwen3-14B reward model we post-trained on human ratings captures field taste nuances, beats SOTA models by up to 27%, and closes the gap to the inter-rater consistency of independent peer reviewers.
    • Decision Quality · Direction: positive · Confidence: high
    • Outcome: automated evaluator performance (alignment with human taste / agreement with reviewers)
    • Details: up to 27%; 0.18
  • For all the hype, today's scientific AI still represents a collaborator whose imagination, outputs and judgment benefit from human grounding.
    • Research Productivity · Direction: mixed · Confidence: high
    • Outcome: overall utility of AI as scientific collaborator (need for human grounding)
    • Details: 0.18
