Large language models struggle to predict experimental outcomes reliably: they reach only 14–26% accuracy—comparable to human experts—but cannot tell when their predictions are trustworthy; human experts, by contrast, are well calibrated and markedly better at identifying which outcomes can be predicted without physical tests.

SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences?

Udari Madhushani Sehwag, Elaine Lau, Haniyeh Ehsani Oskouie, Shayan Shabihi, Erich Liang, Andrea Toledo, Guillermo Mangialardi, Sergio Fonrouge, Ed-Yeremai Hernandez Cardona, Paula Vergara, Utkarsh Tyagi, Chen Bo Calvin Zhang, Pavi Bhatter, Nicholas Johnson, Furong Huang, Ernesto Gabriel Hernandez Montoya, Bing Liu · April 12, 2026

arXiv · Paper type: descriptive · Evidence strength: medium · Relevance: 7/10

On a 405-task benchmark of experimental outcomes across physics, biology, and chemistry, LLMs achieve 14–26% accuracy (some models exceed the ~20% human expert baseline) but their confidence is not calibrated, whereas human experts show strong calibration, with accuracy rising from ~5% to ~80% as they judge outcomes more predictable without experiments.

Accelerating scientific discovery requires the identification of which experiments would yield the best outcomes before committing resources to costly physical validation. While existing benchmarks evaluate LLMs on scientific knowledge and reasoning, their ability to predict experimental outcomes, a task where AI could significantly exceed human capabilities, remains largely underexplored. We introduce SciPredict, a benchmark comprising 405 tasks derived from recent empirical studies in 33 specialized sub-fields of physics, biology, and chemistry. SciPredict addresses two critical questions: (a) can LLMs predict the outcome of scientific experiments with sufficient accuracy? and (b) can such predictions be reliably used in the scientific research process? Evaluations reveal fundamental limitations on both fronts. Model accuracies are 14-26% and human expert performance is $\approx$20%. Although some frontier models exceed human performance, model accuracy is still far below what would enable reliable experimental guidance. Even within the limited performance, models fail to distinguish reliable predictions from unreliable ones, achieving only $\approx$20% accuracy regardless of their confidence or whether they judge outcomes as predictable without physical experimentation. Human experts, in contrast, demonstrate strong calibration: their accuracy increases from $\approx$5% to $\approx$80% as they deem outcomes more predictable without conducting the experiment. SciPredict establishes a rigorous framework demonstrating that superhuman performance in experimental science requires not just better predictions, but better awareness of prediction reliability. For reproducibility, all our data and code are provided at https://github.com/scaleapi/scipredict

Summary

Main Finding

SciPredict is a large, expert-curated benchmark (405 tasks) for evaluating whether LLMs can predict empirical experimental outcomes in the natural sciences (biology, chemistry, physics). State-of-the-art LLMs achieve only 14–26% prediction accuracy across tasks, a range that straddles human expert accuracy (~20%), and, crucially, their confidence and feasibility judgments are not calibrated. Human experts are well calibrated (accuracy rises from ≈5% to ≈80% as they rate tasks more predictable); models show no such trend, which makes them unreliable for guiding costly physical experiments.

Key Points

Dataset and scope
- 405 prediction tasks from empirical studies published after Mar 31, 2025 (to avoid training-set leakage).
- Tasks span 33 sub-fields across physics (9), chemistry (10), and biology (14).
- Question formats: multiple-choice (MCQ), free-form (FF), numerical (NUM).
- Expert-curated background knowledge is included for each task.
Performance results
- 15 SOTA LLMs evaluated (examples: GPT‑5.2, Gemini 3P, Claude O4.5, Llama 3.3 70B, Qwen 3 235B).
- Model accuracies: 14%–26% (task-type and domain dependent).
- Human experts: ≈20% accuracy overall.
- MCQs are easier for models than free-form and numerical questions.
Calibration and reliability
- Models’ self-reported confidence/feasibility scores do not correlate with actual accuracy (~20% regardless of confidence level).
- Human experts show strong calibration: predicted-feasibility ratings map well to empirical accuracy (accuracy increases from ≈5% to ≈80% across feasibility bins).
- Models therefore cannot reliably flag when their predictions are trustworthy.
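The contrast between calibrated experts and uncalibrated models can be illustrated by tabulating accuracy per self-rated feasibility level. The following is a minimal sketch in Python, not the paper's evaluation code: the function name `accuracy_by_bin`, the 1–3 rating scale, and the toy records are all assumptions made for illustration.

```python
from collections import defaultdict


def accuracy_by_bin(records):
    """Return {rating: accuracy} for (rating, is_correct) pairs."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rating, is_correct in records:
        totals[rating] += 1
        hits[rating] += int(is_correct)
    return {rating: hits[rating] / totals[rating] for rating in sorted(totals)}


# Made-up toy records: (self-rated feasibility on a hypothetical 1-3 scale, correct?)
calibrated = [(1, False)] * 4 + [(2, True), (2, False)] + [(3, True)] * 3 + [(3, False)]
flat = [(1, True), (1, False), (1, False), (1, False),
        (3, True), (3, False), (3, False), (3, False)]

calibrated_curve = accuracy_by_bin(calibrated)
flat_curve = accuracy_by_bin(flat)
print(calibrated_curve)
print(flat_curve)
```

In this toy data the first curve rises with the rating, which is the pattern the paper reports for human experts, while the second stays flat, which is the pattern it reports for models.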
Background knowledge effects
- Expert-curated background knowledge improves model accuracy modestly (~+3% on average; range ~1.2–5.8% depending on model).
- Models’ self-generated background knowledge often harms performance; combining self-generated with expert-curated usually underperforms.
- Primary failure modes: factual errors and flawed logical reasoning, not purely misunderstanding the task.
Curation effort and reproducibility
- Benchmark construction cost ≈ $336k and 7,380 human expert hours.
- All data and code released: https://github.com/scaleapi/scipredict.
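For a rough sense of scale, the reported totals can be turned into per-task figures. This is simple arithmetic on the approximate numbers quoted above, written as a short Python sketch; the variable names are illustrative, and the per-hour figure is only an implied blended rate, since the summary does not say whether the cost covers anything beyond expert time.

```python
# Approximate totals quoted in the summary above.
TOTAL_COST_USD = 336_000
EXPERT_HOURS = 7_380
NUM_TASKS = 405

cost_per_task = TOTAL_COST_USD / NUM_TASKS    # dollars per benchmark task
hours_per_task = EXPERT_HOURS / NUM_TASKS     # expert hours per benchmark task
cost_per_hour = TOTAL_COST_USD / EXPERT_HOURS  # implied blended dollars per hour

print(f"cost per task:  ${cost_per_task:,.0f}")
print(f"hours per task: {hours_per_task:.1f}")
print(f"cost per hour:  ${cost_per_hour:.2f}")
```

Under these figures each task works out to roughly $830 and about 18 expert hours.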

Data & Methods

Data collection and curation
- Source: empirical papers (post-March 31, 2025) in natural sciences.
- Multi-stage expert annotation: extract experimental setup, interventions, measurements, and ground-truth outcomes.
- Tasks formatted into MCQ / FF / NUM with accompanying rubrics (1–10 rubrics for free-form), acceptable numeric ranges for NUM, and ground-truth labels for MCQ.
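The three task formats imply three different correctness checks. The sketch below shows one plausible way to score them in Python. It is an assumption-laden illustration rather than the benchmark's grader: the function names, the case-insensitive label match, and especially the `pass_fraction` rule for aggregating free-form rubrics are hypothetical, since the summary does not state how rubric results are combined.

```python
def score_mcq(predicted_label, true_label):
    """MCQ: correct when the chosen option matches the ground-truth label."""
    return predicted_label.strip().upper() == true_label.strip().upper()


def score_num(predicted_value, low, high):
    """NUM: correct when the value falls inside the acceptable range (inclusive)."""
    return low <= predicted_value <= high


def score_ff(rubric_results, pass_fraction=1.0):
    """FF: correct when enough rubrics are satisfied.

    rubric_results is a list of booleans, one per rubric (1-10 per task).
    pass_fraction is a hypothetical threshold, not taken from the paper.
    """
    if not rubric_results:
        raise ValueError("a free-form task needs at least one rubric")
    return sum(rubric_results) / len(rubric_results) >= pass_fraction


print(score_mcq(" b", "B"))
print(score_num(3.6, 3.0, 3.5))
print(score_ff([True, False], pass_fraction=0.5))
```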
Quality control
- Iterative expert review and deterministic plus LLM-assisted verifications to ensure task precision without revealing answers.
- Focus on making tasks challenging but solvable with the right contextual/prior knowledge.
Evaluation
- Model inputs: a contextual description of the experiment, paired with one of several background-knowledge conditions (expert-curated, self-generated, filtered, combined).
- Metrics: accuracy (task correctness), calibration assessed by correlation of self-reported confidence/feasibility with accuracy, analyses by question type and scientific domain.
- Human experts provided baseline predictions and self-assessed feasibility/confidence.
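The calibration metric described above can be reduced to a single number by correlating stated confidence with a 0/1 correctness indicator. The Python sketch below uses a plain Pearson correlation for this; the summary does not specify which correlation statistic the paper uses, so the choice of Pearson, the helper name `pearson`, and the toy data are all assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up confidences and correctness indicators (1 = correct, 0 = wrong).
confidence = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
informative = [1, 1, 1, 0, 0, 0]    # correct exactly when confident
uninformative = [1, 0, 0, 0, 0, 1]  # correctness unrelated to confidence

r_informative = pearson(confidence, informative)
r_uninformative = pearson(confidence, uninformative)
print(round(r_informative, 3), round(r_uninformative, 3))
```

A value near 1 means confidence tracks correctness; a value near 0 corresponds to the model behavior reported here, where accuracy stays at about 20% whatever the stated confidence.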
Analysis highlights
- Comparison across background-knowledge variants to isolate contributions of curated vs. model-generated priors.
- Domain- and format-specific breakdowns showed consistent degradation from MCQ → FF → NUM.
- Error analysis attributes failures primarily to factual and reasoning mistakes rather than annotation issues.

Implications for AI Economics

R&D productivity and deployment risk
- Potential value: if reliable, LLMs could filter low-value experiments and accelerate discovery, yielding large cost and time savings in R&D-intensive sectors.
- Current reality: low accuracy and poor calibration mean deploying these LLMs to prioritize physical experiments risks misallocating scarce experimental budgets and may increase wasted spend rather than reduce it.
Investment signals
- Building high-quality experimental-prediction capabilities is costly: SciPredict cost (~$336k, 7,380 expert hours) illustrates nontrivial dataset construction expense. Firms and funders should account for these upfront data/annotation costs when assessing returns on model improvements.
- There is clear economic value in improving calibration and uncertainty estimation—investments directed at calibrated models, better retrieval of expert priors, or hybrid human-AI workflows likely yield higher marginal returns than naive scaling of current models.
Market design and productization
- Short-term product strategy: incorporate LLMs as assistive, low-stakes tools (hypothesis generation, literature triage) with human-in-the-loop validation rather than as autonomous experiment planners.
- Long-term product-market fit requires models that not only improve raw accuracy but can reliably estimate when they are trustworthy (so firms can make portfolio-level decisions about which experiments to fund).
Research management and funding allocation
- Decision models for R&D portfolio optimization must incorporate model calibration. Poorly calibrated models can distort expected-value calculations and lead to overcommitment to high-risk/low-return experiments.
- Policymakers and lab managers should prefer staged adoption: use LLMs to produce candidate experiments but require human vetting and small-scale validation before committing large resources.
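The distortion of expected-value calculations can be made concrete with a small worked example. The Python sketch below compares the value a manager would plan for when taking a model's stated confidence at face value with the value implied by the ≈20% accuracy reported regardless of confidence. The payoff, cost, and stated confidence are invented for illustration, and `expected_value` is a hypothetical helper.

```python
def expected_value(p_success, payoff, cost):
    """Expected net value of running one experiment."""
    return p_success * payoff - cost


# Hypothetical experiment economics (not from the paper).
PAYOFF_USD = 100_000
COST_USD = 40_000

stated_confidence = 0.8  # what an uncalibrated model might claim
realized_accuracy = 0.2  # roughly the accuracy reported at every confidence level

planned = expected_value(stated_confidence, PAYOFF_USD, COST_USD)
actual = expected_value(realized_accuracy, PAYOFF_USD, COST_USD)
break_even = COST_USD / PAYOFF_USD  # success probability needed to break even

print(f"planned EV: {planned:,.0f}")
print(f"actual EV:  {actual:,.0f}")
print(f"break-even probability: {break_even:.2f}")
```

With these made-up numbers the experiment looks clearly worth funding on paper but loses money in expectation, because the realized success rate sits below the break-even probability.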
Incentives and regulation
- Firms offering experimental-prediction services should be evaluated on calibration metrics and audited benchmarks (like SciPredict) to avoid overclaiming capabilities.
- Liability and reproducibility concerns: false confidence by models can produce costly downstream harms; transparency about uncertainty and limitations should be required in commercial offerings for experimental guidance.
Research priorities for economic impact
- Calibrated uncertainty estimation: methods that make model confidence meaningful will unlock large economic value because managers can trust selective automation.
- Better grounding and retrieval of expert-curated priors: systems that correctly identify and apply relevant prior experiments will increase predictive performance and decrease expert annotation costs.
- Hybrid workflows: design incentives for human-AI teams—compensation, interfaces, and process changes—that leverage models’ strengths while controlling for calibration failures.

Suggested actionables for decision-makers
- Treat current LLMs as augmentation, not automation: invest in human-in-the-loop processes and small-scale validation experiments before using model predictions to allocate significant funds.
- Prioritize funding and procurement for models that demonstrate both improved accuracy and demonstrable calibration on benchmarks like SciPredict.
- Factor dataset/annotation costs and repeatability audits into ROI calculations for deploying experimental-prediction tools.

Assessment

Paper Type: descriptive
Evidence Strength: medium. The paper provides a systematic, reproducible benchmark (405 tasks across 33 specialized subfields) with head-to-head comparisons between multiple LLMs and human experts, giving credible descriptive evidence about model performance and calibration; however, it does not establish causal effects on scientific productivity or real-world research outcomes and may be affected by task selection and sample limitations.
Methods Rigor: medium. The study appears methodical (large curated benchmark, multiple model evaluations, human expert comparisons, and code/data release for reproducibility), but important methodological details that affect rigor (such as criteria for task selection, the representativeness and number of human experts, exact model versions and prompting protocols, and robustness checks) are not fully described in the abstract, leaving room for selection bias and measurement concerns.
Sample: A benchmark (SciPredict) of 405 prediction tasks derived from recent empirical studies spanning 33 specialized subfields in physics, biology, and chemistry; evaluations include multiple large language models (including some frontier models) and a set of human experts (performance reported at ~20%), with all data and code released for reproducibility.
Themes: productivity, human_ai_collab
Generalizability
- Tasks are drawn from published empirical studies and may not represent the full space of lab experiments or routine R&D decisions (selection bias toward tractable/interesting published results).
- 405 tasks, while diverse, may be small relative to the heterogeneity of experimental protocols and scientific domains.
- Human expert sample size and composition are unclear, limiting inference about general expert populations.
- Model performance depends on specific model versions, prompting, and context not detailed here; results may change as models update.
- Prediction tasks likely abstract away experimental context, measurement noise, and practical constraints present in real lab decision-making, limiting ecological validity.
- Findings concern predictive accuracy and calibration, not downstream impacts on research productivity, cost savings, or adoption in lab workflows.

Claims (9)

Claim	Direction	Confidence	Outcome	Details
We introduce SciPredict, a benchmark comprising 405 tasks derived from recent empirical studies in 33 specialized sub-fields of physics, biology, and chemistry. [Research Productivity]	positive	high	benchmark_size_and_scope	n=405 405 tasks; 33 specialized sub-fields 0.3
Model accuracies on SciPredict are 14-26%. [Decision Quality]	negative	high	prediction_accuracy	n=405 14-26% 0.3
Human expert performance on the benchmark is approximately 20%. [Decision Quality]	negative	high	prediction_accuracy	n=405 ≈20% 0.3
Although some frontier models exceed human performance, model accuracy is still far below what would enable reliable experimental guidance. [Research Productivity]	mixed	high	prediction_accuracy / usability_for_guidance	n=405 0.18
Models fail to distinguish reliable predictions from unreliable ones, achieving only ≈20% accuracy regardless of their confidence or whether they judge outcomes as predictable without physical experimentation. [Decision Quality]	negative	high	calibration_of_confidence_vs_accuracy	n=405 ≈20% 0.3
Human experts demonstrate strong calibration: their accuracy increases from ≈5% to ≈80% as they deem outcomes more predictable without conducting the experiment. [Decision Quality]	positive	high	calibration_of_human_confidence_vs_accuracy	n=405 ≈5% to ≈80% 0.3
SciPredict addresses two critical questions: (a) can LLMs predict the outcome of scientific experiments with sufficient accuracy? and (b) can such predictions be reliably used in the scientific research process? [Research Productivity]	positive	high	research_questions_addressed	n=405 0.18
For reproducibility all our data and code are provided at https://github.com/scaleapi/scipredict [Research Productivity]	positive	high	data_and_code_availability	0.3
Existing benchmarks evaluate LLMs on scientific knowledge and reasoning, but their ability to predict experimental outcomes remains largely underexplored. [Research Productivity]	neutral	medium	scope_of_existing_benchmarks	0.11