Large language models struggle to predict experimental outcomes reliably: they reach only 14–26% accuracy, comparable to human experts (≈20%), but cannot tell when their predictions are trustworthy. Human experts, by contrast, are well calibrated and markedly better at identifying which outcomes can be predicted without physical tests.
Accelerating scientific discovery requires the identification of which experiments would yield the best outcomes before committing resources to costly physical validation. While existing benchmarks evaluate LLMs on scientific knowledge and reasoning, their ability to predict experimental outcomes, a task where AI could significantly exceed human capabilities, remains largely underexplored. We introduce SciPredict, a benchmark comprising 405 tasks derived from recent empirical studies in 33 specialized sub-fields of physics, biology, and chemistry. SciPredict addresses two critical questions: (a) can LLMs predict the outcome of scientific experiments with sufficient accuracy? and (b) can such predictions be reliably used in the scientific research process? Evaluations reveal fundamental limitations on both fronts. Model accuracies are 14–26% and human expert performance is ≈20%. Although some frontier models exceed human performance, model accuracy is still far below what would enable reliable experimental guidance. Even within the limited performance, models fail to distinguish reliable predictions from unreliable ones, achieving only ≈20% accuracy regardless of their confidence or whether they judge outcomes as predictable without physical experimentation. Human experts, in contrast, demonstrate strong calibration: their accuracy increases from ≈5% to ≈80% as they deem outcomes more predictable without conducting the experiment. SciPredict establishes a rigorous framework demonstrating that superhuman performance in experimental science requires not just better predictions, but better awareness of prediction reliability. For reproducibility, all our data and code are provided at https://github.com/scaleapi/scipredict
Summary
Main Finding
SciPredict is a large, expert-curated benchmark (405 tasks) for evaluating whether LLMs can predict empirical experimental outcomes in the natural sciences (biology, chemistry, physics). State-of-the-art LLMs achieve only 14–26% prediction accuracy, roughly comparable to or slightly above human expert accuracy (~20%), and, crucially, their confidence and feasibility judgments are not calibrated. Human experts are well calibrated (accuracy rises from ≈5% to ≈80% as they rate tasks more predictable); models show no such trend, which makes them unreliable for guiding costly physical experiments.
Key Points
Dataset and scope
- 405 prediction tasks from empirical studies published after March 31, 2025 (to avoid training-set leakage).
- Tasks span 33 sub-fields across physics (9), chemistry (10), and biology (14).
- Question formats: multiple-choice (MCQ), free-form (FF), numerical (NUM).
- Expert-curated background knowledge is included for each task.
Performance results
- 15 SOTA LLMs evaluated (examples: GPT‑5.2, Gemini 3P, Claude O4.5, Llama 3.3 70B, Qwen 3 235B).
- Model accuracies: 14%–26% (task-type and domain dependent).
- Human experts: ≈20% accuracy overall.
- MCQs are easier for models than free-form and numerical questions.
Calibration and reliability
- Models’ self-reported confidence/feasibility scores do not correlate with actual accuracy: accuracy stays at ~20% regardless of confidence level.
- Human experts show strong calibration: predicted-feasibility ratings map well to empirical accuracy (accuracy increases from ≈5% to ≈80% across feasibility bins).
- Models therefore cannot reliably flag when their predictions are trustworthy.
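The per-bin comparison described above can be illustrated with a minimal sketch. The record format, the 1–3 rating scale, and the function name `accuracy_by_bin` are assumptions made for illustration, and the records are made up rather than taken from the benchmark.

```python
from collections import defaultdict


def accuracy_by_bin(records):
    """Return {rating: accuracy} for an iterable of (rating, correct) pairs."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rating, correct in records:
        totals[rating] += 1
        hits[rating] += int(correct)
    return {rating: hits[rating] / totals[rating] for rating in sorted(totals)}


# Made-up records: (self-rated feasibility, was the prediction correct?)
# A calibrated rater is rarely right at rating 1 and usually right at rating 3.
calibrated = [(1, True)] * 1 + [(1, False)] * 9 + [(3, True)] * 8 + [(3, False)] * 2
# An uncalibrated rater is right equally often at every rating.
uncalibrated = [(1, True)] * 2 + [(1, False)] * 8 + [(3, True)] * 2 + [(3, False)] * 8

calibrated_curve = accuracy_by_bin(calibrated)
uncalibrated_curve = accuracy_by_bin(uncalibrated)
print(calibrated_curve)    # accuracy rises with the rating
print(uncalibrated_curve)  # accuracy is flat across ratings
```

A rising curve is what the summary reports for human experts; a flat curve near 20% is what it reports for models.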
Background knowledge effects
- Expert-curated background knowledge improves model accuracy modestly (~+3% on average; range ~1.2–5.8% depending on model).
- Models’ self-generated background knowledge often harms performance; combining self-generated with expert-curated usually underperforms.
- Primary failure modes: factual errors and flawed logical reasoning, not purely misunderstanding the task.
Curation effort and reproducibility
- Benchmark construction cost ≈ $336k and required 7,380 human expert hours.
- All data and code released: https://github.com/scaleapi/scipredict.
Data & Methods
Data collection and curation
- Source: empirical papers in the natural sciences published after March 31, 2025.
- Multi-stage expert annotation: extract experimental setup, interventions, measurements, and ground-truth outcomes.
- Tasks are formatted as MCQ, FF, or NUM, with 1–10 grading rubrics per free-form task, acceptable numeric ranges for NUM, and ground-truth labels for MCQ.
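As a rough illustration of how two of these formats could be represented and scored, here is a minimal sketch. The class names, field names, inclusive range bounds, and example tasks are all assumptions for illustration and are not the benchmark's actual schema. Free-form scoring is left out because the summary does not say how rubric judgments are aggregated.

```python
from dataclasses import dataclass


@dataclass
class McqTask:
    question: str
    options: list[str]
    answer_index: int  # ground-truth label (index into options)


@dataclass
class NumTask:
    question: str
    low: float   # lower bound of the acceptable range
    high: float  # upper bound of the acceptable range


def score_mcq(task: McqTask, predicted_index: int) -> bool:
    """An MCQ prediction is correct when it matches the ground-truth label."""
    return predicted_index == task.answer_index


def score_num(task: NumTask, predicted_value: float) -> bool:
    """A numeric prediction is correct when it falls inside the range
    (bounds treated as inclusive here, which is an assumption)."""
    return task.low <= predicted_value <= task.high


# Hypothetical tasks, invented for this example.
mcq = McqTask("Does yield increase, decrease, or stay flat?",
              ["increase", "decrease", "stay flat"], answer_index=0)
num = NumTask("What is the measured conversion (%)?", low=40.0, high=55.0)

print(score_mcq(mcq, 0), score_mcq(mcq, 2))      # True False
print(score_num(num, 47.5), score_num(num, 60))  # True False
```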
Quality control
- Iterative expert review, plus deterministic and LLM-assisted checks, to ensure task precision without revealing answers.
- Focus on making tasks challenging but solvable with the right contextual/prior knowledge.
Evaluation
- Model inputs: a contextual experimental description plus one of several background-knowledge conditions (expert-curated, self-generated, filtered, combined).
- Metrics: accuracy (task correctness); calibration, assessed by the correlation of self-reported confidence/feasibility with accuracy; and breakdowns by question type and scientific domain.
- Human experts provided baseline predictions and self-assessed feasibility/confidence.
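One simple way to quantify the confidence-accuracy relationship named in the metrics is a Pearson correlation between confidence scores and 0/1 correctness. The sketch below is an assumption about how such a check might look, not the paper's stated procedure; the helper `pearson` and the sample values are invented for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)


correct = [0, 0, 1, 1]  # 1 = prediction was right (made-up outcomes)

informative_conf = [0.2, 0.4, 0.6, 0.8]    # confidence tracks correctness
uninformative_conf = [0.2, 0.8, 0.2, 0.8]  # confidence unrelated to correctness

r_informative = pearson(informative_conf, correct)
r_uninformative = pearson(uninformative_conf, correct)
print(round(r_informative, 3))
print(round(r_uninformative, 3))
```

A correlation near zero means confidence carries no information about correctness, which is the pattern the summary attributes to the evaluated models.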
Analysis highlights
- Comparison across background-knowledge variants to isolate contributions of curated vs. model-generated priors.
- Domain- and format-specific breakdowns showed consistent degradation from MCQ → FF → NUM.
- Error analysis attributes failures primarily to factual and reasoning mistakes rather than annotation issues.
Implications for AI Economics
R&D productivity and deployment risk
- Potential value: if reliable, LLMs could filter low-value experiments and accelerate discovery, yielding large cost and time savings in R&D-intensive sectors.
- Current reality: low accuracy and poor calibration mean deploying these LLMs to prioritize physical experiments risks misallocating scarce experimental budgets and may increase wasted spend rather than reduce it.
Investment signals
- Building high-quality experimental-prediction capabilities is costly: SciPredict's cost (~$336k, 7,380 expert hours) shows that dataset construction is a nontrivial expense. Firms and funders should account for these upfront data/annotation costs when assessing returns on model improvements.
- There is clear economic value in improving calibration and uncertainty estimation: investments directed at calibrated models, better retrieval of expert priors, or hybrid human-AI workflows likely yield higher marginal returns than naive scaling of current models.
Market design and productization
- Short-term product strategy: incorporate LLMs as assistive, low-stakes tools (hypothesis generation, literature triage) with human-in-the-loop validation rather than as autonomous experiment planners.
- Long-term product-market fit requires models that not only improve raw accuracy but can reliably estimate when they are trustworthy (so firms can make portfolio-level decisions about which experiments to fund).
Research management and funding allocation
- Decision models for R&D portfolio optimization must incorporate model calibration. Poorly calibrated models can distort expected-value calculations and lead to overcommitment to high-risk/low-return experiments.
- Policymakers and lab managers should prefer staged adoption: use LLMs to produce candidate experiments but require human vetting and small-scale validation before committing large resources.
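To make the expected-value distortion concrete, here is a small worked sketch. The payoff, cost, and stated confidence are hypothetical numbers chosen for illustration; only the ~20% realized accuracy echoes the figure reported in this summary.

```python
def expected_value(p_success: float, payoff: float, cost: float) -> float:
    """Expected net value of funding one experiment."""
    return p_success * payoff - cost


payoff = 100.0           # hypothetical value of a successful experiment
cost = 30.0              # hypothetical cost of running it
stated_confidence = 0.8  # hypothetical confidence reported by a model
realized_accuracy = 0.2  # roughly the accuracy the summary reports at any confidence level

planned = expected_value(stated_confidence, payoff, cost)
realized = expected_value(realized_accuracy, payoff, cost)

print(f"planned EV:  {planned:.1f}")   # positive, so the experiment looks worth funding
print(f"realized EV: {realized:.1f}")  # negative, so funding it loses money on average
```

Under these assumptions a manager who takes the stated confidence at face value funds an experiment whose realized expected value is negative, which is the overcommitment risk described above.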
Incentives and regulation
- Firms offering experimental-prediction services should be evaluated on calibration metrics and audited benchmarks (like SciPredict) to avoid overclaiming capabilities.
- Liability and reproducibility concerns: false confidence by models can produce costly downstream harms; transparency about uncertainty and limitations should be required in commercial offerings for experimental guidance.
Research priorities for economic impact
- Calibrated uncertainty estimation: methods that make model confidence meaningful will unlock large economic value because managers can trust selective automation.
- Better grounding and retrieval of expert-curated priors: systems that correctly identify and apply relevant prior experiments will increase predictive performance and decrease expert annotation costs.
- Hybrid workflows: design incentives for human-AI teams (compensation, interfaces, and process changes) that leverage models’ strengths while controlling for calibration failures.
Suggested actions for decision-makers
- Treat current LLMs as augmentation, not automation: invest in human-in-the-loop processes and small-scale validation experiments before using model predictions to allocate significant funds.
- Prioritize funding and procurement for models that show both improved accuracy and demonstrable calibration on benchmarks like SciPredict.
- Factor dataset/annotation costs and repeatability audits into ROI calculations for deploying experimental-prediction tools.
Assessment
Claims (9)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| We introduce SciPredict, a benchmark comprising 405 tasks derived from recent empirical studies in 33 specialized sub-fields of physics, biology, and chemistry. (Research Productivity) | positive | high | benchmark_size_and_scope | n=405; 405 tasks; 33 specialized sub-fields; 0.3 |
| Model accuracies on SciPredict are 14–26%. (Decision Quality) | negative | high | prediction_accuracy | n=405; 14–26%; 0.3 |
| Human expert performance on the benchmark is approximately 20%. (Decision Quality) | negative | high | prediction_accuracy | n=405; ≈20%; 0.3 |
| Although some frontier models exceed human performance, model accuracy is still far below what would enable reliable experimental guidance. (Research Productivity) | mixed | high | prediction_accuracy / usability_for_guidance | n=405; 0.18 |
| Models fail to distinguish reliable predictions from unreliable ones, achieving only ≈20% accuracy regardless of their confidence or whether they judge outcomes as predictable without physical experimentation. (Decision Quality) | negative | high | calibration_of_confidence_vs_accuracy | n=405; ≈20%; 0.3 |
| Human experts demonstrate strong calibration: their accuracy increases from ≈5% to ≈80% as they deem outcomes more predictable without conducting the experiment. (Decision Quality) | positive | high | calibration_of_human_confidence_vs_accuracy | n=405; ≈5% to ≈80%; 0.3 |
| SciPredict addresses two critical questions: (a) can LLMs predict the outcome of scientific experiments with sufficient accuracy? and (b) can such predictions be reliably used in the scientific research process? (Research Productivity) | positive | high | research_questions_addressed | n=405; 0.18 |
| For reproducibility all our data and code are provided at https://github.com/scaleapi/scipredict (Research Productivity) | positive | high | data_and_code_availability | 0.3 |
| Existing benchmarks evaluate LLMs on scientific knowledge and reasoning, but their ability to predict experimental outcomes remains largely underexplored. (Research Productivity) | neutral | medium | scope_of_existing_benchmarks | 0.11 |