AI Scientists Are Only as Good as Their Evidence: A Stratified Ablation of Proprietary Data and Reasoning Skills in Drug-Asset Valuation

AI Scientist agents are often evaluated as if capability were mainly a function of model quality, prompting, or reasoning scaffolds. We test a different hypothesis in drug-asset valuation: for knowledge-intensive scientific decisions, the limiting factor is often the evidence substrate the agent can access. We run a controlled three-arm ablation on a production valuation agent: A is a plain web-only LLM analyst, B adds public structured tools plus a 14-dimension valuation playbook, verifier, objectivity policy and red-team, and C adds the proprietary Noah AI corpus of curated pipeline, trial and deal intelligence. Across a 13-asset stratified benchmark, B improves calibration and audit discipline: tier-in-range accuracy rises from 0.80 to 0.89 and objectivity from 3.16 to 3.30. But B does not remove the factual ceiling. Under capability-superset accounting, A and B recover only 0.25 and 0.38 of the curated gold competitive record, while C recovers 0.96; on the curated long-tail subset, C reaches 0.93 vs. 0.26/0.30. Raw blind-panel decision quality is similar for A and B (7.01 vs. 6.96), so we introduce completeness-aware decision utility: informed decision-quality = decision-quality x gold-coverage. On this metric, C reaches 7.43 vs. 1.76/2.57 for A/B. Even a perfect non-proprietary-data report would be capped at 3.83 by B's coverage. The result is not that reasoning scaffolds are unimportant; they improve calibration and discipline. Rather, proprietary evidence sets the upper bound of what the AI Scientist can know and therefore decide.

Summary

Main Finding

For the drug-asset valuation task the limiting factor for an "AI Scientist" agent is the evidence substrate, not just reasoning scaffolds or prompting. Non‑proprietary skills and public structured tools (playbooks, verifiers, objectivity charter, adversarial red-team) improve calibration and discipline, but a proprietary curated commercial-intelligence database (Noah AI) sets the factual ceiling and therefore the decision-utility ceiling. With the proprietary data present the agent recovers nearly the full curated competitive record and produces high “informed” decision quality; without it the agent is severely coverage-limited even if its reasoning is good.

Key Points

Experimental design: a controlled three‑arm ablation holding model/backbone constant (Claude Code / Opus 4.8 MAX and a default_14dim playbook) while varying (1) web-only baseline (A), (2) +skills + public APIs (B), (3) +proprietary Noah AI data (C).
Benchmark: 13 single‑target drug assets stratified into 5 archetypes (S1 validated‑crowded, S2 validated×white‑space, S3 uncrowded but unvalidated trap, S4 deal‑sensitive, S5 biological void).
Metrics separated into three axes:
- Axis 1 factual coverage (R_gold against a curated proprietary gold set; long‑tail recall; deal recall).
- Axis 2 methodology/objectivity (eight‑law objectivity rubric; key‑fact coverage).
- Axis 3 decision quality (tier‑in‑range, false‑GO rate, blind-panel decision‑quality DQ, and informed‑DQ = DQ × R_gold).
Headline numeric results (A / B / C):
- Recall vs curated gold (R_gold): 0.25 / 0.38 / 0.96
- Long‑tail recall: 0.26 / 0.30 / 0.93
- Deal recall: 0.81 / 0.88 / 0.92
- Objectivity (0–4): 3.16 / 3.30 / 3.60
- Tier‑in‑range: 0.80 / 0.89 / 0.89
- Decision‑quality (0–10, blind panel): 7.01 / 6.96 / 7.65
- Informed decision‑quality (DQ × R_gold): 1.76 / 2.57 / 7.43
Completeness‑aware utility: even with perfect prose (DQ = 10), a non‑proprietary stack capped B’s per‑cell informed utility at 10 × 0.383 = 3.83 — well below observed C performance. Thus improved reasoning cannot substitute for missing facts.
Capability‑superset accounting: factual capacity was generously unioned across arms (A ⊂ B ⊂ C) to avoid undercrediting lower arms; even under this favorable accounting B only reaches ~0.38 coverage.
Practical conclusion: skills/public tools improve calibration (tier correctness, objectivity) and avoid some systematic errors (e.g., A mis‑calibrated on S2), but proprietary, curated long‑tail evidence is required for near‑complete competitive situational awareness and high informed decision utility.

Data & Methods

System and arms:
- Backbone: Claude Code with Opus 4.8 MAX; identical compute/time budgets across arms.
- A (plain): web search only, no rubric/skills.
- B (+skills + public tools): default_14dim playbook (planning, public APIs: ClinicalTrials.gov, PubMed, OpenTargets, OpenFDA), citation verifier, objectivity charter, adversarial red‑team; proprietary data starved.
- C (+data): same as B but with Noah AI proprietary exports queried first.
Noah AI corpus statistics (available to C): ~5,322 programs, 16,547 trials, 476 deals; target‑scoped slices used for competition queries.
Benchmark: 13 assets, 5 strata chosen to expose different expected failure modes (coverage vs calibration). 28–30 scored asset–seed cells per arm after exclusions.
Gold keys: constructed from the proprietary commercial‑intelligence family used by C and audited/enriched by industry experts. Purposefully measures recovery of a curated proprietary diligence record (R_gold), not an independent ground truth for long‑term clinical payoffs.
Scoring:
- Auto metrics based on deterministic matchers (name/code based) and capability‑superset unioning for factual capacity.
- Blind human judging: 3 independent judges per report, anonymized and shuffled across conditions; judges scored objectivity (eight laws: 0–4 each), key‑fact coverage, and decision‑quality (0–10). Reports were scrubbed of provenance to blind B vs. C where possible; A’s free‑form format was partially identifiable and interpreted alongside auto metrics.
- Informed decision‑quality = blind-panel DQ × R_gold; additional utility proxies (hard ceiling 10×R_gold, pessimistic min proxy, geometric proxy) tested robustness.
Sensitivity: paired analysis on fully paired cells (n=27) preserves the C advantage under both raw and capacity recall accounting.
Limitations explicitly noted by authors: gold set is not an independent ground truth (it was built from the proprietary corpus); small asset sample; single backbone model; web_extra was empty in this study so R_union = R_gold.

Implications for AI Economics

Value of proprietary data assets
- Proprietary curated evidence yields large marginal returns in decision‑utility for high‑stakes knowledge work. Firms owning such datasets can convert model-level competence into substantially higher actionable value.
- Investment in curation and long‑tail ingestion (regional, preclinical, unregistered sources) likely yields outsized economic returns relative to equivalent investment in reasoning scaffolds alone.
Market structure and product strategy
- Bundling models with proprietary datasets is a defensible competitive strategy: reasoning advances are necessary but not sufficient for product differentiation in knowledge‑intensive domains.
- Licensing/custody of curated commercial intelligence becomes a durable moat — difficulty of replicating long‑tail coverage implies persistent advantage and potential for premium pricing.
Evaluation and benchmarking practices
- Standard LLM/agent evaluations that focus on model/prompt/playbook improvements without explicitly controlling data access risk conflating data and reasoning effects. Ablation by data access (as done here) should be routine for domain applications.
- Benchmarks for scientific/commercial tasks should incorporate completeness-aware metrics (informed‑DQ or equivalent) rather than only prose‑level quality measures.
Policy and welfare considerations
- Concentration of curated domain evidence in a few proprietary datasets can create asymmetries in decision quality across organizations, affecting competition in biotech/healthcare markets.
- Public‑interest justification for funding open, high‑quality curated corpora (or standards for data access in regulated decisions) becomes stronger when downstream decision utility materially depends on coverage.
Productization & cost tradeoffs
- Even if cost per token/model call favors public APIs, total utility depends on coverage; firms should weigh licensing curation costs against the marginal loss in informed decision utility and associated economic risk.
- Automation of due diligence and valuation is feasible, but realistically requires paying for curated evidence to reach near‑human (complete) performance rather than relying on better prompting or scaffolding alone.
Research directions relevant to economic agents
- Quantify cost‑benefit: measure ROI of proprietary curation vs. gains from improved models or playbooks.
- Study generality: replicate across other high‑stakes domains (e.g., M&A, litigation, materials) to map where data vs reasoning dominates.
- Policy research: analyze market power implications and potential interventions (data sharing norms, public datasets) that affect welfare.

Caveats and limits to generalization: the “gold” here is derived from the proprietary corpus and expert curation, so the experiment specifically measures recovery of that commercial diligence record. Results strongly support the claim that curated proprietary evidence materially limits achievable decision utility in this task, but external validation with independent ground truth (long‑term outcomes) and larger, more diverse benchmarks would strengthen external generalizability.

Assessment

Paper Typequasi_experimental Evidence Strengthmedium — The within-agent ablation and multiple objective metrics (tier-in-range accuracy, objectivity, gold-coverage, blind-panel decision quality) provide credible internal evidence that proprietary data drives large gains; however small sample size (13 assets), single domain (drug-asset valuation), proprietary/unreplicable corpus, and limited transparency on benchmark and panel procedures constrain external validity and overall strength. Methods Rigormedium — Design uses a clear controlled ablation, stratified benchmark, and both quantitative and blind-panel qualitative metrics, which is rigorous for an engineering experiment; weaknesses include small N, lack of randomized assignment (not applicable but no cross-validation over larger holdouts), reliance on proprietary gold records, and limited detail about scoring/annotator procedures and statistical uncertainty. SampleA stratified 13-asset benchmark in drug-asset valuation with a curated gold competitive record and a curated 'long-tail' subset; three agent configurations (A: web-only LLM analyst; B: adds public structured tools, a 14-dimension valuation playbook, verifier, objectivity policy, and red-team; C: B plus proprietary Noah AI corpus of curated pipeline, trial, and deal intelligence). Metrics include tier-in-range accuracy, objectivity scores, gold-coverage, blind-panel decision-quality, and derived completeness-aware decision utility. Themesproductivity human_ai_collab IdentificationControlled three-arm ablation comparing the same production valuation agent under three evidence-access treatments (A: web-only LLM; B: web + public structured tools + playbook/verifier/policies; C: B + proprietary Noah AI corpus) across a stratified 13-asset benchmark, with performance differences attributed to incremental evidence access while holding model and scaffolds constant. GeneralizabilitySmall sample (13 assets) limits statistical generality., Single domain (biopharma/drug-asset valuation) — may not generalize to other industries., Proprietary Noah AI corpus is not publicly available, limiting reproducibility and transferability., Agent implementation details and base LLM unspecified or proprietary, so results may depend on specific system architecture., Benchmark curation and gold labels may embed selection bias favoring the proprietary corpus., Metrics like objectivity and blind-panel ratings may be sensitive to panel composition and scoring protocols.

Claims (10)

Claim	Direction	Confidence	Outcome	Details
For knowledge-intensive scientific decisions, the limiting factor is often the evidence substrate the agent can access. Decision Quality	positive	high	AI Scientist capability limited by available evidence substrate (affecting decision quality)	n=13 0.48
We ran a controlled three-arm ablation on a production valuation agent: A = plain web-only LLM analyst; B = adds public structured tools + a 14-dimension valuation playbook, verifier, objectivity policy and red-team; C = adds the proprietary Noah AI corpus of curated pipeline, trial and deal intelligence. Other	neutral	high	experimental treatment definitions (method)	n=13 0.8
Across a 13-asset stratified benchmark, adding the public structured tools and playbook (B) improves tier-in-range accuracy from 0.80 to 0.89 (vs. A). Decision Quality	positive	high	tier-in-range accuracy	n=13 0.80 to 0.89 0.48
Across the same benchmark, B increases objectivity from 3.16 to 3.30 (vs. A). Decision Quality	positive	high	objectivity score	n=13 3.16 to 3.30 0.48
Under capability-superset accounting on the curated gold competitive record, agent A recovers only 0.25, agent B recovers 0.38, while agent C recovers 0.96 (overall). Decision Quality	mixed	high	fraction of curated gold competitive record recovered (gold-coverage)	n=13 A: 0.25; B: 0.38; C: 0.96 0.48
On the curated long-tail subset, agent C reaches 0.93 recovered coverage vs. A 0.26 and B 0.30. Decision Quality	positive	high	fraction of curated gold record recovered on long-tail subset (gold-coverage)	C: 0.93; A: 0.26; B: 0.30 0.48
Raw blind-panel decision quality is similar for A and B (7.01 vs. 6.96). Decision Quality	null_result	high	raw blind-panel decision-quality score	7.01 vs. 6.96 0.48
Introducing completeness-aware decision utility (informed decision-quality = decision-quality × gold-coverage), C reaches 7.43 vs. A 1.76 and B 2.57 on this metric. Decision Quality	positive	high	informed decision-quality (decision-quality × gold-coverage)	n=13 C: 7.43; A: 1.76; B: 2.57 0.48
Even a perfect non-proprietary-data report would be capped at 3.83 by B's coverage (i.e., B imposes an upper bound on non-proprietary informed decision-quality). Decision Quality	negative	high	maximum achievable informed decision-quality for non-proprietary-data reports under B's coverage	3.83 0.48
Reasoning scaffolds (public tools, playbook, verifier, objectivity policy, red-team) improve calibration and audit discipline, but proprietary evidence sets the upper bound of what the AI Scientist can know and therefore decide. Decision Quality	mixed	high	calibration/audit discipline improvements vs. upper bound of knowledge/decision capability	n=13 0.48