Access to a proprietary, curated evidence corpus — not just better prompting or tools — determines how well an AI can value drug assets: adding the Noah AI corpus raised coverage and completeness-aware decision-utility nearly fourfold compared with public scaffolds, which mainly improved calibration but left a strict factual ceiling.
AI Scientist agents are often evaluated as if capability were mainly a function of model quality, prompting, or reasoning scaffolds. We test a different hypothesis in drug-asset valuation: for knowledge-intensive scientific decisions, the limiting factor is often the evidence substrate the agent can access. We run a controlled three-arm ablation on a production valuation agent: A is a plain web-only LLM analyst, B adds public structured tools plus a 14-dimension valuation playbook, verifier, objectivity policy and red-team, and C adds the proprietary Noah AI corpus of curated pipeline, trial and deal intelligence. Across a 13-asset stratified benchmark, B improves calibration and audit discipline: tier-in-range accuracy rises from 0.80 to 0.89 and objectivity from 3.16 to 3.30. But B does not remove the factual ceiling. Under capability-superset accounting, A and B recover only 0.25 and 0.38 of the curated gold competitive record, while C recovers 0.96; on the curated long-tail subset, C reaches 0.93 vs. 0.26/0.30. Raw blind-panel decision quality is similar for A and B (7.01 vs. 6.96), so we introduce completeness-aware decision utility: informed decision-quality = decision-quality x gold-coverage. On this metric, C reaches 7.43 vs. 1.76/2.57 for A/B. Even a perfect non-proprietary-data report would be capped at 3.83 by B's coverage. The result is not that reasoning scaffolds are unimportant; they improve calibration and discipline. Rather, proprietary evidence sets the upper bound of what the AI Scientist can know and therefore decide.
Summary
Main Finding
For the drug-asset valuation task the limiting factor for an "AI Scientist" agent is the evidence substrate, not just reasoning scaffolds or prompting. Non‑proprietary skills and public structured tools (playbooks, verifiers, objectivity charter, adversarial red-team) improve calibration and discipline, but a proprietary curated commercial-intelligence database (Noah AI) sets the factual ceiling and therefore the decision-utility ceiling. With the proprietary data present the agent recovers nearly the full curated competitive record and produces high “informed” decision quality; without it the agent is severely coverage-limited even if its reasoning is good.
Key Points
- Experimental design: a controlled three‑arm ablation holding model/backbone constant (Claude Code / Opus 4.8 MAX and a default_14dim playbook) while varying (1) web-only baseline (A), (2) +skills + public APIs (B), (3) +proprietary Noah AI data (C).
- Benchmark: 13 single‑target drug assets stratified into 5 archetypes (S1 validated‑crowded, S2 validated×white‑space, S3 uncrowded but unvalidated trap, S4 deal‑sensitive, S5 biological void).
- Metrics separated into three axes:
- Axis 1 factual coverage (R_gold against a curated proprietary gold set; long‑tail recall; deal recall).
- Axis 2 methodology/objectivity (eight‑law objectivity rubric; key‑fact coverage).
- Axis 3 decision quality (tier‑in‑range, false‑GO rate, blind-panel decision‑quality DQ, and informed‑DQ = DQ × R_gold).
- Headline numeric results (A / B / C):
- Recall vs curated gold (R_gold): 0.25 / 0.38 / 0.96
- Long‑tail recall: 0.26 / 0.30 / 0.93
- Deal recall: 0.81 / 0.88 / 0.92
- Objectivity (0–4): 3.16 / 3.30 / 3.60
- Tier‑in‑range: 0.80 / 0.89 / 0.89
- Decision‑quality (0–10, blind panel): 7.01 / 6.96 / 7.65
- Informed decision‑quality (DQ × R_gold): 1.76 / 2.57 / 7.43
- Completeness‑aware utility: even with perfect prose (DQ = 10), a non‑proprietary stack capped B’s per‑cell informed utility at 10 × 0.383 = 3.83 — well below observed C performance. Thus improved reasoning cannot substitute for missing facts.
- Capability‑superset accounting: factual capacity was generously unioned across arms (A ⊂ B ⊂ C) to avoid undercrediting lower arms; even under this favorable accounting B only reaches ~0.38 coverage.
- Practical conclusion: skills/public tools improve calibration (tier correctness, objectivity) and avoid some systematic errors (e.g., A mis‑calibrated on S2), but proprietary, curated long‑tail evidence is required for near‑complete competitive situational awareness and high informed decision utility.
Data & Methods
- System and arms:
- Backbone: Claude Code with Opus 4.8 MAX; identical compute/time budgets across arms.
- A (plain): web search only, no rubric/skills.
- B (+skills + public tools): default_14dim playbook (planning, public APIs: ClinicalTrials.gov, PubMed, OpenTargets, OpenFDA), citation verifier, objectivity charter, adversarial red‑team; proprietary data starved.
- C (+data): same as B but with Noah AI proprietary exports queried first.
- Noah AI corpus statistics (available to C): ~5,322 programs, 16,547 trials, 476 deals; target‑scoped slices used for competition queries.
- Benchmark: 13 assets, 5 strata chosen to expose different expected failure modes (coverage vs calibration). 28–30 scored asset–seed cells per arm after exclusions.
- Gold keys: constructed from the proprietary commercial‑intelligence family used by C and audited/enriched by industry experts. Purposefully measures recovery of a curated proprietary diligence record (R_gold), not an independent ground truth for long‑term clinical payoffs.
- Scoring:
- Auto metrics based on deterministic matchers (name/code based) and capability‑superset unioning for factual capacity.
- Blind human judging: 3 independent judges per report, anonymized and shuffled across conditions; judges scored objectivity (eight laws: 0–4 each), key‑fact coverage, and decision‑quality (0–10). Reports were scrubbed of provenance to blind B vs. C where possible; A’s free‑form format was partially identifiable and interpreted alongside auto metrics.
- Informed decision‑quality = blind-panel DQ × R_gold; additional utility proxies (hard ceiling 10×R_gold, pessimistic min proxy, geometric proxy) tested robustness.
- Sensitivity: paired analysis on fully paired cells (n=27) preserves the C advantage under both raw and capacity recall accounting.
- Limitations explicitly noted by authors: gold set is not an independent ground truth (it was built from the proprietary corpus); small asset sample; single backbone model; web_extra was empty in this study so R_union = R_gold.
Implications for AI Economics
- Value of proprietary data assets
- Proprietary curated evidence yields large marginal returns in decision‑utility for high‑stakes knowledge work. Firms owning such datasets can convert model-level competence into substantially higher actionable value.
- Investment in curation and long‑tail ingestion (regional, preclinical, unregistered sources) likely yields outsized economic returns relative to equivalent investment in reasoning scaffolds alone.
- Market structure and product strategy
- Bundling models with proprietary datasets is a defensible competitive strategy: reasoning advances are necessary but not sufficient for product differentiation in knowledge‑intensive domains.
- Licensing/custody of curated commercial intelligence becomes a durable moat — difficulty of replicating long‑tail coverage implies persistent advantage and potential for premium pricing.
- Evaluation and benchmarking practices
- Standard LLM/agent evaluations that focus on model/prompt/playbook improvements without explicitly controlling data access risk conflating data and reasoning effects. Ablation by data access (as done here) should be routine for domain applications.
- Benchmarks for scientific/commercial tasks should incorporate completeness-aware metrics (informed‑DQ or equivalent) rather than only prose‑level quality measures.
- Policy and welfare considerations
- Concentration of curated domain evidence in a few proprietary datasets can create asymmetries in decision quality across organizations, affecting competition in biotech/healthcare markets.
- Public‑interest justification for funding open, high‑quality curated corpora (or standards for data access in regulated decisions) becomes stronger when downstream decision utility materially depends on coverage.
- Productization & cost tradeoffs
- Even if cost per token/model call favors public APIs, total utility depends on coverage; firms should weigh licensing curation costs against the marginal loss in informed decision utility and associated economic risk.
- Automation of due diligence and valuation is feasible, but realistically requires paying for curated evidence to reach near‑human (complete) performance rather than relying on better prompting or scaffolding alone.
- Research directions relevant to economic agents
- Quantify cost‑benefit: measure ROI of proprietary curation vs. gains from improved models or playbooks.
- Study generality: replicate across other high‑stakes domains (e.g., M&A, litigation, materials) to map where data vs reasoning dominates.
- Policy research: analyze market power implications and potential interventions (data sharing norms, public datasets) that affect welfare.
Caveats and limits to generalization: the “gold” here is derived from the proprietary corpus and expert curation, so the experiment specifically measures recovery of that commercial diligence record. Results strongly support the claim that curated proprietary evidence materially limits achievable decision utility in this task, but external validation with independent ground truth (long‑term outcomes) and larger, more diverse benchmarks would strengthen external generalizability.
Assessment
Claims (10)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| For knowledge-intensive scientific decisions, the limiting factor is often the evidence substrate the agent can access. Decision Quality | positive | high | AI Scientist capability limited by available evidence substrate (affecting decision quality) |
n=13
0.48
|
| We ran a controlled three-arm ablation on a production valuation agent: A = plain web-only LLM analyst; B = adds public structured tools + a 14-dimension valuation playbook, verifier, objectivity policy and red-team; C = adds the proprietary Noah AI corpus of curated pipeline, trial and deal intelligence. Other | neutral | high | experimental treatment definitions (method) |
n=13
0.8
|
| Across a 13-asset stratified benchmark, adding the public structured tools and playbook (B) improves tier-in-range accuracy from 0.80 to 0.89 (vs. A). Decision Quality | positive | high | tier-in-range accuracy |
n=13
0.80 to 0.89
0.48
|
| Across the same benchmark, B increases objectivity from 3.16 to 3.30 (vs. A). Decision Quality | positive | high | objectivity score |
n=13
3.16 to 3.30
0.48
|
| Under capability-superset accounting on the curated gold competitive record, agent A recovers only 0.25, agent B recovers 0.38, while agent C recovers 0.96 (overall). Decision Quality | mixed | high | fraction of curated gold competitive record recovered (gold-coverage) |
n=13
A: 0.25; B: 0.38; C: 0.96
0.48
|
| On the curated long-tail subset, agent C reaches 0.93 recovered coverage vs. A 0.26 and B 0.30. Decision Quality | positive | high | fraction of curated gold record recovered on long-tail subset (gold-coverage) |
C: 0.93; A: 0.26; B: 0.30
0.48
|
| Raw blind-panel decision quality is similar for A and B (7.01 vs. 6.96). Decision Quality | null_result | high | raw blind-panel decision-quality score |
7.01 vs. 6.96
0.48
|
| Introducing completeness-aware decision utility (informed decision-quality = decision-quality × gold-coverage), C reaches 7.43 vs. A 1.76 and B 2.57 on this metric. Decision Quality | positive | high | informed decision-quality (decision-quality × gold-coverage) |
n=13
C: 7.43; A: 1.76; B: 2.57
0.48
|
| Even a perfect non-proprietary-data report would be capped at 3.83 by B's coverage (i.e., B imposes an upper bound on non-proprietary informed decision-quality). Decision Quality | negative | high | maximum achievable informed decision-quality for non-proprietary-data reports under B's coverage |
3.83
0.48
|
| Reasoning scaffolds (public tools, playbook, verifier, objectivity policy, red-team) improve calibration and audit discipline, but proprietary evidence sets the upper bound of what the AI Scientist can know and therefore decide. Decision Quality | mixed | high | calibration/audit discipline improvements vs. upper bound of knowledge/decision capability |
n=13
0.48
|