A curated drug-asset index substantially improves LLM-powered competitive scouting: in a 10-target benchmark, Gosset finds 3.2× more verified long-tail oncology/immunology assets per query than the best of four leading web-search models. The authors suggest that exposing the same index as a callable tool would let other frontier models recover most of that advantage, implying that search and data coverage, not model capability, is the main bottleneck.

Curated AI beats frontier LLMs at pharma asset discovery

Łukasz Kidziński, Kevin Thomas · May 06, 2026

Source: arXiv · Type: descriptive · Evidence: medium · Relevance: 7/10

A curated, annotation-backed search index with a chat interface (Gosset) retrieves 3.2× more verified long-tail oncology/immunology drug assets per query than the best of four leading web-search LLMs, with perfect precision and 100% recall on a 10-target benchmark. The authors suggest that making the index callable as a tool would let other models close most of the recall gap.

General-purpose LLMs with web search are increasingly used to scout the competitive landscape of pharmaceutical pipelines. We benchmark Gosset -- an AI platform with a chat interface backed by curated target-, modality-, and indication-level drug-asset annotations -- against four frontier systems with web access (Claude Opus 4.7, GPT 5.5, Gemini 3.1 Pro, Perplexity sonar-pro) on ten niche oncology/immunology targets where most of the pipeline lives in the long tail of preclinical and Asian-developed assets. All five systems receive the same natural-language query and the same JSON output schema. Across 10 targets Gosset returns 3.2x more verified drugs per query than the best frontier system, at perfect precision and 100% recall against the cross-system union of verified drugs. The same curated index is exposed as a Gosset MCP server that any frontier model can call as a tool, suggesting that each of these systems can close most of the recall gap by swapping generic web search for a curated index behind the same chat interface.

Summary

Main Finding

Gosset — an AI chat front end backed by a curated drug-asset index — substantially outperforms four frontier LLMs with live web access (Claude Opus 4.7, GPT 5.5, Gemini 3.1 Pro, Perplexity sonar-pro) for enumerating niche pharma assets. Across 10 niche oncology/immunology targets, Gosset returned 451 verified drug programs (100% precision and 100% recall relative to the cross-system union), roughly 3.2× more verified drugs than the best frontier system (GPT + web, 140). The paper argues the gap is an index problem (coverage of long-tail, early-stage assets), not a reasoning problem, and that frontier models can largely close the gap by calling the same curated index as a tool (via an MCP server).

Key Points

Experiment scope: 10 niche targets with long preclinical/early-clinical tails — TL1A, OX40L, IL-36R, TROP-2, B7‑H3, ROR1, NaPi2b, Claudin 18.2, FAP, GPRC5D.
Controlled test: identical natural‑language prompt and JSON output schema given to all systems.
Systems compared:
- Gosset (curated index; no live web; median latency 1.2 s)
- Claude + web (Anthropic, Claude Opus 4.7)
- GPT + web (OpenAI, GPT 5.5)
- Gemini + web (Google, Gemini 3.1 Pro)
- Perplexity sonar-pro
Aggregate (human-reviewed) results:
- Gosset: 451 verified, 0 hallucinated — Precision 1.000, Recall (vs union) 1.000, median latency 1.2 s
- GPT + web: 140 verified, 0 hallucinated — Precision 1.000, Recall 0.310, median latency 804.8 s
- Claude + web: 124 verified, 1 hallucinated — Precision 0.992, Recall 0.275, median latency 103.1 s
- Gemini + web: 109 verified, 0 hallucinated — Precision 1.000, Recall 0.242, median latency 106.8 s
- Perplexity: 76 verified, 2 hallucinated — Precision 0.974, Recall 0.169, median latency 15.3 s
Main behavioral pattern: frontier LLMs handle late-stage, press-covered assets well (anchor drugs), but miss large parts of the long tail (early-stage, regional, academic programs). Gosset’s curated index captures that long tail.
Validation pipeline: deterministic auto-pass for assets with industry-grade evidence; three-LLM judge triage (Claude/GPT/Gemini) for residuals; final human expert sign-off. Alias-aware deduplication used for scoring.
Latency advantage: Gosset answers in about 1–2 s per query, versus median latencies of 15.3 s (Perplexity) to 804.8 s, about 13 minutes (GPT + web), for the web-augmented systems.
Reproducibility/transfer claim: Gosset exposes the index as an MCP server; frontier LLMs should close most of the recall gap by calling that curated index as a tool.
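To make the tool-calling idea concrete, here is a minimal sketch of what a client-side call to a curated index over MCP could look like. It only builds the JSON-RPC `tools/call` request that MCP defines and parses a text result. The tool name `search_drug_assets`, its `target` argument, the assumption that the tool returns a JSON array in a text content block, and the placeholder asset in the sample response are all hypothetical; the source does not describe the Gosset server's actual tool interface.

```python
import json

# Output schema fields named in the benchmark description.
REQUIRED_FIELDS = ("name", "sponsor", "modality", "phase", "indication")


def build_tool_call(request_id, tool_name, arguments):
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    })


def extract_assets(response_json):
    """Collect asset records from the text content blocks of a tool result.

    Assumes (hypothetically) that each text block holds a JSON array of
    records; records missing any required schema field are skipped.
    """
    response = json.loads(response_json)
    assets = []
    for block in response["result"]["content"]:
        if block.get("type") != "text":
            continue
        for record in json.loads(block["text"]):
            if all(field in record for field in REQUIRED_FIELDS):
                assets.append(record)
    return assets


# Hypothetical tool name and argument, for illustration only.
request = build_tool_call(1, "search_drug_assets", {"target": "TL1A"})

# Hypothetical response with a placeholder asset (not real data).
sample_response = json.dumps({
    "jsonrpc": "2.0",
    "id": 1,
    "result": {"content": [{"type": "text", "text": json.dumps([
        {"name": "ABC-101", "sponsor": "Example Bio", "modality": "mAb",
         "phase": "Preclinical", "indication": "Example indication"},
        {"name": "incomplete record"},
    ])}]},
})
assets = extract_assets(sample_response)
```

The point of the sketch is the division of labor: the model decides when to call the tool and how to present the result, while coverage comes entirely from whatever the index returns.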

Data & Methods

Targets: ten niche oncology/immunology targets chosen for “long-tail” pipelines.
Prompt & output: identical NL prompt and structured JSON schema {name, sponsor, modality, phase, indication} required of all systems; aliases list required to enable alias-aware union-find deduplication.
Systems under test: Gosset (curated index) and Claude, GPT, Gemini, and Perplexity with their respective web-search tooling (typically with a budget of 20 searches per query).
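The alias-aware union-find deduplication mentioned above can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the `normalize` rule (lowercase, alphanumerics only) and the record names are placeholders invented for the example. Two records are merged whenever any of their names or aliases normalize to the same key, and merges are transitive.

```python
from collections import defaultdict


def normalize(name):
    """Lowercase and keep only alphanumerics, so 'EX-9' and 'ex 9' match.
    (Assumed normalization rule; the source does not specify one.)"""
    return "".join(ch for ch in name.lower() if ch.isalnum())


class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path halving
            i = self.parent[i]
        return i

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            # Attach the larger root index under the smaller one.
            self.parent[max(ra, rb)] = min(ra, rb)


def dedupe(records):
    """Group records that share at least one normalized name or alias."""
    uf = UnionFind(len(records))
    first_seen = {}  # normalized key -> index of first record that used it
    for i, rec in enumerate(records):
        for name in [rec["name"], *rec.get("aliases", [])]:
            key = normalize(name)
            if not key:
                continue
            if key in first_seen:
                uf.union(i, first_seen[key])
            else:
                first_seen[key] = i
    groups = defaultdict(list)
    for i, rec in enumerate(records):
        groups[uf.find(i)].append(rec)
    return list(groups.values())


# Placeholder records (not real assets). The first three are linked
# transitively: ABC-101 ~ Examplemab ~ EX-9.
records = [
    {"name": "ABC-101", "aliases": ["Examplemab"]},
    {"name": "examplemab", "aliases": ["EX-9"]},
    {"name": "EX 9", "aliases": []},
    {"name": "XYZ-202", "aliases": []},
]
groups = dedupe(records)
```

Union-find suits this task because code names, INNs, and partner designations chain together: two systems may report the same program under names that only connect through a third alias.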
Scoring definitions:
- Verified (Vs): drugs confirmed by the validation pipeline
- Hallucinated (Hs): claims judged not real or off-target after review
- Precision = Vs / (Vs + Hs)
- Recall (proxy) = Vs / |union of verified drugs across all systems|
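As a worked check, the definitions above reproduce the aggregate figures reported in the Key Points. The verified and hallucinated counts are taken from that list; the union size of 451 is inferred from Gosset's recall of 1.000 rather than stated separately in the source.

```python
# (verified, hallucinated) counts as reported in the aggregate results.
results = {
    "Gosset": (451, 0),
    "GPT + web": (140, 0),
    "Claude + web": (124, 1),
    "Gemini + web": (109, 0),
    "Perplexity": (76, 2),
}

# Inferred: Gosset's recall is 1.000, so the cross-system union of
# verified drugs must equal its verified count.
UNION_SIZE = 451


def precision(verified, hallucinated):
    """Precision = Vs / (Vs + Hs)."""
    total = verified + hallucinated
    return verified / total if total else 0.0


def recall(verified, union_size):
    """Proxy recall = Vs / |union of verified drugs across all systems|."""
    return verified / union_size


metrics = {
    system: (round(precision(v, h), 3), round(recall(v, UNION_SIZE), 3))
    for system, (v, h) in results.items()
}

# Headline ratio: Gosset versus the best frontier system (GPT + web).
ratio = results["Gosset"][0] / results["GPT + web"][0]

for system, (p, r) in metrics.items():
    print(f"{system:13s} precision={p:.3f} recall={r:.3f}")
print(f"Gosset / best frontier = {ratio:.1f}x")
```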
Validation pipeline:
- Deterministic auto-pass: assets with curated clinical-trial / approval evidence auto-verified.
- Three-AI-judge cross-check (Claude/GPT/Gemini) to triage uncertain cases (2-of-3 majority).
- Human expert final sign-off, focusing on disagreements, ambiguous target attributions, alias collisions, indirect evidence.
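The three validation stages can be sketched as a small triage function. This is a simplified reading of the description above, not the authors' code: the `Candidate` fields, the label strings, and the rule that a split vote is flagged for closer human attention are assumptions made for illustration. In the described pipeline a human expert signs off on the final result in every case; the flag only marks where that review would concentrate.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Candidate:
    name: str
    has_curated_evidence: bool  # curated clinical-trial / approval evidence
    judge_votes: Tuple[bool, ...] = ()  # one "is real and on-target" vote per judge


def triage(candidate):
    """Return (provisional_label, needs_human_focus) for one candidate.

    Stage 1: deterministic auto-pass on curated evidence.
    Stage 2: 2-of-3 majority over three LLM judges.
    Split votes (2-1) are flagged so the human reviewer looks at them first.
    """
    if candidate.has_curated_evidence:
        return ("verified", False)

    votes = candidate.judge_votes
    if len(votes) != 3:
        raise ValueError("expected exactly three judge votes")

    yes = sum(votes)
    label = "verified" if yes >= 2 else "hallucinated"
    unanimous = yes in (0, 3)
    return (label, not unanimous)


# Placeholder candidates (not real assets).
examples = [
    Candidate("ABC-101", has_curated_evidence=True),
    Candidate("XYZ-202", False, (True, True, False)),
    Candidate("QRS-303", False, (False, False, False)),
    Candidate("LMN-404", False, (True, False, False)),
]
decisions = [triage(c) for c in examples]
```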
Limitations acknowledged by authors:
- The recall measure is relative to the discoverable universe (cross-system union), not the absolute, possibly partially undisclosed pipeline.
- Targets biased toward areas with stronger Gosset coverage; results may narrow on heavily publicized targets (PD‑1, HER2).
- LLM judges have calibration limits; human review still required.
- Some programs remain invisible if never publicly disclosed.

Implications for AI Economics

Value of proprietary curated data assets:
- The experiment suggests that owning high‑coverage, domain-specific indexes (curated clinical-trial and program metadata) materially improves recall for tasks with long-tail items. Firms that curate such indexes can capture disproportionate value in downstream workflows (scouting, licensing, BD, valuation).
- Economic rents shift from model sophistication toward proprietary data and curation effort in niche, high‑value verticals.
Product design: tool composition beats monolithic web grounding
- The result supports a compositional product strategy: pair a general LLM (strengths: language understanding, summarization, reasoning) with specialized retrieval indexes exposed via standardized tooling (e.g., MCP). This separation is modular and commoditizes model upgrades while preserving data differentiation.
Labour productivity and transaction speed
- Faster, more complete enumeration (Gosset: median 1.2 s; web-augmented systems: medians from 15.3 s to 804.8 s) reduces analyst time per query, accelerates deal sourcing/triage, and lowers time-to-decision, increasing throughput and reducing transaction costs in BD, M&A, licensing, and portfolio management.
Market structure and competition
- Providers of curated vertical indexes (clinical assets, patents, regional registries) can become bottlenecks or gatekeepers, creating opportunities for market concentration and supplier lock-in unless access becomes standardized/competitive.
- Standard APIs (MCP-style) reduce integration costs, enabling LLM vendors to plug into multiple indexes; nonetheless, quality-of-index remains a key differentiator.
Pricing and monetization
- Firms can monetize curated indexes via per-call pricing, subscriptions, or bundled services. Buyers should compare the marginal ROI of (a) paying for curated access vs (b) relying on general web-augmented LLMs plus human search.
Impacts on valuation and deal flow
- Improved recall of long-tail assets changes how early-stage pipelines are discovered and valued. Better coverage can surface previously overlooked assets, affecting deal sourcing, competitive intelligence, and expected returns on scouting investments.
Policy and externalities
- As curated indexes concentrate, concerns arise about transparency, contestability, and auditability (how/why an asset is included). Regulatory scrutiny could appear in contexts where decision-making (drug development portfolios, investment choices) depends on proprietary indexes.
Practical takeaways for firms and economists
- Invest in curated, high‑quality vertical indexes where long-tail coverage matters; expose them via standard tool APIs to interoperate with LLMs.
- When evaluating LLM capabilities for domain tasks, decompose model vs index: improve retrieval/indexing before blaming model reasoning.
- Measure ROI by combining precision/recall improvements with speed gains and downstream financial impact (faster deal closure, higher hit-rate, labor savings).
- Monitor concentration risks and advocate for open benchmarks / interoperability standards to reduce lock-in.


Assessment

Paper Type: descriptive

Evidence Strength: medium. The paper reports large, clear performance differences on concrete metrics (verified drugs per query, precision, recall) using a controlled prompt and output schema, but the benchmark is small (10 targets), domain-specific (niche oncology/immunology, long-tail preclinical/Asian assets), uses the cross-system union as the verification baseline (which may be incomplete or circular), and lacks independent external validation or statistical tests, limiting confidence in generality.

Methods Rigor: medium. Strengths: the study compares five systems under the same natural-language query and identical JSON output schema, and includes an ablation-like probe by exposing the curated index as a callable tool. Weaknesses: small sample of targets, unclear procedure for verifying 'verified drugs' (ground-truth construction), potential selection bias toward assets that favor a curated index, no report of inter-annotator checks or statistical significance, and possible conflicts of interest if authors are affiliated with Gosset.

Sample: Benchmark of five chat-based LLM systems (Gosset with a curated target/modality/indication index; Claude Opus 4.7; GPT 5.5; Gemini 3.1 Pro; Perplexity sonar-pro) evaluated on 10 niche oncology/immunology targets where much of the pipeline is long-tail preclinical and Asian-developed assets; same natural-language query and JSON output schema for all systems; evaluation metric: number of 'verified' drug assets returned per query, precision and recall measured against the cross-system union of verified drugs; additional test exposing curated index as a callable MCP tool for other models.

Themes: productivity, adoption

Generalizability:
- Small sample size (10 targets) limits statistical generality.
- Targets are niche oncology/immunology; results may not hold for other therapeutic areas or broader drug pipelines.
- Performance likely depends on the quality and coverage of the curated index; gains may not generalize where such indices are unavailable.
- Using the cross-system union as the verification baseline may bias recall/precision estimates and is not an independent ground truth.
- Model versions and web-search capabilities change rapidly; results may not hold for future or different model configurations.
- Possible geographic/language bias (Asian-developed assets) may limit applicability to regions with different reporting practices.

Claims (8)

Claim	Category	Direction	Confidence	Outcome	Details
General-purpose LLMs with web search are increasingly used to scout the competitive landscape of pharmaceutical pipelines.	Adoption Rate	positive	high	usage/adoption of general-purpose LLMs with web search for scouting pharmaceutical pipelines	0.09
We benchmark Gosset ... against four frontier systems with web access (Claude Opus 4.7, GPT 5.5, Gemini 3.1 Pro, Perplexity sonar-pro) on ten niche oncology/immunology targets.	Other	neutral	high	comparative retrieval performance on 10 niche oncology/immunology targets	n=10; 0.18
All five systems receive the same natural-language query and the same JSON output schema.	Other	neutral	high	consistency of input/query and output schema across systems	0.18
Across 10 targets Gosset returns 3.2x more verified drugs per query than the best frontier system.	Output Quality	positive	high	number of verified drugs returned per query	n=10; 3.2x more; 0.18
Gosset achieved perfect precision and 100% recall against the cross-system union of verified drugs.	Output Quality	positive	high	precision and recall of verified drug retrieval	n=10; precision = 100%, recall = 100%; 0.18
The same curated index is exposed as a Gosset MCP server that any frontier model can call as a tool.	Other	neutral	high	availability of curated index as callable MCP server	0.18
Exposing the curated index as a callable tool suggests that each frontier system can close most of the recall gap by swapping generic web search for a curated index behind the same chat interface.	Output Quality	positive	high	change in recall gap when using a curated index versus generic web search	close most of the recall gap; 0.03
The 10 selected targets are niche oncology/immunology targets where most of the pipeline lives in the long tail of preclinical and Asian-developed assets.	Other	neutral	medium	distribution of pipeline assets for selected targets (preclinical, Asian-developed concentration)	n=10; 0.05