A curated drug-asset index substantially improves LLM-powered competitive scouting: in a 10-target benchmark, Gosset finds 3.2× more verified long-tail oncology and immunology assets per query than the best web-search model. The same index is exposed as a callable tool, which the authors suggest would let other frontier models recover most of that advantage. The implication is that search and data coverage, not model sophistication, is the main bottleneck.
General-purpose LLMs with web search are increasingly used to scout the competitive landscape of pharmaceutical pipelines. We benchmark Gosset -- an AI platform with a chat interface backed by curated target-, modality-, and indication-level drug-asset annotations -- against four frontier systems with web access (Claude Opus 4.7, GPT 5.5, Gemini 3.1 Pro, Perplexity sonar-pro) on ten niche oncology/immunology targets where most of the pipeline lives in the long tail of preclinical and Asian-developed assets. All five systems receive the same natural-language query and the same JSON output schema. Across 10 targets Gosset returns 3.2x more verified drugs per query than the best frontier system, at perfect precision and 100% recall against the cross-system union of verified drugs. The same curated index is exposed as a Gosset MCP server that any frontier model can call as a tool, suggesting that each of these systems can close most of the recall gap by swapping generic web search for a curated index behind the same chat interface.
Summary
Main Finding
Gosset — an AI chat front end backed by a curated drug-asset index — substantially outperforms four frontier LLMs with live web access (Claude Opus 4.7, GPT 5.5, Gemini 3.1 Pro, Perplexity sonar-pro) for enumerating niche pharma assets. Across 10 niche oncology/immunology targets, Gosset returned 451 verified drug programs (100% precision and 100% recall relative to the cross-system union), roughly 3.2× more verified drugs than the best frontier system (GPT + web, 140). The paper argues the gap is an index problem (coverage of long-tail, early-stage assets), not a reasoning problem, and that frontier models can largely close the gap by calling the same curated index as a tool (via an MCP server).
Key Points
- Experiment scope: 10 niche targets with long preclinical/early-clinical tails — TL1A, OX40L, IL-36R, TROP-2, B7‑H3, ROR1, NaPi2b, Claudin 18.2, FAP, GPRC5D.
- Controlled test: identical natural‑language prompt and JSON output schema given to all systems.
- Systems compared:
  - Gosset (curated index; no live web; responses in roughly 1 to 2 s)
  - Claude + web (Anthropic, Claude Opus 4.7)
  - GPT + web (OpenAI, GPT 5.5)
  - Gemini + web (Google, Gemini 3.1 Pro)
  - Perplexity sonar-pro
- Aggregate (human-reviewed) results:
  - Gosset: 451 verified, 0 hallucinated; precision 1.000, recall (vs union) 1.000, median latency 1.2 s
  - GPT + web: 140 verified, 0 hallucinated; precision 1.000, recall 0.310, median latency 804.8 s
  - Claude + web: 124 verified, 1 hallucinated; precision 0.992, recall 0.275, median latency 103.1 s
  - Gemini + web: 109 verified, 0 hallucinated; precision 1.000, recall 0.242, median latency 106.8 s
  - Perplexity: 76 verified, 2 hallucinated; precision 0.974, recall 0.169, median latency 15.3 s
- Main behavioral pattern: frontier LLMs handle late-stage, press-covered assets well (anchor drugs), but miss large parts of the long tail (early-stage, regional, academic programs). Gosset’s curated index captures that long tail.
- Validation pipeline: deterministic auto-pass for assets with industry-grade evidence; three-LLM judge triage (Claude/GPT/Gemini) for residuals; final human expert sign-off. Alias-aware deduplication used for scoring.
- Latency advantage: Gosset answers in about 1 to 2 s per query, while the web-augmented systems have median latencies ranging from 15.3 s (Perplexity) to 804.8 s (GPT + web).
- Reproducibility/transfer claim: Gosset exposes the index as an MCP server; the authors suggest that frontier LLMs could close most of the recall gap by calling that curated index as a tool instead of relying on generic web search.
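The aggregate figures above can be re-derived from the verified and hallucinated counts alone. The short check below is only a sketch: it assumes the cross-system union contains 451 drugs, which is what a recall of 1.000 for Gosset implies, and the helper names `precision` and `recall_proxy` are illustrative rather than taken from the paper.

```python
# Counts copied from the aggregate results: (verified, hallucinated).
counts = {
    "Gosset": (451, 0),
    "GPT + web": (140, 0),
    "Claude + web": (124, 1),
    "Gemini + web": (109, 0),
    "Perplexity": (76, 2),
}

# Assumption: the union of verified drugs has 451 members, as implied by
# Gosset's reported recall of 1.000 against that union.
UNION_SIZE = 451


def precision(verified: int, hallucinated: int) -> float:
    """Share of returned drugs that survived validation."""
    return verified / (verified + hallucinated)


def recall_proxy(verified: int, union_size: int) -> float:
    """Share of the cross-system union that one system returned."""
    return verified / union_size


metrics = {
    name: (round(precision(v, h), 3), round(recall_proxy(v, UNION_SIZE), 3))
    for name, (v, h) in counts.items()
}

for name, (p, r) in metrics.items():
    print(f"{name:13s} precision={p:.3f} recall={r:.3f}")

# Headline ratio: Gosset versus the best frontier system (GPT + web).
ratio = counts["Gosset"][0] / counts["GPT + web"][0]
print(f"verified-drug ratio: {ratio:.2f}x")
```

Rounded to three decimals, the computed values agree with the reported precision and recall for all five systems, and the ratio of 451 to 140 is about 3.22, consistent with the 3.2× headline.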
Data & Methods
- Targets: ten niche oncology/immunology targets chosen for “long-tail” pipelines.
- Prompt & output: every system receives the identical natural-language prompt and must return structured JSON with the fields {name, sponsor, modality, phase, indication}, plus an aliases list that enables alias-aware union-find deduplication.
- Systems under test: Gosset (curated index) and Claude, GPT, Gemini, and Perplexity with their respective web-search tooling (typically with a budget of 20 searches).
- Scoring definitions:
  - Verified (Vs): drugs confirmed by the validation pipeline
  - Hallucinated (Hs): claims judged not real or off-target after review
  - Precision = Vs / (Vs + Hs)
  - Recall (proxy) = Vs / |union of verified drugs across all systems|
- Validation pipeline:
  - Deterministic auto-pass: assets with curated clinical-trial / approval evidence auto-verified.
  - Three-AI-judge cross-check (Claude/GPT/Gemini) to triage uncertain cases (2-of-3 majority).
  - Human expert final sign-off, focusing on disagreements, ambiguous target attributions, alias collisions, indirect evidence.
- Limitations acknowledged by authors:
  - Recall is measured against the discoverable universe (the cross-system union), not against the absolute pipeline, part of which may be undisclosed.
  - Targets are biased toward areas with stronger Gosset coverage; the gap may narrow on heavily publicized targets (PD‑1, HER2).
  - LLM judges have calibration limits; human review is still required.
  - Some programs remain invisible if never publicly disclosed.
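The validation triage and alias-aware scoring described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`triage`, `recall_vs_union`), the `has_curated_evidence` flag, the lowercase-and-trim name normalization, and the placeholder asset names are all hypothetical. It also assumes that only records that already passed validation are handed to the scoring step.

```python
def triage(has_curated_evidence: bool,
           judge_votes: tuple[bool, bool, bool]) -> tuple[str, bool]:
    """Return (provisional label, judges_split) for one claimed asset."""
    if has_curated_evidence:
        # Deterministic auto-pass: curated trial or approval evidence exists.
        return "verified", False
    yes = sum(judge_votes)
    label = "verified" if yes >= 2 else "hallucinated"  # 2-of-3 majority
    # A 2-1 or 1-2 split is a disagreement for the human reviewer to prioritize.
    return label, yes in (1, 2)


class UnionFind:
    """Disjoint sets over name strings, used to merge aliases of one drug."""

    def __init__(self) -> None:
        self.parent: dict[str, str] = {}

    def find(self, x: str) -> str:
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a: str, b: str) -> None:
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def normalize(name: str) -> str:
    # Assumed normalization; a real pipeline would likely need more rules.
    return name.strip().lower()


def recall_vs_union(verified: dict[str, list[dict]]) -> dict[str, float]:
    """Proxy recall per system: distinct drugs found / distinct drugs in the union."""
    uf = UnionFind()
    for records in verified.values():
        for rec in records:
            names = [normalize(n) for n in [rec["name"], *rec.get("aliases", [])]]
            for other in names:
                uf.union(names[0], other)  # also registers alias-free names
    found = {
        system: {uf.find(normalize(rec["name"])) for rec in records}
        for system, records in verified.items()
    }
    union = set().union(*found.values())
    return {system: len(drugs) / len(union) for system, drugs in found.items()}


# Placeholder records, not real assets.
verified = {
    "system_a": [
        {"name": "ASSET-1", "aliases": ["Code-101"]},
        {"name": "ASSET-2", "aliases": []},
        {"name": "ASSET-3", "aliases": []},
    ],
    "system_b": [
        {"name": "code-101", "aliases": []},
        {"name": "Asset-2 ", "aliases": ["XY-2"]},
    ],
}
recall = recall_vs_union(verified)
print(recall)
```

In this toy input, `code-101` is merged with `ASSET-1` through the shared alias, so the union holds three distinct drugs and `system_b` is credited with two of them. Sharing one union-find structure across all systems is what keeps a drug reported under different names from being counted twice in the denominator.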
Implications for AI Economics
- Value of proprietary curated data assets:
  - The experiment indicates that ownership of high‑coverage, domain-specific indexes (curated clinical, trial, and program metadata) materially improves recall for tasks with long-tail items. Firms that curate such indexes can capture disproportionate value in downstream workflows (scouting, licensing, BD, valuation).
  - For niche, high‑value verticals, economic rents shift from model sophistication to proprietary data and curation effort.
- Product design: tool composition beats monolithic web grounding
  - The result supports a compositional product strategy: pair a general LLM (strengths: language understanding, summarization, reasoning) with specialized retrieval indexes exposed via standardized tooling (e.g., MCP). This modular separation commoditizes model upgrades while preserving data differentiation.
- Labour productivity and transaction speed
  - Faster, more complete enumeration (Gosset: about 1 s median; frontier systems: from 15 s to more than 13 minutes) reduces analyst time per query, accelerates deal sourcing and triage, and lowers time-to-decision. This increases throughput and reduces transaction costs in BD, M&A, licensing, and portfolio management.
- Market structure and competition
  - Providers of curated vertical indexes (clinical assets, patents, regional registries) can become bottlenecks or gatekeepers, creating opportunities for market concentration and supplier lock-in unless access becomes standardized and competitive.
  - Standard APIs (MCP-style) reduce integration costs, enabling LLM vendors to plug into multiple indexes; nonetheless, index quality remains a key differentiator.
- Pricing and monetization
  - Firms can monetize curated indexes via per-call pricing, subscriptions, or bundled services. Buyers should compare the marginal ROI of (a) paying for curated access vs (b) relying on general web-augmented LLMs plus human search.
- Impacts on valuation and deal flow
  - Improved recall of long-tail assets changes how early-stage pipelines are discovered and valued. Better coverage can surface previously overlooked assets, affecting deal sourcing, competitive intelligence, and expected returns on scouting investments.
- Policy and externalities
  - As curated indexes concentrate, concerns arise about transparency, contestability, and auditability (how and why an asset is included). Regulatory scrutiny could appear in contexts where decision-making (drug development portfolios, investment choices) depends on proprietary indexes.
- Practical takeaways for firms and economists
  - Invest in curated, high‑quality vertical indexes where long-tail coverage matters; expose them via standard tool APIs to interoperate with LLMs.
  - When evaluating LLM capabilities for domain tasks, separate the model from the index: improve retrieval and indexing before blaming model reasoning.
  - Measure ROI by combining precision/recall improvements with speed gains and downstream financial impact (faster deal closure, higher hit-rate, labor savings).
  - Monitor concentration risks and advocate for open benchmarks and interoperability standards to reduce lock-in.
Assessment
Claims (8)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| General-purpose LLMs with web search are increasingly used to scout the competitive landscape of pharmaceutical pipelines. [Adoption Rate] | positive | high | usage/adoption of general-purpose LLMs with web search for scouting pharmaceutical pipelines | 0.09 |
| We benchmark Gosset ... against four frontier systems with web access (Claude Opus 4.7, GPT 5.5, Gemini 3.1 Pro, Perplexity sonar-pro) on ten niche oncology/immunology targets. [Other] | neutral | high | comparative retrieval performance on 10 niche oncology/immunology targets | n=10; 0.18 |
| All five systems receive the same natural-language query and the same JSON output schema. [Other] | neutral | high | consistency of input/query and output schema across systems | 0.18 |
| Across 10 targets Gosset returns 3.2x more verified drugs per query than the best frontier system. [Output Quality] | positive | high | number of verified drugs returned per query | n=10; 3.2x more; 0.18 |
| Gosset achieved perfect precision and 100% recall against the cross-system union of verified drugs. [Output Quality] | positive | high | precision and recall of verified drug retrieval | n=10; precision = 100%, recall = 100%; 0.18 |
| The same curated index is exposed as a Gosset MCP server that any frontier model can call as a tool. [Other] | neutral | high | availability of curated index as callable MCP server | 0.18 |
| Exposing the curated index as a callable tool suggests that each frontier system can close most of the recall gap by swapping generic web search for a curated index behind the same chat interface. [Output Quality] | positive | high | change in recall gap when using a curated index versus generic web search | close most of the recall gap; 0.03 |
| The 10 selected targets are niche oncology/immunology targets where most of the pipeline lives in the long tail of preclinical and Asian-developed assets. [Other] | neutral | medium | distribution of pipeline assets for selected targets (preclinical, Asian-developed concentration) | n=10; 0.05 |