A semantic-search system indexed 166 million pediatric notes and cut clinician chart-abstraction time by a quarter to nearly nine‑tenths while serving queries in under a second at roughly $4,000/month; the deployment demonstrates that health-system‑scale semantic retrieval is technically and operationally feasible but evidence on broader clinical and cross‑site impacts remains limited.

Health System Scale Semantic Search Across Unstructured Clinical Notes

Faith Wavinya Mutinda, Spandana Makeneni, Anna Lin, Shivaji Dutta, Irit R. Rasooly, Patrick Dibussolo, Shivani Kamath Belman, Hessam Shahriari, Kevin Murphy, Alex B. Ruan, Barbara H. Chaiyachati, Sanjay Chainani, Robert W. Grundmeier, Scott M. Haag, Jeffrey M. Miller, Heather M. Griffis, Ian M. Campbell · April 28, 2026

arxiv · quasi_experimental · medium evidence · 8/10 relevance

A health-system-scale semantic search deployed on 166 million pediatric notes achieved sub-second query latency, low operating cost (~$4,000/month), high retrieval accuracy, and cut clinician chart-abstraction time by 24–89% compared with manual review.

Introduction: Semantic search, which retrieves documents based on conceptual similarity rather than keyword matching, offers substantial advantages for retrieval of clinical information. However, deploying semantic search across entire health systems, comprising hundreds of millions of clinical notes, presents formidable engineering, cost, and governance challenges that have prevented adoption.

Methods: We deployed a semantic search system at a large children's hospital indexing 166 million clinical notes (484 million vectors) from 1.68 million patients. The system uses instruction-tuned qwen3-embedding-0.6B embeddings, stores vectors in a managed database with storage-optimized indexing, maintains full-text metadata in a low-latency key-value store, and operates within a HIPAA-compliant governance framework. We evaluated the system through three experiments: optimization of embedding model and chunking strategy using a physician-authored benchmark dataset, characterization of full-scale performance (cost, latency, retrieval quality), and clinical utility assessment via comparison of chart abstraction efficiency across three tasks.

Results: The system delivers sub-second query latency (median 237 ms single-user, 451 ms 20-user concurrency) with monthly costs of approximately USD 4,000. Qwen3 embeddings with 300-token chunk size achieved 94.6% accuracy on a clinical question-answering benchmark. In clinical utility evaluation across three abstraction tasks, semantic search reduced time-to-completion by 24 to 89% compared to clinician-performed chart review while maintaining comparable inter-rater agreement.

Conclusion: Health-system-scale semantic search is both technically and operationally feasible. The system provides infrastructure supporting interactive search, cohort generation, and downstream LLM-powered clinical applications without requiring specialized informatics expertise.

Summary

Main Finding

A health-system–scale semantic search system indexing 166 million clinical notes (484 million embedding vectors) is technically and operationally feasible: it returns relevant results with sub-second latency, costs ≈USD 4,000/month to operate (storage-optimized), achieves high retrieval quality (≈94–95% accuracy on a physician-authored QA benchmark), and materially improves chart abstraction efficiency (time reductions of 24–89%) while maintaining comparable inter-rater agreement to traditional EHR review.

Key Points

Scale and scope
- Indexed 166 million clinical notes from 1.68 million patients across 80+ specialties; produced 484 million chunks/vectors.
Embedding & chunking
- Selected qwen3-embedding-0.6B (1024-dim, instruction-tuned) with 300-token chunks and 50-token overlap as the production configuration.
- Chunking prioritized linguistic boundaries to keep concepts coherent.
Retrieval quality
- Physician-authored CHOP_MCQA_v0.5 benchmark (334 questions, 322 patients).
- End-to-end accuracy: 95.5% on reduced index; 94.6% on full index (no significant degradation at scale).
- Main failure modes: temporal mismatches and noisy note types (addressable with metadata/date filters).
Performance & cost
- Median end-to-end latency: 636 ms (single-user); decomposition: 394 ms embedding (CPU), 237 ms vector search, 5 ms key-value lookup.
- Vector search median latency rose to 451 ms at 20 concurrent queries; remained under 1 second at moderate concurrency.
- Monthly operational cost (storage-optimized deployment): ≈USD 4,021 (Vector Search ≈3,420; Bigtable ≈593; networking ≈7). One-time index build cost ≈USD 891. Embedding compute was provided gratis via TPU Research Cloud (TRC) for this project.
- In-memory hosting is far costlier: a small 8% in-memory slice cost ≈USD 8,000/month; full in-memory estimated ≈USD 96,000/month.
Clinical utility
- Three abstraction tasks showed consistent time savings: genetic diagnosis (24% faster), age at first seizure (52% faster), cohort discovery for ballet-related foot injuries (89% faster).
- Inter-rater agreement comparable between semantic search and traditional EHR abstraction (high Fleiss’ κ and Krippendorff’s α; non-significant differences).
Governance & security
- Deployed in HIPAA-compliant environment (Arcus on Google Cloud) with project-level allowlists, audit logging, and exclusions for specially protected notes.

Data & Methods

Data
- Source: CHOP EHR notes Jan 2000–Sep 2025; 166M notes retained after exclusions; metadata includes demographics, note/author attributes, timestamps.
Indexing pipeline
- Extraction → chunking (300-token chunks, 50-token overlap, linguistic boundary-aware) → embedding (qwen3-embedding-0.6B; last-token pooling; L2 normalization) → insert into managed vector DB (Vertex AI Vector Search, storage-optimized) → full text & metadata in Bigtable keyed by note ID.
- Embeddings computed on Google Cloud TPUs (PyTorch/XLA); 11 calendar days for full corpus via TPU Research Cloud allocations.
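The chunking and normalization steps in the pipeline above can be illustrated with a minimal sketch. It assumes a plain fixed-size sliding window over an already tokenized note; the linguistic boundary-aware splitting described in the summary is not reproduced here, and the function names `chunk_tokens` and `l2_normalize` are hypothetical, not taken from the paper.

```python
from math import sqrt


def chunk_tokens(tokens, chunk_size=300, overlap=50):
    """Split a token list into windows of `chunk_size` that share `overlap` tokens.

    Assumption: a simple sliding window. The deployed system also respects
    linguistic boundaries, which this sketch does not attempt.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once this window has reached the end of the note.
        if start + chunk_size >= len(tokens):
            break
    return chunks


def l2_normalize(vector):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


# Hypothetical 700-token note: windows start at tokens 0, 250 and 500.
tokens = [f"tok{i}" for i in range(700)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
print(l2_normalize([3.0, 4.0]))
```

With unit-length vectors, the dot-product similarity used by the vector database ranks chunks in the same order as cosine similarity, which is one common reason to normalize before indexing.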
Vector DB & retrieval
- Uses ScaNN + SOAR approximate nearest-neighbor search with dot-product similarity; supports categorical and numeric filtering (patient, note type, date ranges, etc.); incremental insertion supported.
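To make the combination of dot-product ranking and metadata filtering concrete, here is a small sketch. It is an exhaustive (brute-force) search over an in-memory list, used as a stand-in for the approximate ScaNN + SOAR index; it does not call Vertex AI Vector Search, and the record fields (`chunk_id`, `patient_id`, `note_date`, `vector`) and the sample data are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def search(query_vec, index, top_k=20, patient_id=None, date_range=None):
    """Rank chunks by dot-product similarity after applying optional filters.

    Assumption: exact search over a Python list. A production system would
    use an approximate nearest-neighbor index instead.
    """
    scored = []
    for item in index:
        # Categorical filter: restrict to one patient if requested.
        if patient_id is not None and item["patient_id"] != patient_id:
            continue
        # Range filter: ISO-8601 date strings compare correctly as text.
        if date_range is not None:
            start, end = date_range
            if not (start <= item["note_date"] <= end):
                continue
        scored.append((dot(query_vec, item["vector"]), item["chunk_id"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hypothetical two-dimensional index with three chunks.
index = [
    {"chunk_id": "c1", "patient_id": "p1", "note_date": "2021-03-01", "vector": [1.0, 0.0]},
    {"chunk_id": "c2", "patient_id": "p1", "note_date": "2019-07-15", "vector": [0.6, 0.8]},
    {"chunk_id": "c3", "patient_id": "p2", "note_date": "2021-05-20", "vector": [0.0, 1.0]},
]
query = [0.0, 1.0]

print(search(query, index, top_k=2, patient_id="p1"))
print(search(query, index, date_range=("2020-01-01", "2021-12-31")))
```

Restricting by patient mirrors how the benchmark limited retrieval to the target patient, and the date filter corresponds to the kind of metadata filtering the summary suggests for temporal mismatches.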
Benchmarking
- CHOP_MCQA_v0.5: 334 multiple-choice, patient-specific Qs with 4 distractors; retrieval limited to target patient during evaluation; retrieved top-20 chunks fed to fixed LLM (DeepSeek-R1-Distill-Llama-70B) to select answers; accuracy measured as end-to-end proxy for retrieval quality.
Clinical utility study
- Five clinician abstractors. Three tasks: 1) identifying documented genetic conditions (20 patients), 2) determining age at first lifetime seizure (20 patients), 3) cohort discovery for ballet-related foot injuries (semantic search vs SQL list of 100 candidates).
- Primary outcome: time-to-completion per patient (Mann–Whitney U test). Inter-rater reliability: Fleiss’ κ, Cohen’s κ, Krippendorff’s α; bootstrap tests for cross-method vs within-method differences.
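Of the reliability measures listed, Fleiss' κ is straightforward to compute by hand. The sketch below implements the standard textbook formula in plain Python; it is not the authors' analysis code, and the ratings matrix is invented purely for illustration (five raters, to echo the five abstractors, labelling four patients into two categories).

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a subjects-by-categories matrix of rating counts.

    counts[i][j] is the number of raters who assigned subject i to category j.
    Every subject must be rated by the same number of raters (at least two).
    """
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each subject needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])

    # Observed agreement: mean share of agreeing rater pairs per subject.
    per_subject = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_subject) / n_subjects

    # Chance agreement from the overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (n_subjects * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in proportions)

    if p_expected == 1.0:
        # Only one category was ever used, so kappa is undefined.
        return float("nan")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical ratings: 4 patients, 5 raters, 2 categories (present / absent).
ratings = [[5, 0], [4, 1], [0, 5], [5, 0]]
print(round(fleiss_kappa(ratings), 3))
```

Values near 1 indicate agreement well above chance, while values near 0 indicate agreement no better than chance.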
Governance & compliance
- Project-level containerized deployments, allowlists to enforce IRB-approved access, exclusion of specially protected PHI upstream, full audit logging.

Implications for AI Economics

Cost structure and trade-offs
- Persistent (fixed) monthly operating cost is low relative to potential alternatives if using storage-optimized hosting (≈USD 4k/month). In-memory hosting yields much higher recurring costs (an estimated ≈USD 96,000/month, roughly 24 times as much), so choice of deployment tier materially affects economics.
- Embedding computation is a substantial one-time cost if not subsidized (this deployment used TRC); organizations should budget significant upfront compute for first build and for re-embedding on major model updates.
- The marginal cost per query is low (operational costs largely independent of query volume), making frequent interactive use economically attractive.
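The cost points above can be worked through numerically. This is a minimal sketch that treats the monthly bill as entirely fixed, which is a simplifying assumption based on the statement that operating costs are largely independent of query volume. The monthly figures come from the summary; the query volumes are hypothetical.

```python
# Monthly figures reported in the summary (USD).
STORAGE_OPTIMIZED_USD = 4021      # storage-optimized deployment
IN_MEMORY_ESTIMATE_USD = 96000    # estimated full in-memory deployment


def cost_per_query(monthly_cost_usd, queries_per_month):
    """Average cost per query when the monthly cost is treated as fixed."""
    if queries_per_month <= 0:
        raise ValueError("queries_per_month must be positive")
    return monthly_cost_usd / queries_per_month


# How much more the in-memory tier would cost each month.
tier_ratio = IN_MEMORY_ESTIMATE_USD / STORAGE_OPTIMIZED_USD
print(f"in-memory / storage-optimized: {tier_ratio:.1f}x")

# Hypothetical query volumes: average cost falls as usage rises.
for volume in (1_000, 10_000, 100_000):
    usd = cost_per_query(STORAGE_OPTIMIZED_USD, volume)
    print(f"{volume:>7} queries/month -> USD {usd:.4f} per query")
```

Under this assumption the in-memory tier costs roughly 24 times as much per month, and the average cost per query drops in direct proportion to usage.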
Productivity and labor economics
- Measurable time savings across abstraction tasks (median reductions of 24–89%) translate directly to reduced researcher/clinician time per chart. Economically, even modest per-patient time savings can offset monthly hosting costs when scaled across many abstraction tasks or a high volume of chart reviews.
- Simple illustrative breakeven (example): if average per-patient time saved = 1 minute, and clinician labor cost = USD 100/hour (~USD 1.67/min), saving USD 1.67 per patient implies ≈2,408 patient abstractions/month to offset USD 4,021 hosting cost. Replace inputs (time saved, wage) to get context-specific breakeven.
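The breakeven arithmetic above can be wrapped in a small function so the inputs are easy to swap. This is an illustrative sketch only: the function name `breakeven_abstractions` is hypothetical, and it counts only hosting cost against clinician time saved, ignoring build, governance, and staffing costs.

```python
import math


def breakeven_abstractions(monthly_cost_usd, minutes_saved_per_patient, hourly_wage_usd):
    """Patient abstractions per month needed for time savings to cover hosting.

    Assumption: the only cost is the monthly hosting bill and the only
    benefit is clinician time saved, valued at the hourly wage.
    """
    value_per_patient = minutes_saved_per_patient * hourly_wage_usd / 60.0
    if value_per_patient <= 0:
        raise ValueError("time saved and wage must both be positive")
    return monthly_cost_usd / value_per_patient


# Inputs from the illustrative example: USD 4,021/month, 1 minute saved, USD 100/hour.
needed = breakeven_abstractions(4021, 1.0, 100.0)
print(math.ceil(needed))

# A hypothetical alternative: 5 minutes saved per patient at the same wage.
print(math.ceil(breakeven_abstractions(4021, 5.0, 100.0)))
```

With the unrounded rate of USD 100/60 per minute the first result is about 2,413 abstractions per month; the ≈2,408 quoted in the text comes from rounding the rate to USD 1.67 per minute before dividing.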
Capital vs operational budgeting
- Two-tier cost profile: upfront embedding/build (capital, compute-heavy) and ongoing serving/storage (operational). Institutions may finance the build via research credits or one-time budgets, but should plan recurring operating budgets for serving, storage, and governance.
Vendor choice and lock-in
- Reliance on managed vector DB (Vertex AI Vector Search) and Bigtable simplifies ops and reduces engineering overhead but creates vendor dependency and cost opacity. Economic decisions should weigh lower engineering costs vs long-term vendor pricing risk.
Model choice and efficiency
- Smaller, instruction-tuned embeddings (qwen3-embedding-0.6B) delivered high retrieval quality while limiting vector dimensionality and storage footprint. This highlights a key economic trade-off: marginal gains from larger models may not justify increased storage/compute costs. In this deployment, instruction tuning and design appear to have mattered more than sheer model size.
Externalities and governance costs
- Compliance, IRB processes, allowlists, PHI exclusions, and audit logging impose nontrivial governance overhead. Institutions must account for legal/compliance staffing and monitoring as part of the total cost of ownership.
Value multipliers
- Beyond direct time savings, a shared semantic index enables downstream RAG applications, cohort generation, quality improvement, and faster research cycles—potentially amplifying ROI.
Risks and ongoing costs
- Need for re-embedding and re-indexing over time due to model updates, new notes, and distributional changes; ongoing validation to monitor retrieval quality; costs associated with incident response, external audits, or security upgrades.
Market implications
- The demonstrated economics (low monthly serving cost with high utility) suggest an addressable market for institutional semantic search offerings and managed services tailored to healthcare systems. Vendors and health systems should compete on price-performance, governance features, and strategies to reduce upfront embedding costs (e.g., pooled compute credits, federated sharing of embeddings, or smaller instruction-tuned models).
Policy and labor implications
- Productivity gains may shift clinician/researcher workflows toward higher-value tasks, but workforce planning should anticipate shifting demand for chart abstraction roles, informatics support, and governance specialists.

Limitations and caveats relevant to economics
- Reported operational costs exclude the full market cost of embedding computation (TRC credits used); organizations must budget that expense unless similar credits are available.
- Results are from a single pediatric health system; generalizability and demand elasticities across other health systems may vary.
- Performance and cost trade-offs depend on vendor pricing, regional cloud pricing, and architecture choices; local procurement terms can materially change the economics.

Assessment

Paper Type: quasi_experimental
Evidence Strength: medium. The paper reports a full-system deployment on real-world data (166M notes) and quantifies latency, cost, and benchmark retrieval accuracy, and it measures clinician task time reductions (24–89%), giving direct evidence of productivity gains; however, the clinical utility evaluation appears non-randomized and limited to a small set of abstraction tasks at a single pediatric health system, raising concerns about selection, learning/order effects, and external validity.
Methods Rigor: medium. Engineering evaluation is rigorous (large-scale indexing, model/chunking optimization, detailed operational metrics). The user/productivity evaluation lacks detailed methodological description (sample sizes, randomization, blinding, task assignment, or statistical controls), limiting internal validity and reproducibility; no long-term usage or downstream clinical outcome analyses are reported.
Sample: System indexed 166 million clinical notes (484 million vectors) from 1.68 million patients at a single large children's hospital; embeddings used: instruction-tuned qwen3-embedding-0.6B with 300-token chunks; model/chunking optimized on a physician-authored clinical benchmark; clinical utility assessed via three chart-abstraction tasks performed by clinicians (the abstract does not specify sample sizes or clinician characteristics).
Themes: productivity, human_ai_collab, adoption
Identification: Comparison of clinician chart-abstraction tasks performed with and without access to the deployed semantic search system; optimization experiments used a physician-authored benchmark to select embedding model and chunk size; full-scale performance characterized by system metrics (latency, cost, retrieval quality) rather than randomized assignment. No randomized or instrumental-variable identification is reported, so causal claims rely on controlled task comparisons.
Generalizability:
- Single-site study at a pediatric hospital limits applicability to adult hospitals or other health systems.
- Findings depend on the specific embedding model (qwen3-embedding-0.6B) and chosen chunk size; other models or languages may perform differently.
- Clinical tasks evaluated were limited to chart abstraction; benefits may not translate to other clinical workflows or downstream clinical outcomes.
- Operational cost estimates reflect local cloud and system design choices and may not generalize across regions or procurement models.
- HIPAA-compliant governance and technical stack assumptions may not hold in non-US jurisdictions or smaller institutions.

Claims (10)

Claim	Category	Direction	Confidence	Outcome	Details	Score
We deployed a semantic search system indexing 166 million clinical notes (484 million vectors) from 1.68 million patients.	Other	null_result	high	number_of_notes_indexed / index_size	n=166000000; 166 million clinical notes; 484 million vectors; 1.68 million patients	0.8
The system uses instruction-tuned qwen3-embedding-0.6B embeddings, stores vectors in a managed database with storage-optimized indexing, maintains full-text metadata in a low-latency key-value store, and operates within a HIPAA-compliant governance framework.	Other	null_result	high	system_architecture / governance_compliance		0.8
The authors optimized embedding model and chunking strategy using a physician-authored benchmark dataset.	Other	null_result	high	model_and_chunking_configuration		0.48
The system delivers sub-second query latency: median 237 ms single-user, 451 ms at 20-user concurrency.	Organizational Efficiency	positive	high	query_latency	median 237 ms single-user, 451 ms 20-user concurrency	0.48
Monthly operational cost of running the system is approximately USD 4,000.	Organizational Efficiency	negative	high	monthly_operational_cost	monthly costs of approximately USD 4,000	0.48
Qwen3 embeddings with 300-token chunk size achieved 94.6% accuracy on a clinical question-answering benchmark.	Output Quality	positive	high	accuracy_on_clinical_question_answering_benchmark	94.6% accuracy	0.48
In clinical utility evaluation across three abstraction tasks, semantic search reduced time-to-completion by 24 to 89% compared to clinician-performed chart review.	Task Completion Time	positive	high	time-to-completion	24 to 89% reduction	0.48
Semantic search maintained comparable inter-rater agreement while reducing chart abstraction time.	Output Quality	null_result	high	inter-rater_agreement	comparable inter-rater agreement	0.48
Health-system-scale semantic search is both technically and operationally feasible.	Adoption Rate	positive	medium	feasibility_of_health_system_scale_deployment		0.29
The system provides infrastructure supporting interactive search, cohort generation, and downstream LLM-powered clinical applications without requiring specialized informatics expertise.	Adoption Rate	positive	medium	need_for_specialized_informatics_expertise / applicability_to_downstream_applications		0.05