A semantic-search system indexed 166 million pediatric clinical notes and cut clinician chart-abstraction time by 24 to 89% while serving queries in under a second at roughly USD 4,000/month. The deployment demonstrates that health-system-scale semantic retrieval is technically and operationally feasible, but evidence on broader clinical and cross-site impacts remains limited.
Introduction: Semantic search, which retrieves documents based on conceptual similarity rather than keyword matching, offers substantial advantages for retrieval of clinical information. However, deploying semantic search across entire health systems, comprising hundreds of millions of clinical notes, presents formidable engineering, cost, and governance challenges that have prevented adoption.
Methods: We deployed a semantic search system at a large children's hospital indexing 166 million clinical notes (484 million vectors) from 1.68 million patients. The system uses instruction-tuned qwen3-embedding-0.6B embeddings, stores vectors in a managed database with storage-optimized indexing, maintains full-text metadata in a low-latency key-value store, and operates within a HIPAA-compliant governance framework. We evaluated the system through three experiments: optimization of embedding model and chunking strategy using a physician-authored benchmark dataset, characterization of full-scale performance (cost, latency, retrieval quality), and clinical utility assessment via comparison of chart abstraction efficiency across three tasks.
Results: The system delivers sub-second query latency (median 237 ms single-user, 451 ms 20-user concurrency) with monthly costs of approximately USD 4,000. Qwen3 embeddings with 300-token chunk size achieved 94.6% accuracy on a clinical question-answering benchmark. In clinical utility evaluation across three abstraction tasks, semantic search reduced time-to-completion by 24 to 89% compared to clinician-performed chart review while maintaining comparable inter-rater agreement.
Conclusion: Health-system-scale semantic search is both technically and operationally feasible. The system provides infrastructure supporting interactive search, cohort generation, and downstream LLM-powered clinical applications without requiring specialized informatics expertise.
Summary
Main Finding
A health-system–scale semantic search system indexing 166 million clinical notes (484 million embedding vectors) is technically and operationally feasible: it returns relevant results with sub-second latency, costs ≈USD 4,000/month to operate (storage-optimized), achieves high retrieval quality (≈94–95% accuracy on a physician-authored QA benchmark), and materially improves chart abstraction efficiency (time reductions of 24–89%) while maintaining comparable inter-rater agreement to traditional EHR review.
Key Points
- Scale and scope
  - Indexed 166 million clinical notes from 1.68 million patients across 80+ specialties, producing 484 million chunks/vectors.
- Embedding & chunking
  - Selected qwen3-embedding-0.6B (1024-dim, instruction-tuned) with 300-token chunks and 50-token overlap as the production configuration.
  - Chunking prioritized linguistic boundaries to keep concepts coherent.
- Retrieval quality
  - Physician-authored CHOP_MCQA_v0.5 benchmark (334 questions, 322 patients).
  - End-to-end accuracy: 95.5% on the reduced index; 94.6% on the full index (no significant degradation at scale).
  - Main failure modes: temporal mismatches and noisy note types (addressable with metadata/date filters).
- Performance & cost
  - Median end-to-end latency: 636 ms (single-user), decomposed as 394 ms embedding (CPU), 237 ms vector search, and 5 ms key-value lookup.
  - Vector search median latency rose to 451 ms at 20 concurrent queries and remained under 1 second at moderate concurrency.
  - Monthly operational cost (storage-optimized deployment): ≈USD 4,021 (Vector Search ≈3,420; Bigtable ≈593; networking ≈7). One-time index build cost ≈USD 891. Embedding compute was provided at no charge via TPU Research Cloud (TRC) for this project.
  - In-memory hosting is far costlier: an 8% in-memory slice cost ≈USD 8,000/month; full in-memory hosting is estimated at ≈USD 96,000/month.
- Clinical utility
  - Three abstraction tasks showed consistent time savings: genetic diagnosis (24% faster), age at first seizure (52% faster), cohort discovery for ballet-related foot injuries (89% faster).
  - Inter-rater agreement was comparable between semantic search and traditional EHR abstraction (high Fleiss’ κ and Krippendorff’s α; non-significant differences).
- Governance & security
  - Deployed in a HIPAA-compliant environment (Arcus on Google Cloud) with project-level allowlists, audit logging, and exclusions for specially protected notes.
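The latency and cost figures in the key points can be cross-checked with simple arithmetic. The sketch below uses only numbers quoted in this summary; the variable names are illustrative, and summing per-stage medians is only an approximation of an end-to-end median in general (here the reported components happen to add up to the reported total).

```python
# Figures quoted in the summary; variable names are illustrative assumptions.
latency_ms = {"embedding_cpu": 394, "vector_search": 237, "kv_lookup": 5}
monthly_usd = {"vector_search": 3420, "bigtable": 593, "networking": 7}

# Sum of the reported per-stage medians (reported end-to-end median: 636 ms).
total_latency_ms = sum(latency_ms.values())

# Sum of the rounded cost components (reported total: about USD 4,021).
monthly_total_usd = sum(monthly_usd.values())

# Ratio of the estimated full in-memory cost to the storage-optimized cost.
in_memory_ratio = 96_000 / 4_021

# Average number of chunks (vectors) produced per note.
vectors_per_note = 484_000_000 / 166_000_000

print(f"end-to-end latency: {total_latency_ms} ms")
print(f"monthly cost from components: USD {monthly_total_usd}")
print(f"in-memory vs storage-optimized: {in_memory_ratio:.1f}x")
print(f"vectors per note: {vectors_per_note:.2f}")
```

The component costs sum to USD 4,020, one dollar below the reported ≈USD 4,021, which is consistent with each component having been rounded. The in-memory estimate comes out at roughly 24 times the storage-optimized cost.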
Data & Methods
- Data
  - Source: CHOP EHR notes Jan 2000–Sep 2025; 166M notes retained after exclusions; metadata includes demographics, note/author attributes, and timestamps.
- Indexing pipeline
  - Extraction → chunking (300-token chunks, 50-token overlap, linguistic boundary-aware) → embedding (qwen3-embedding-0.6B; last-token pooling; L2 normalization) → insertion into a managed vector DB (Vertex AI Vector Search, storage-optimized) → full text & metadata in Bigtable keyed by note ID.
  - Embeddings were computed on Google Cloud TPUs (PyTorch/XLA); the full corpus took 11 calendar days via TPU Research Cloud allocations.
- Vector DB & retrieval
  - Uses ScaNN + SOAR approximate nearest-neighbor search with dot-product similarity; supports categorical and numeric filtering (patient, note type, date ranges, etc.); incremental insertion is supported.
- Benchmarking
  - CHOP_MCQA_v0.5: 334 multiple-choice, patient-specific questions with 4 distractors; retrieval was limited to the target patient during evaluation; the top-20 retrieved chunks were fed to a fixed LLM (DeepSeek-R1-Distill-Llama-70B) to select answers; accuracy was measured as an end-to-end proxy for retrieval quality.
- Clinical utility study
  - Five clinician abstractors. Three tasks: 1) identify documented genetic conditions (20 patients), 2) age at first lifetime seizure (20 patients), 3) cohort discovery for ballet-related foot injuries (semantic search vs SQL list of 100 candidates).
  - Primary outcome: time-to-completion per patient (Mann–Whitney U test). Inter-rater reliability: Fleiss’ κ, Cohen’s κ, Krippendorff’s α; bootstrap tests for cross-method vs within-method differences.
- Governance & compliance
  - Project-level containerized deployments, allowlists to enforce IRB-approved access, exclusion of specially protected PHI upstream, full audit logging.
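Two steps of the indexing pipeline can be illustrated in a few lines: overlapping fixed-size chunking, and why L2 normalization pairs naturally with dot-product search. This is a minimal sketch under stated assumptions, not the authors' implementation: tokens here are plain list items standing in for model-tokenizer tokens, the linguistic boundary awareness described above is omitted, and the function names `chunk_tokens`, `l2_normalize`, and `dot` are hypothetical.

```python
import math
from typing import List, Sequence


def chunk_tokens(tokens: Sequence[str], size: int = 300, overlap: int = 50) -> List[List[str]]:
    """Split tokens into windows of at most `size` tokens.

    Each window after the first starts `size - overlap` tokens after the
    previous one, so consecutive windows share `overlap` tokens.
    Assumption: no sentence- or section-boundary handling.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    stride = size - overlap
    chunks: List[List[str]] = []
    start = 0
    while start < len(tokens):
        chunks.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the note
        start += stride
    return chunks


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product; for unit-length inputs this equals cosine similarity."""
    return sum(x * y for x, y in zip(a, b))
```

With the default settings, a 700-token note yields three chunks of 300, 300, and 200 tokens, with a stride of 250 tokens between chunk starts. Because the embeddings are L2-normalized before insertion, ranking by dot product gives the same ordering as ranking by cosine similarity.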
Implications for AI Economics
- Cost structure and trade-offs
  - Persistent (fixed) monthly operating cost is low relative to potential alternatives when using storage-optimized hosting (≈USD 4k/month). In-memory hosting yields much higher recurring costs (an estimated ≈USD 96,000/month, roughly 24 times more), so the choice of deployment tier materially affects economics.
  - Embedding computation is a substantial one-time cost if not subsidized (this deployment used TRC); organizations should budget significant upfront compute for the first build and for re-embedding on major model updates.
  - The marginal cost per query is low (operational costs are largely independent of query volume), making frequent interactive use economically attractive.
- Productivity and labor economics
  - Measurable time savings across abstraction tasks (median reductions of 24–89%) translate directly to reduced researcher/clinician time per chart. Economically, even modest per-patient time savings can offset monthly hosting costs when scaled across many abstraction tasks or a high volume of chart reviews.
  - Simple illustrative breakeven (example): if average per-patient time saved = 1 minute and clinician labor cost = USD 100/hour (~USD 1.67/min), each abstraction saves about USD 1.67, so roughly 2,400 patient abstractions/month offset the USD 4,021 hosting cost (4,021 / 1.67 ≈ 2,408; about 2,413 if the per-minute rate is not rounded). Replace the inputs (time saved, wage) to get a context-specific breakeven.
- Capital vs operational budgeting
  - Two-tier cost profile: upfront embedding/build (capital, compute-heavy) and ongoing serving/storage (operational). Institutions may finance the build via research credits or one-time budgets, but should plan recurring operating budgets for serving, storage, and governance.
- Vendor choice and lock-in
  - Reliance on a managed vector DB (Vertex AI Vector Search) and Bigtable simplifies ops and reduces engineering overhead but creates vendor dependency and cost opacity. Economic decisions should weigh lower engineering costs against long-term vendor pricing risk.
- Model choice and efficiency
  - A smaller, instruction-tuned embedding model (qwen3-embedding-0.6B) delivered high retrieval quality while limiting vector dimensionality and storage footprint. This highlights a key economic trade-off: marginal gains from larger models may not justify increased storage/compute costs. In this deployment, instruction tuning and design appeared to matter more than sheer model size.
- Externalities and governance costs
  - Compliance, IRB processes, allowlists, PHI exclusions, and audit logging impose nontrivial governance overhead. Institutions must account for legal/compliance staffing and monitoring as part of the total cost of ownership.
- Value multipliers
  - Beyond direct time savings, a shared semantic index enables downstream RAG applications, cohort generation, quality improvement, and faster research cycles, potentially amplifying ROI.
- Risks and ongoing costs
  - Re-embedding and re-indexing will be needed over time due to model updates, new notes, and distributional changes. Other ongoing costs include validation to monitor retrieval quality, incident response, external audits, and security upgrades.
- Market implications
  - The demonstrated economics (low monthly serving cost with high utility) suggest an addressable market for institutional semantic search offerings and managed services tailored to healthcare systems. Vendors and health systems should compete on price-performance, governance features, and strategies to reduce upfront embedding costs (e.g., pooled compute credits, federated sharing of embeddings, or smaller instruction-tuned models).
- Policy and labor implications
  - Productivity gains may shift clinician/researcher workflows toward higher-value tasks, but workforce planning should anticipate shifting demand for chart abstraction roles, informatics support, and governance specialists.
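The illustrative breakeven in the productivity bullet can be turned into a small calculator. This is a sketch under stated assumptions: the function name `breakeven_reviews` and its parameters are hypothetical, the default inputs are the illustrative values from the text (not measured savings), and the optional one-time cost is spread evenly over a chosen number of months.

```python
import math


def breakeven_reviews(
    monthly_cost_usd: float,
    minutes_saved_per_review: float,
    hourly_rate_usd: float,
    one_time_cost_usd: float = 0.0,
    amortization_months: int = 12,
) -> int:
    """Chart reviews per month needed for labor savings to cover hosting cost.

    Any one-time cost is amortized linearly over `amortization_months`.
    """
    if minutes_saved_per_review <= 0 or hourly_rate_usd <= 0:
        raise ValueError("time saved and hourly rate must be positive")
    if amortization_months <= 0:
        raise ValueError("amortization_months must be positive")
    saving_per_review = minutes_saved_per_review * hourly_rate_usd / 60.0
    effective_monthly_cost = monthly_cost_usd + one_time_cost_usd / amortization_months
    return math.ceil(effective_monthly_cost / saving_per_review)


# Illustrative inputs from the text: USD 4,021/month, 1 minute saved, USD 100/hour.
print(breakeven_reviews(4021, 1, 100))
# Same inputs, also amortizing the USD 891 index build over 12 months.
print(breakeven_reviews(4021, 1, 100, one_time_cost_usd=891))
```

With the unrounded per-minute rate (USD 100/60), the illustrative inputs give 2,413 reviews per month, slightly above the figure obtained from the rounded USD 1.67/min rate. Embedding compute is not included here because it was provided at no charge in this deployment; an institution paying for it could pass that amount as part of `one_time_cost_usd`.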
Limitations and caveats relevant to economics
- Reported operational costs exclude the full market cost of embedding computation (TRC credits used); organizations must budget that expense unless similar credits are available.
- Results are from a single pediatric health system; generalizability and demand elasticities across other health systems may vary.
- Performance and cost trade-offs depend on vendor pricing, regional cloud pricing, and architecture choices; local procurement terms can materially change the economics.
Assessment
Claims (10)
| Claim | Category | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|---|
| We deployed a semantic search system indexing 166 million clinical notes (484 million vectors) from 1.68 million patients. | Other | null_result | high | number_of_notes_indexed / index_size | n=166000000; "166 million clinical notes; 484 million vectors; 1.68 million patients"; 0.8 |
| The system uses instruction-tuned qwen3-embedding-0.6B embeddings, stores vectors in a managed database with storage-optimized indexing, maintains full-text metadata in a low-latency key-value store, and operates within a HIPAA-compliant governance framework. | Other | null_result | high | system_architecture / governance_compliance | 0.8 |
| The authors optimized embedding model and chunking strategy using a physician-authored benchmark dataset. | Other | null_result | high | model_and_chunking_configuration | 0.48 |
| The system delivers sub-second query latency: median 237 ms single-user, 451 ms at 20-user concurrency. | Organizational Efficiency | positive | high | query_latency | "median 237 ms single-user, 451 ms 20-user concurrency"; 0.48 |
| Monthly operational cost of running the system is approximately USD 4,000. | Organizational Efficiency | negative | high | monthly_operational_cost | "monthly costs of approximately USD 4,000"; 0.48 |
| Qwen3 embeddings with 300-token chunk size achieved 94.6% accuracy on a clinical question-answering benchmark. | Output Quality | positive | high | accuracy_on_clinical_question_answering_benchmark | "94.6% accuracy"; 0.48 |
| In clinical utility evaluation across three abstraction tasks, semantic search reduced time-to-completion by 24 to 89% compared to clinician-performed chart review. | Task Completion Time | positive | high | time-to-completion | "24 to 89% reduction"; 0.48 |
| Semantic search maintained comparable inter-rater agreement while reducing chart abstraction time. | Output Quality | null_result | high | inter-rater_agreement | "comparable inter-rater agreement"; 0.48 |
| Health-system-scale semantic search is both technically and operationally feasible. | Adoption Rate | positive | medium | feasibility_of_health_system_scale_deployment | 0.29 |
| The system provides infrastructure supporting interactive search, cohort generation, and downstream LLM-powered clinical applications without requiring specialized informatics expertise. | Adoption Rate | positive | medium | need_for_specialized_informatics_expertise / applicability_to_downstream_applications | 0.05 |