Decomposing AI recommendations into individually verifiable claims sharply raises clinician trust: the share of clinicians expressing trust rises from 27% to 66%, a large effect (d = 0.94). Traditional transparency tools produce only modest, dose-responsive improvements.
Question: Does atomic fact-checking, which decomposes AI treatment recommendations into individually verifiable claims linked to source guideline documents, increase clinician trust compared to traditional explainability approaches?
Findings: In this randomized trial of 356 clinicians generating 7,476 trust ratings, atomic fact-checking produced a large effect on trust (Cohen's d = 0.94), increasing the proportion of clinicians expressing trust from 26.9% to 66.5%. Traditional transparency mechanisms showed a dose-response gradient of improvement over baseline (d = 0.25 to 0.50).
Meaning: Decomposing AI recommendations into individually verifiable claims linked to source guidelines produces substantially higher clinician trust than traditional explainability approaches in high-stakes clinical decisions.
Summary
Main Finding
Atomic fact-checking (AFC)—decomposing LLM recommendations into discrete, verifiable claims each linked to the exact guideline passage—substantially increases clinician trust in oncology decision support versus traditional transparency measures. In a randomized trial of 356 clinicians producing 7,476 trust ratings, AFC raised mean trust from ~2.89 (pooled controls) to 3.80 on a 5-point scale (unstandardized difference 0.91, Cohen’s d = 0.94, P < .001) and increased the proportion expressing trust (score ≥ 4) from 26.9% to 66.5% (absolute increase 39.5 percentage points; NNT = 2.53).
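To make the idea of atomic fact-checking concrete, the following is a minimal sketch of how a decomposed recommendation might be represented in code. The class names, field names, status values, and placeholder strings are all hypothetical; this summary does not describe the schema the study actually used.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    # Hypothetical status values. The summary only says each claim carries
    # a "verification status indicator".
    SUPPORTED = "supported"
    NOT_SUPPORTED = "not_supported"
    NOT_FOUND = "not_found"


@dataclass
class AtomicClaim:
    text: str               # one discrete, checkable statement
    status: Status          # verification status shown to the clinician
    guideline_passage: str  # exact passage offered as evidence ("" if none)
    guideline_ref: str      # pointer to the source document or section


@dataclass
class Recommendation:
    summary: str
    claims: list[AtomicClaim] = field(default_factory=list)

    def unsupported(self) -> list[AtomicClaim]:
        """Claims that are not marked as supported by a guideline passage."""
        return [c for c in self.claims if c.status is not Status.SUPPORTED]

    def fully_supported(self) -> bool:
        """True only if there is at least one claim and all are supported."""
        return bool(self.claims) and not self.unsupported()


# Placeholder content only; no clinical facts are implied.
rec = Recommendation(
    summary="<recommendation text>",
    claims=[
        AtomicClaim("<claim 1>", Status.SUPPORTED, "<passage 1>", "<guideline, section>"),
        AtomicClaim("<claim 2>", Status.NOT_FOUND, "", ""),
    ],
)
flagged = [c.text for c in rec.unsupported()]
print(flagged)
```

The design choice worth noting is that verification is attached to each claim rather than to the recommendation as a whole, so a reader can see exactly which statements still need manual checking.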
Key Points
- Intervention tested: five presentation formats of GPT-4.5–generated oncology recommendations:
  - Recommendation only
  - Recommendation + natural-language explanation
  - Recommendation + source citations
  - Recommendation + explanation + citations
  - Recommendation + explanation + citations + AFC (Group 5)
- Primary outcome: trust rated on a validated 5-point Likert scale (1–5) after each case.
- Main quantitative results:
  - Mean trust scores: Group 1 = 2.59; Group 2 = 2.84; Group 3 = 3.01; Group 4 = 3.09; Group 5 (AFC) = 3.80.
  - AFC vs pooled controls: Cohen’s d = 0.94 (95% CI 0.88–1.00), mean difference 0.91 (95% CI 0.86–0.96).
  - Trust prevalence (score ≥ 4): pooled controls 26.9% vs AFC 66.5% (absolute +39.5 pp).
- Effect robust across specialties (radiology, medical oncology, radiation oncology), cancer types, and experience levels.
- Traditional transparency (explanations and citations) produced modest, dose-responsive improvements over baseline (d = 0.25 to 0.50), far smaller than the AFC effect.
- Modeling: linear mixed-effects models with crossed random effects for participant and case; GEE for proportions; sensitivity analyses (ordinal models, imputations) confirmed robustness.
- Trial scope: 356 completed participants, 21 cases per participant across seven cancer types, 7,476 trust ratings.
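The headline figures above are linked by simple arithmetic, and a short check makes the relationships explicit. This sketch only uses the rounded numbers reported in this summary; the variable names are illustrative, and the implied standard deviation is back-calculated here rather than reported by the study.

```python
# Figures reported in the summary above (rounded).
participants, cases_each = 356, 21
p_control, p_afc = 0.269, 0.665     # share with trust score >= 4
mean_diff, cohens_d = 0.91, 0.94    # AFC vs pooled controls

# Each participant rated every case once.
total_ratings = participants * cases_each

# Absolute difference in proportions, and number needed to treat
# (NNT = 1 / absolute difference).
abs_increase = p_afc - p_control
nnt = 1 / abs_increase

# Cohen's d = mean difference / SD, so the implied SD is diff / d.
implied_sd = mean_diff / cohens_d

print(total_ratings, round(abs_increase * 100, 1), round(nnt, 2), round(implied_sd, 2))
```

With the rounded percentages the gap comes out at 39.6 points rather than the reported 39.5, presumably because the original figure was computed from unrounded proportions. Either value gives an NNT of about 2.53, meaning roughly five clinicians shown AFC output yield two additional clinicians who express trust.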
Data & Methods
- Design: Prospective, randomized, controlled trial (CONSORT-aligned). Participants were randomized to arms (with/without explanations) and sub-randomized to one of five group formats, stratified by specialty.
- Participants: 356 practicing clinicians (160 diagnostic radiologists, 111 medical oncologists, 85 radiation oncologists) recruited online. Participants varied in experience level and prior AI exposure.
- Cases: 21 de-identified, clinically realistic cases per participant covering prostate, breast, lung, colorectal, kidney, liver (HCC), and lymphoma; each specialty had discipline-specific case sets and decision endpoints.
- AI outputs: Recommendations produced by GPT-4.5 with specialty prompts and few-shot examples; outputs validated for guideline concordance by board-certified specialists.
- AFC intervention: each recommendation was decomposed into atomic factual claims; each claim was accompanied by a verification status indicator and a direct link/highlight to the exact guideline passage supporting that claim.
- Outcomes & analysis:
  - Primary: 5-point trust Likert score post-case.
  - Secondary: proportion with trust (≥4), subgroup effects, NNT.
  - Statistical approach: linear mixed-effects models for mean differences (participant and case random effects), GEE for proportions (exchangeable correlation), Tukey-adjusted pairwise tests, sensitivity checks (ordinal mixed models, multiple imputation).
- Trial validity notes: investigators analyzing outcomes were blinded to allocation; participants were blinded to alternative formats. Participant ICC = 0.18; case ICC = 0.06.
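The participant ICC reported above indicates why the analysis relies on mixed-effects models rather than treating all 7,476 ratings as independent. As a rough illustration only, the Kish design-effect approximation, DEFF = 1 + (m - 1) × ICC, shows how much repeated ratings from the same clinician inflate variance. This formula assumes a single level of equal-sized clusters, whereas the trial has crossed participant and case effects, so the result is a back-of-envelope heuristic and not part of the study's analysis.

```python
def design_effect(cluster_size: int, icc: float) -> float:
    """Kish approximation: variance inflation from within-cluster correlation."""
    return 1 + (cluster_size - 1) * icc


n_ratings = 7476
ratings_per_participant = 21
icc_participant = 0.18  # reported participant ICC

deff = design_effect(ratings_per_participant, icc_participant)

# Approximate number of independent ratings carrying the same information.
effective_n = n_ratings / deff

print(round(deff, 2), round(effective_n))
```

Under these assumptions the ratings carry roughly the information of 1,600 independent observations, not 7,476, which is why ignoring the clustering would overstate precision.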
Implications for AI Economics
- Adoption & Diffusion
  - Trust is a key barrier to clinical adoption. AFC materially increases clinicians’ reported trust in LLM recommendations, which suggests higher uptake rates for decision-support tools that incorporate AFC-style verification.
  - The large effect size and low NNT suggest that relatively small investments in AFC UX could yield large gains in clinician-level trust and, plausibly, adoption.
- Value of Verification vs. Explainability
  - AFC’s superior impact suggests economic value lies more in evidence-linking and verifiability (reducing cognitive verification cost) than in opaque natural-language explanations. Product teams and health IT buyers should prioritize verifiable-source UX over longer-form explanations.
- Product Design & Costing
  - AFC requires granular mapping from model claims to guideline passages (development effort, guideline licensing/access, annotation/maintenance). These are upfront engineering and content costs that may be recouped via faster adoption, higher utilization, or premium pricing for certified deployments.
  - Ongoing maintenance costs: guidelines change, so AFC systems need pipelines for guideline updates and re-linking. This creates recurring costs but also potential revenue streams for curated guideline APIs/indices.
- Liability & Regulatory Considerations
  - AFC reduces informational asymmetry by enabling clinicians to confirm claims against authoritative text, which could mitigate malpractice risk and support defensible use of AI recommendations. It does not, however, remove clinician responsibility. Regulatory frameworks may favor systems that provide traceable evidence links.
  - Regulators and payers may be more likely to approve/reimburse AI decision-support tools that enable atomic verifiability and clinician confirmability.
- Efficiency and Workflow Trade-offs
  - AFC reduces the cognitive load for confirmation (itemized checks), but it introduces an explicit verification interaction. Net impact on time-per-decision is uncertain: AFC may speed decision confidence for many cases (reducing follow-up literature searches) but could add microtasks in marginal cases.
  - Economic analyses should measure time savings (or losses) per case, downstream impact on ordering, referrals, and treatment choices, and potential reductions in adverse events from better-calibrated trust.
- Market & Competitive Implications
  - AFC-like features can be product differentiators for clinical AI vendors. Firms that operationalize verifiable claim-linking at scale could capture market share in hospital procurement and specialty workflows.
  - There may be a market for third-party services that maintain canonical guideline-to-claim mappings (data-as-a-service for AFC).
- Externalities & Systemic Effects
  - If AFC increases reliance on AI in contexts where the underlying model is incorrect, it could entrench errors when claims are mis-linked or guideline interpretation is flawed. Economic value therefore depends on the fidelity of claim extraction and linking.
  - Positive externality: better calibrated clinician trust could increase productivity and standardize care, potentially reducing variation and downstream costs.
- Research & Evaluation Needs for Economic Modeling
  - This study measures trust (a leading adoption indicator) but not clinical accuracy, patient outcomes, or workflow time, which are key inputs for cost-effectiveness and ROI models.
  - Next steps for economic assessment: randomized trials measuring downstream utilization, time-to-decision, error rates, patient outcomes, and maintenance costs of AFC systems.
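The recurring maintenance cost noted under Product Design & Costing (guidelines change, so claim links must be re-checked) can be illustrated with a small sketch. It assumes, purely for illustration, that each claim link stores a hash of the guideline passage it pointed to, and that a newer guideline version can be queried by claim ID. The function names, the dictionary layout, and the placeholder passages are hypothetical and are not taken from the study.

```python
import hashlib


def fingerprint(passage: str) -> str:
    """Stable hash of a passage, ignoring case and whitespace differences."""
    normalized = " ".join(passage.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def stale_links(links: dict[str, str], current_passages: dict[str, str]) -> list[str]:
    """Return claim IDs whose linked passage changed or disappeared.

    links: claim_id -> fingerprint stored when the link was created
    current_passages: claim_id -> passage text in the latest guideline version
    """
    stale = []
    for claim_id, stored in links.items():
        passage = current_passages.get(claim_id)
        if passage is None or fingerprint(passage) != stored:
            stale.append(claim_id)
    return sorted(stale)


# Placeholder passages only; no real guideline text is used.
old = {"c1": "<passage A>", "c2": "<passage B>", "c3": "<passage C>"}
links = {cid: fingerprint(text) for cid, text in old.items()}

# New version: c1 differs only in case/whitespace, c2 was revised, c3 was removed.
new = {"c1": "<Passage  A>", "c2": "<passage B, revised>"}
result = stale_links(links, new)
print(result)
```

A check like this only flags links for human review; deciding whether a revised passage still supports the claim remains a curation task, which is where the recurring cost arises.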
Caveats and limitations to consider for economic interpretation
- Outcome is clinician-reported trust, not objective patient outcomes or correctness of decisions; stronger economic claims require linkage to clinical effectiveness and resource use.
- Trial used GPT-4.5 and curated guideline links in a controlled study; real-world integration, guideline coverage, and model fidelity may vary.
- AFC production has nontrivial engineering and curation costs; economic ROI depends on scale, update frequency, and willingness-to-pay by provider organizations.
Overall: AFC appears to be a high-impact, design-level intervention for raising clinician trust in LLM recommendations. From an AI-economics perspective, investing in verifiable, guideline-linked outputs could materially increase adoption and justify the associated development and maintenance costs, provided AFC systems are accurate, updatable, and integrated in a way that minimizes workflow friction.
Assessment
Claims (5)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| Atomic fact-checking produced a large effect on clinician trust (Cohen's d = 0.94). [Worker Satisfaction] | positive | high | clinician trust (trust ratings) | n=356; Cohen's d = 0.94; 1.0 |
| Atomic fact-checking increased the proportion of clinicians expressing trust from 26.9% to 66.5%. [Worker Satisfaction] | positive | high | proportion of clinicians expressing trust | n=356; increase from 26.9% to 66.5%; 1.0 |
| Traditional transparency/explainability mechanisms showed a dose-response gradient of improvement over baseline (Cohen's d ranged from 0.25 to 0.50). [Worker Satisfaction] | positive | high | clinician trust (trust ratings) under traditional transparency mechanisms | n=356; Cohen's d = 0.25 to 0.50; 0.6 |
| The study was a randomized trial of 356 clinicians generating 7,476 trust ratings. [Other] | null_result | high | number of trust ratings collected (trial metadata) | n=356; 7,476 trust ratings collected; 1.0 |
| Decomposing AI recommendations into individually verifiable claims linked to source guidelines produces substantially higher clinician trust than traditional explainability approaches in high-stakes clinical decisions. [Worker Satisfaction] | positive | high | clinician trust in AI treatment recommendations | n=356; Cohen's d = 0.94 (atomic fact-checking) vs d = 0.25–0.50 (traditional mechanisms); increase from 26.9% to 66.5% trusting; 1.0 |