
Decomposing AI recommendations into individually verifiable claims sharply raises clinician trust: the share of clinicians expressing trust rises from 27% to 66%, a large effect (d = 0.94). Traditional transparency tools produce only modest, dose-response improvements.

Atomic Fact-Checking Increases Clinician Trust in Large Language Model Recommendations for Oncology Decision Support: A Randomized Controlled Trial

Lisa C. Adams, Linus Marx, Erik Thiele Orberg, Keno Bressem, Sebastian Ziegelmayer, Denise Bernhardt, Markus Graf, Marcus R. Makowski, Stephanie E. Combs, Florian Matthes, Jan C. Peeken · May 05, 2026

arXiv · RCT · high evidence · 7/10 relevance

In a randomized trial of 356 clinicians (7,476 ratings), presenting AI recommendations as atomic, verifiable claims linked to guideline sources markedly increased clinician trust from 26.9% to 66.5% (Cohen's d = 0.94) compared with traditional explainability approaches.

Question: Does atomic fact-checking, which decomposes AI treatment recommendations into individually verifiable claims linked to source guideline documents, increase clinician trust compared to traditional explainability approaches?

Findings: In this randomized trial of 356 clinicians generating 7,476 trust ratings, atomic fact-checking produced a large effect on trust (Cohen's d = 0.94), increasing the proportion of clinicians expressing trust from 26.9% to 66.5%. Traditional transparency mechanisms showed a dose-response gradient of improvement over baseline (d = 0.25 to 0.50).

Meaning: Decomposing AI recommendations into individually verifiable claims linked to source guidelines produces substantially higher clinician trust than traditional explainability approaches in high-stakes clinical decisions.

Summary

Main Finding

Atomic fact-checking (AFC), which decomposes LLM recommendations into discrete, verifiable claims each linked to the exact guideline passage, substantially increases clinician trust in oncology decision support compared with traditional transparency measures. In a randomized trial of 356 clinicians producing 7,476 trust ratings, AFC raised mean trust from about 2.89 (pooled controls) to 3.80 on a 5-point scale (unstandardized difference 0.91, Cohen's d = 0.94, P < .001). It also increased the proportion expressing trust (score ≥ 4) from 26.9% to 66.5% (absolute increase 39.5 percentage points; NNT = 2.53).
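The headline figures can be cross-checked with a few lines of arithmetic. The sketch below is not from the paper. It assumes that NNT is the reciprocal of the absolute increase in the proportion expressing trust, and that Cohen's d is the mean difference divided by a pooled standard deviation, so the "implied SD" is only a back-calculation and not a reported statistic.

```python
# Values as reported in the summary above.
trust_control = 0.269      # proportion with trust score >= 4, pooled controls
trust_afc = 0.665          # proportion with trust score >= 4, AFC
reported_increase = 0.395  # reported absolute increase (39.5 percentage points)
mean_diff = 0.91           # unstandardized mean difference on the 5-point scale
cohens_d = 0.94            # standardized effect size

# Difference of the rounded proportions; can differ slightly from the
# reported increase because the inputs are themselves rounded.
diff_from_rounded = trust_afc - trust_control

# NNT as the reciprocal of the absolute increase (assumed definition).
nnt = 1 / reported_increase

# Pooled SD implied by d = mean_diff / SD (back-calculation only).
implied_sd = mean_diff / cohens_d

print(f"difference of rounded proportions: {diff_from_rounded:.3f}")
print(f"NNT from reported increase: {nnt:.2f}")
print(f"implied pooled SD: {implied_sd:.2f}")
```

The reciprocal of 0.395 reproduces the reported NNT of 2.53. The difference of the rounded proportions comes out at 0.396 instead of 0.395, a gap that is consistent with rounding of the underlying percentages.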

Key Points

Intervention tested: five presentation formats of GPT-4.5–generated oncology recommendations:
- Group 1: Recommendation only
- Group 2: Recommendation + natural-language explanation
- Group 3: Recommendation + source citations
- Group 4: Recommendation + explanation + citations
- Group 5: Recommendation + explanation + citations + AFC
Primary outcome: trust rated on a validated 5-point Likert scale (1–5) after each case.
Main quantitative results:
- Mean trust scores: Group 1 = 2.59; Group 2 = 2.84; Group 3 = 3.01; Group 4 = 3.09; Group 5 (AFC) = 3.80.
- AFC vs pooled controls: Cohen’s d = 0.94 (95% CI 0.88–1.00), mean difference 0.91 (95% CI 0.86–0.96).
- Trust prevalence (score ≥ 4): pooled controls 26.9% vs AFC 66.5% (absolute +39.5 pp).
- Effect robust across specialties (radiology, medical oncology, radiation oncology), cancer types, and experience levels.
Traditional transparency (explanations and citations) produced modest, dose-responsive improvements (d ≈ 0.08–0.50) that were far smaller than the AFC effect.
Modeling: linear mixed-effects models with crossed random effects for participant and case; GEE for proportions; sensitivity analyses (ordinal models, imputations) confirmed robustness.
Trial scope: 356 completed participants, 21 cases per participant across seven cancer types, 7,476 trust ratings.
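The modeling approach named above (a linear mixed-effects model with crossed random effects for participant and case) can be written out as follows. This is a generic textbook form, not the paper's exact specification: the symbols $y$, $\beta$, $u$, $v$, and $\varepsilon$ are illustrative notation, and treating the five groups as indicator variables with Group 1 as the reference is an assumption.

```latex
% Trust rating of participant i on case j, with Group 1 as the reference level.
% Notation is illustrative and not taken from the paper.
y_{ij} = \beta_0 + \sum_{g=2}^{5} \beta_g \,\mathbb{1}[\mathrm{group}(i) = g] + u_i + v_j + \varepsilon_{ij},
\qquad
u_i \sim \mathcal{N}(0, \sigma_u^2),\quad
v_j \sim \mathcal{N}(0, \sigma_v^2),\quad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma_\varepsilon^2)

% Intraclass correlations implied by this variance decomposition.
\mathrm{ICC}_{\mathrm{participant}} = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_v^2 + \sigma_\varepsilon^2},
\qquad
\mathrm{ICC}_{\mathrm{case}} = \frac{\sigma_v^2}{\sigma_u^2 + \sigma_v^2 + \sigma_\varepsilon^2}
```

Under this form, $u_i$ absorbs the tendency of an individual clinician to rate high or low across all cases, and $v_j$ absorbs the tendency of a given case to attract higher or lower trust. The random effects are crossed, not nested, because each case is rated by many participants and each participant rates many cases.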

Data & Methods

Design: Prospective, randomized, controlled trial (CONSORT-aligned). Participants randomized to arms (with/without explanations) and sub-randomized to one of five group formats; stratified by specialty.
Participants: 356 practicing clinicians (160 diagnostic radiologists, 111 medical oncologists, 85 radiation oncologists) recruited online. Diverse experience levels and prior AI exposure.
Cases: 21 de-identified, clinically realistic cases per participant covering prostate, breast, lung, colorectal, kidney, liver (HCC), and lymphoma; each specialty had discipline-specific case sets and decision endpoints.
AI outputs: Recommendations produced by GPT-4.5 with specialty prompts and few-shot examples; outputs validated for guideline concordance by board-certified specialists.
AFC intervention: decomposed each recommendation into atomic factual claims; each claim accompanied by a verification status indicator and direct link/highlight to the exact guideline passage supporting that claim.
Outcomes & analysis:
- Primary: 5-point trust Likert score post-case.
- Secondary: proportion with trust (≥4), subgroup effects, NNT.
- Statistical approach: linear mixed-effects models for mean differences (participant and case random effects), GEE for proportions (exchangeable correlation), Tukey-adjusted pairwise tests, sensitivity checks (ordinal mixed models, multiple imputation).
Trial validity notes: investigators analyzing outcomes were blinded to allocation; participants were blinded to the alternative formats. Intraclass correlations: participant ICC = 0.18; case ICC = 0.06.
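The AFC intervention described above (atomic claims, each with a verification status and a link to a guideline passage) suggests a simple data shape. The sketch below is hypothetical and is not the authors' implementation: the class names, the three status values, and the `guideline_id` field are all assumptions made for illustration, and the example strings are placeholders, not clinical content.

```python
from dataclasses import dataclass
from enum import Enum


class VerificationStatus(Enum):
    # Hypothetical status values; the summary only says each claim
    # carries a verification status indicator.
    SUPPORTED = "supported"
    NOT_SUPPORTED = "not_supported"
    NOT_FOUND = "not_found"


@dataclass(frozen=True)
class AtomicClaim:
    text: str                   # one discrete factual claim
    status: VerificationStatus  # result of checking it against the guideline
    guideline_id: str           # hypothetical identifier of the source guideline
    passage: str                # guideline passage shown to the clinician


@dataclass
class CheckedRecommendation:
    recommendation: str
    claims: list[AtomicClaim]

    def supported_fraction(self) -> float:
        """Share of claims marked SUPPORTED (0.0 when there are no claims)."""
        if not self.claims:
            return 0.0
        supported = sum(c.status is VerificationStatus.SUPPORTED for c in self.claims)
        return supported / len(self.claims)

    def unverified(self) -> list[AtomicClaim]:
        """Claims that are not marked SUPPORTED and need manual review."""
        return [c for c in self.claims if c.status is not VerificationStatus.SUPPORTED]


# Placeholder content only; no real recommendation or guideline text.
rec = CheckedRecommendation(
    recommendation="<recommendation text>",
    claims=[
        AtomicClaim("<claim 1>", VerificationStatus.SUPPORTED, "guideline-x", "<passage 1>"),
        AtomicClaim("<claim 2>", VerificationStatus.NOT_FOUND, "guideline-x", ""),
    ],
)
print(rec.supported_fraction())  # 0.5
print(len(rec.unverified()))     # 1
```

Keeping each claim as its own record is what would let an interface show a per-claim status and passage, instead of a single block of explanation attached to the whole recommendation.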

Implications for AI Economics

Adoption & Diffusion
- Trust is a key barrier to clinical adoption. AFC materially increases clinicians’ willingness to accept LLM recommendations, implying higher uptake rates for decision-support tools that incorporate AFC-style verification.
- High effect size and low NNT imply relatively small investments in AFC UX could yield large increases in adoption at the clinician level.
Value of Verification vs. Explainability
- AFC’s superior impact suggests economic value lies more in evidence-linking and verifiability (reducing cognitive verification cost) than in opaque natural-language explanations. Product teams and health IT buyers should prioritize verifiable-source UX over longer-form explanations.
Product Design & Costing
- AFC requires granular mapping from model claims to guideline passages (development effort, guideline licensing/access, annotation/maintenance). These are upfront engineering and content costs that may be recouped via faster adoption, higher utilization, or premium pricing for certified deployments.
- Ongoing maintenance costs: guidelines change, so AFC systems need pipelines for guideline updates and re-linking. This creates recurring costs but also potential revenue streams for curated guideline APIs/indices.
Liability & Regulatory Considerations
- AFC reduces informational asymmetry by enabling clinicians to confirm claims against authoritative text, which could mitigate malpractice risk and support defensible use of AI recommendations. It does not remove clinician responsibility, however, and regulatory frameworks may favor systems that provide traceable evidence links.
- Regulators and payers may be more likely to approve/reimburse AI decision-support tools that enable atomic verifiability and clinician confirmability.
Efficiency and Workflow Trade-offs
- AFC reduces the cognitive load for confirmation (itemized checks), but it introduces an explicit verification interaction. Net impact on time-per-decision is uncertain: AFC may speed decision confidence for many cases (reducing follow-up literature searches) but could add microtasks in marginal cases.
- Economic analyses should measure time savings (or losses) per case, downstream impact on ordering, referrals, and treatment choices, and potential reductions in adverse events from better-calibrated trust.
Market & Competitive Implications
- AFC-like features can be product differentiators for clinical AI vendors. Firms that operationalize verifiable claim-linking at scale could capture market share in hospital procurement and specialty workflows.
- There may be a market for third-party services that maintain canonical guideline-to-claim mappings (data-as-a-service for AFC).
Externalities & Systemic Effects
- If AFC increases reliance on AI in contexts where the underlying model is incorrect, AFC could crystallize errors if claims are mis-linked or if guideline interpretation is flawed. Thus economic value depends on fidelity of claim extraction and linking.
- Positive externality: better calibrated clinician trust could increase productivity and standardize care, potentially reducing variation and downstream costs.
Research & Evaluation Needs for Economic Modeling
- This study measures trust (a leading adoption indicator) but not clinical accuracy, patient outcomes, or workflow time—key inputs for cost-effectiveness and ROI models.
- Next steps for economic assessment: randomized trials measuring downstream utilization, time-to-decision, error rates, patient outcomes, and maintenance costs of AFC systems.

Caveats and limitations to consider for economic interpretation
- Outcome is clinician-reported trust, not objective patient outcomes or correctness of decisions; stronger economic claims require linkage to clinical effectiveness and resource use.
- Trial used GPT-4.5 and curated guideline links in a controlled study; real-world integration, guideline coverage, and model fidelity may vary.
- AFC production has nontrivial engineering and curation costs; economic ROI depends on scale, update frequency, and willingness-to-pay by provider organizations.

Overall: AFC appears to be a high-impact, design-level intervention to raise clinician trust in LLM recommendations. From an AI-economics perspective, investing in verifiable, guideline-linked outputs could materially increase adoption and justify the associated development and maintenance costs, provided AFC systems are accurate, updatable, and integrated to minimize workflow friction.

Assessment

Paper Type: rct

Evidence Strength: high. Randomized assignment provides credible causal identification and the reported large effect (Cohen's d = 0.94) is based on many observations (7,476 trust ratings), giving high internal validity for the effect of presentation format on self-reported clinician trust; limitations remain because outcomes are attitudinal (trust ratings) rather than downstream behavioral or patient outcomes.

Methods Rigor: high. Use of an RCT with a substantial number of observations and reported effect sizes indicates strong methodological rigor; however, the write-up lacks detail here on blinding, allocation concealment, balance tests, treatment fidelity, and statistical handling of within-clinician clustering, which are potential areas for scrutiny.

Sample: 356 clinicians who provided trust ratings across multiple clinical vignettes/AI recommendations, yielding 7,476 total trust ratings; further details on clinician specialties, practice settings, geographic distribution, and recruitment procedures are not specified in the prompt.

Themes: human_ai_collab, adoption

Identification: Randomized controlled trial: clinicians were randomly assigned to receive AI treatment recommendations presented either with atomic fact-checking (decomposed, verifiable claims linked to source guidelines), traditional explainability mechanisms, or baseline presentation; causal effects identified via random assignment of clinicians to arms and comparison of trust ratings across arms (analysis accounts for repeated ratings per clinician).

Generalizability:
- Vignette-based or simulated recommendation setting may not reflect real-world clinical workflows or stakes
- Outcome is self-reported trust rather than actual adoption, decision-making, or patient outcomes
- Sample representativeness unclear (specialty mix, experience level, geographic/cultural context unknown)
- Effect tied to the specific AI presentation and guideline sources used; may not generalize to other AI systems or domains
- Short-term measurement; durability of trust changes over time is unknown

Claims (5)

Claim	Category	Direction	Confidence	Outcome	Details
Atomic fact-checking produced a large effect on clinician trust (Cohen's d = 0.94).	Worker Satisfaction	positive	high	clinician trust (trust ratings)	n=356; Cohen's d = 0.94; 1.0
Atomic fact-checking increased the proportion of clinicians expressing trust from 26.9% to 66.5%.	Worker Satisfaction	positive	high	proportion of clinicians expressing trust	n=356; increase from 26.9% to 66.5%; 1.0
Traditional transparency/explainability mechanisms showed a dose-response gradient of improvement over baseline (Cohen's d ranged from 0.25 to 0.50).	Worker Satisfaction	positive	high	clinician trust (trust ratings) under traditional transparency mechanisms	n=356; Cohen's d = 0.25 to 0.50; 0.6
The study was a randomized trial of 356 clinicians generating 7,476 trust ratings.	Other	null_result	high	number of trust ratings collected (trial metadata)	n=356; 7,476 trust ratings collected; 1.0
Decomposing AI recommendations into individually verifiable claims linked to source guidelines produces substantially higher clinician trust than traditional explainability approaches in high-stakes clinical decisions.	Worker Satisfaction	positive	high	clinician trust in AI treatment recommendations	n=356; Cohen's d = 0.94 (atomic fact-checking) vs d = 0.25–0.50 (traditional mechanisms); increase from 26.9% to 66.5% trusting; 1.0