Providing three retrieved passages lets humans check AI extractions nearly as accurately as reading the full source while saving time; by contrast, LLM-generated explanations are faster but encourage over-reliance and make users less likely to spot errors, particularly on complex answers.

To Believe or Not To Believe: Comparing Supporting Information Tools to Aid Human Judgments of AI Veracity

Jessica Irons, Patrick Cooper, Necva Bolucu, Roelien Timmer, Huichen Yang, Changhyun Lee, Brian Jin, Andreas Duenser, Stephen Wan · March 12, 2026

arXiv · RCT · medium evidence · 7/10 relevance

BM25 passage retrieval (TopK) speeds human veracity checks while preserving accuracy similar to reading the full source, whereas LLM-generated explanations also speed assessments but induce inappropriate reliance that reduces detection of incorrect AI outputs, especially for complex items.

With increasing awareness of the hallucination risks of generative artificial intelligence (AI), we see a growing shift toward providing information tooling to help users determine the veracity of AI-generated answers for themselves. User responsibility for assessing veracity is particularly critical for certain sectors that rely on on-demand, AI-generated data extraction, such as biomedical research and the legal sector. While prior work offers us a variety of ways in which systems can provide such support, there is a lack of empirical evidence on how this information is actually incorporated into the user's decision-making process. Our user study takes a step toward filling this knowledge gap. In the context of a generative AI data extraction tool, we examine the relationship between the type of supporting information (full source text, passage retrieval, and Large Language Model (LLM) explanations) and user behavior in the veracity assessment process, examined through the lens of efficiency, effectiveness, reliance and trust. We find that passage retrieval offers a reasonable compromise between accuracy and speed, with judgments of veracity comparable to using the full source text. LLM explanations, while also enabling rapid assessments, fostered inappropriate reliance and trust on the data extraction AI, such that participants were less likely to detect errors. In addition, we analyzed the impacts of the complexity of the information need, finding preliminary evidence that inappropriate reliance is worse for complex answers. We demonstrate how, through rigorous user evaluation, we can better develop systems that allow for effective and responsible human agency in veracity assessment processes.

Summary

Main Finding

Passage retrieval (showing a few passages retrieved from the source, here the top three BM25-ranked paragraphs) is a practical middle ground: it yields veracity judgments comparable to presenting the full source text while being faster. By contrast, LLM-generated explanations speed up assessment but encourage inappropriate reliance and trust, causing users to miss model errors. This effect appears stronger for complex information needs.

Key Points

Problem: Users increasingly must assess the veracity of AI-generated data extractions, especially in high‑stakes domains (biomedical research, law).
Intervention: Compared three supporting-information strategies in a generative-AI data‑extraction interface:
- Full source text
- Passage retrieval (three BM25-retrieved paragraphs from the source)
- LLM explanations (model‑generated justification)
Evaluation criteria: efficiency (time), effectiveness (accuracy of veracity judgments), reliance (tendency to accept outputs without verification), and trust.
Results:
- Passage retrieval ≈ full text for accuracy, faster than reading full documents.
- LLM explanations enable rapid assessments but increase inappropriate reliance; participants were less likely to detect extraction errors.
- Preliminary evidence that inappropriate reliance from LLM explanations is larger for more complex information needs.
Conclusion: User-facing verification design matters. Passage-level evidence supports effective, efficient human verification, whereas model explanations can mislead and create overreliance.
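
To make the passage-retrieval condition concrete, here is a minimal sketch of BM25 top-k ranking over a document's paragraphs. This summary states only that three BM25-retrieved paragraphs were shown; the tokenizer, the parameter values `k1=1.5` and `b=0.75`, the function names, and the example passages are all assumptions for illustration, not the authors' implementation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and digits (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_top_k(query, passages, k=3, k1=1.5, b=0.75):
    """Return the indices of the k passages with the highest BM25 score."""
    docs = [tokenize(p) for p in passages]
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of passages that contain each term.
    df = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed inverse document frequency (always positive).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency, saturated by k1 and normalized by passage length.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)

    ranked = sorted(range(n_docs), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Hypothetical passages and query, invented for this example.
passages = [
    "The study measured sleep duration in 200 adults.",
    "Funding was provided by a national research council.",
    "Adults who slept less than six hours reported lower mood.",
    "The authors are based at a university in Australia.",
]
top = bm25_top_k("how many hours did adults sleep", passages, k=2)
```

In an interface like the one studied, the extracted field or its question would presumably serve as the query, and the returned paragraphs would be displayed next to the AI answer in place of the full article.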

Data & Methods

Setting: A user study using a generative-AI tool for on‑demand data extraction.
Conditions compared: full source text, passage retrieval, and LLM explanations as verification aids.
Outcome measures:
- Efficiency: time to reach veracity judgment
- Effectiveness: correctness of veracity judgments (ability to detect errors)
- Reliance: frequency of blindly accepting model output
- Trust: self‑reported or behaviorally inferred trust in the AI
Additional analysis: Interaction between type/complexity of information need and verification behavior; found initial evidence that complexity magnifies inappropriate reliance on LLM explanations.
Limitations (reported or implied):
- Study details (sample size, participant background, task variety) are not specified here; results are preliminary and may vary by domain, user expertise, and task design.
- LLM explanation formats and quality can vary; effects depend on how explanations are generated and presented.
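
The outcome measures listed above can be computed from per-trial records along the following lines. This is a sketch under stated assumptions: the `Trial` fields and the operationalization of reliance as "accepting an incorrect AI answer" are hypothetical choices made here, and the paper may define its measures differently.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    ai_correct: bool      # ground truth: was the AI-extracted answer correct?
    judged_correct: bool  # participant's verdict on the AI answer
    seconds: float        # time taken to reach the verdict


def summarize(trials):
    """Aggregate efficiency, effectiveness, and reliance over a list of trials."""
    n = len(trials)
    wrong = [t for t in trials if not t.ai_correct]
    # Effectiveness: share of verdicts that match the ground truth.
    accuracy = sum(t.judged_correct == t.ai_correct for t in trials) / n
    # Errors the participant caught (AI wrong, participant rejected it).
    detected = sum(not t.judged_correct for t in wrong)
    return {
        "accuracy": accuracy,
        "error_detection_rate": detected / len(wrong) if wrong else None,
        # Assumed reliance proxy: incorrect AI answers that were accepted.
        "over_reliance_rate": (len(wrong) - detected) / len(wrong) if wrong else None,
        "mean_seconds": sum(t.seconds for t in trials) / n,
    }
```

Computing this summary separately per condition (full text, passage retrieval, LLM explanation) gives the kind of comparison the study reports; self-reported trust would come from a questionnaire rather than from trial logs.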

Implications for AI Economics

Productivity vs. Verification Trade-off
- Passage retrieval reduces verification time while preserving accuracy, improving effective worker productivity where verification is required (e.g., legal review, literature synthesis).
- LLM explanations may increase apparent short‑run productivity but raise error risks that can be costly in high‑stakes settings.
Risk Externalities and Liability
- Increased inappropriate reliance shifts error risk onto downstream stakeholders (clients, firms), creating negative externalities and potential liability exposure for AI providers or deploying organizations.
- Markets may price AI services to reflect required verification effort or allocate liability via contracts/insurance; empirical estimates of verification-related costs will be needed.
Design and Productization Decisions
- Firms should prioritize retrieval-style evidence interfaces in products aimed at professional or regulated users to reduce verification costs and error rates.
- Offering LLM explanations without strong provenance may create moral hazard (users overtrust); vendors could face reputational and regulatory costs.
Labor and Skill Composition
- Demand for workers with verification skills (critical reading, source-validation) may rise; some routine verification tasks can be sped up via passage retrieval, shifting labor toward higher‑value oversight.
- Conversely, poorly designed explanation features could deskill users and increase downstream correction costs.
Market for Verification Tools and Third‑Party Audits
- Opportunity for third‑party verification services, provenance-aware retrieval systems, and audit tools that quantify uncertainty and source fidelity.
- Pricing models might bundle provenance features or charge for different verification tiers (e.g., full text vs. passage vs. flagged explanations).
Regulation and Standards
- Regulators and standards bodies may favor provenance and evidence-displaying interfaces over opaque model justifications for high‑risk domains.
- Standards for veracity-support affordances could become part of compliance regimes, affecting market entry and product design.
Research & Investment Priorities
- Economic evaluations should quantify cost of false positives/negatives from different UI designs to inform optimal investments.
- Further studies needed on heterogeneous users (experts vs. novices), domain specificity, and long‑run behavioral adaptation (does overreliance persist or attenuate with experience?).
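
The productivity-versus-verification trade-off discussed above can be expressed as an expected cost per checked item: reviewer time plus the expected cost of missed errors and false alarms. The sketch below is one simple way to write that down; the function, its parameters, and every number in the usage example are hypothetical placeholders, not estimates from the study.

```python
def expected_cost_per_item(seconds, hourly_wage, error_rate, miss_rate,
                           cost_missed_error, false_alarm_rate=0.0,
                           cost_false_alarm=0.0):
    """Expected cost of verifying one AI-extracted answer.

    seconds            average time a reviewer spends on the item
    hourly_wage        cost of reviewer time per hour
    error_rate         probability the AI answer is wrong
    miss_rate          probability a wrong answer is accepted anyway
    cost_missed_error  downstream cost of one accepted wrong answer
    false_alarm_rate   probability a correct answer is rejected
    cost_false_alarm   cost of one unnecessary rejection (e.g. rework)
    """
    time_cost = seconds / 3600 * hourly_wage
    missed = error_rate * miss_rate * cost_missed_error
    false_alarms = (1 - error_rate) * false_alarm_rate * cost_false_alarm
    return time_cost + missed + false_alarms


# Placeholder comparison: a slower aid with fewer misses versus a faster aid
# with more misses. All values are invented for illustration only.
slower_aid = expected_cost_per_item(seconds=90, hourly_wage=40, error_rate=0.1,
                                    miss_rate=0.2, cost_missed_error=50)
faster_aid = expected_cost_per_item(seconds=45, hourly_wage=40, error_rate=0.1,
                                    miss_rate=0.5, cost_missed_error=50)
```

Which aid comes out cheaper depends entirely on the assumed cost of a missed error relative to reviewer time, which is why empirical estimates of these quantities matter.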

Suggested actions for stakeholders:
- Product teams: default to passage-level provenance for professional workflows; make LLM explanations inspectable and clearly labeled as model interpretations, not authoritative evidence.
- Firms & regulators: require or incentivize provenance display in regulated applications; develop liability frameworks that account for verification aids.
- Economists and policymakers: measure verification costs and error externalities to inform pricing, liability, and subsidy/standardization decisions.

Assessment

Paper Type: rct
Evidence Strength: medium. Strong internal validity from random assignment, controlled stimuli, and consistency across participants, but limited external validity due to a small, artificial task setting (only three source articles, pre-generated AI outputs and contrived incorrect answers), single retrieval and LLM models, and unspecified/likely limited sample size.
Methods Rigor: medium. Well-designed controlled experiment using standard measures (NASA-TLX, S-TIAS), careful pre-generation of outputs, and plausible retrieval/LLM baselines (BM25, GPT-OSS), but limitations include few source documents, injected (rather than naturally occurring) errors, use of one LLM and one retrieval method, lack of reported sample size/power calculations in the excerpt, and a between-subjects design that may reduce power.
Sample: Human participants performed veracity assessments on structured AI-extracted answers drawn from three science-themed news articles (The Conversation); for each article five fields were extracted (15 answers total) using GPT-OSS:20B, with ~30% of answers artificially made incorrect; supporting information presented per condition was (1) the full article PDF, (2) three BM25-retrieved paragraphs (TopK), or (3) an LLM-generated explanation; participant demographics and sample size are not reported in the provided excerpt.
Themes: human_ai_collab, productivity, adoption
Identification: Randomized between-subjects user experiment: participants were randomly assigned to one of three supporting-information conditions (full PDF, BM25 Top-3 passage retrieval, or LLM-generated explanation). AI-generated answers and supporting information were pre-generated and incorrect answers were systematically injected (~30%) to create a controlled test of participants' veracity judgments.
Generalizability:
- Domain limited to three science-news articles written for general readership (not specialized technical literature)
- Only five information fields per article; may not represent more complex or diverse extraction tasks
- Errors were artificially created; real-world hallucinations may differ in nature
- Single retrieval method (BM25) and single LLM variant used; results may not generalize to other models or retrieval systems
- Laboratory-style, one-off tasks; does not capture long-term learning, repeated use, or workplace workflows
- Participant sample details not reported; likely non-representative (e.g., crowdworkers or volunteers)

Claims (6)

Claim	Category	Direction	Confidence	Outcome	Details
Passage retrieval offers a reasonable compromise between accuracy and speed, with judgments of veracity comparable to using the full source text.	Decision Quality	positive	medium	veracity-judgment accuracy; time-to-assessment (efficiency)	0.36
LLM explanations enable rapid veracity assessments.	Decision Quality	positive	medium	time-to-assessment (efficiency)	0.36
LLM explanations foster inappropriate reliance and trust on the data-extraction AI: participants were less likely to detect errors when provided with LLM explanations.	Error Rate	negative	medium	error-detection rate; measures of reliance/trust	0.36
Preliminary evidence that inappropriate reliance on AI outputs is worse for complex information needs (complex answers).	Error Rate	negative	low	error-detection rate and reliance stratified by complexity of question/answer	preliminary/stratified 0.18
User responsibility for assessing veracity is particularly critical in sectors that rely on on-demand, AI-generated data extraction, such as biomedical research and the legal sector.	Governance And Regulation	positive	medium	not_measured/background	0.36
Rigorous user evaluation can help develop systems that allow for effective and responsible human agency in veracity-assessment processes.	Organizational Efficiency	positive	medium	system design effectiveness for supporting human veracity assessment (inferred, not directly operationalized in abstract)	0.36