Providing three retrieved passages lets humans check AI extractions nearly as accurately as reading the full source while saving time; by contrast, LLM-generated explanations are faster but encourage over-reliance and make users less likely to spot errors, particularly on complex answers.
With increasing awareness of the hallucination risks of generative artificial intelligence (AI), we see a growing shift toward providing information tooling to help users determine the veracity of AI-generated answers for themselves. User responsibility for assessing veracity is particularly critical for certain sectors that rely on on-demand, AI-generated data extraction, such as biomedical research and the legal sector. While prior work offers us a variety of ways in which systems can provide such support, there is a lack of empirical evidence on how this information is actually incorporated into the user's decision-making process. Our user study takes a step toward filling this knowledge gap. In the context of a generative AI data extraction tool, we examine the relationship between the type of supporting information (full source text, passage retrieval, and Large Language Model (LLM) explanations) and user behavior in the veracity assessment process, examined through the lens of efficiency, effectiveness, reliance and trust. We find that passage retrieval offers a reasonable compromise between accuracy and speed, with judgments of veracity comparable to using the full source text. LLM explanations, while also enabling rapid assessments, fostered inappropriate reliance and trust on the data extraction AI, such that participants were less likely to detect errors. In addition, we analyzed the impacts of the complexity of the information need, finding preliminary evidence that inappropriate reliance is worse for complex answers. We demonstrate how, through rigorous user evaluation, we can better develop systems that allow for effective and responsible human agency in veracity assessment processes.
Summary
Main Finding
Passage retrieval (showing the source passages the model used) is a practical middle ground: it yields veracity judgments comparable to presenting the full source text while being faster. By contrast, LLM-generated explanations speed up assessment but encourage inappropriate reliance and trust, causing users to miss model errors. This effect appears stronger for complex information needs.
Key Points
- Problem: Users increasingly must assess the veracity of AI-generated data extractions, especially in high‑stakes domains (biomedical research, law).
- Intervention: Compared three supporting-information strategies in a generative-AI data‑extraction interface:
  - Full source text
  - Passage retrieval (the specific passages cited)
  - LLM explanations (model‑generated justification)
- Evaluation criteria: efficiency (time), effectiveness (accuracy of veracity judgments), reliance (tendency to accept outputs without verification), and trust.
- Results:
  - Passage retrieval ≈ full text for accuracy, faster than reading full documents.
  - LLM explanations enable rapid assessments but increase inappropriate reliance; participants were less likely to detect extraction errors.
  - Preliminary evidence that inappropriate reliance from LLM explanations is larger for more complex information needs.
- Conclusion: User-facing verification design matters. Passage-level evidence supports effective, efficient human verification, whereas model explanations can mislead users and create overreliance.
Data & Methods
- Setting: A user study using a generative-AI tool for on‑demand data extraction.
- Conditions compared: full source text, passage retrieval, and LLM explanations as verification aids.
- Outcome measures:
  - Efficiency: time to reach veracity judgment
  - Effectiveness: correctness of veracity judgments (ability to detect errors)
  - Reliance: frequency of blindly accepting model output
  - Trust: self‑reported or behaviorally inferred trust in the AI
- Additional analysis: Interaction between type/complexity of information need and verification behavior; found initial evidence that complexity magnifies inappropriate reliance on LLM explanations.
- Limitations (reported or implied):
  - Study details (sample size, participant background, task variety) are not specified here; results are preliminary and may vary by domain, user expertise, and task design.
  - LLM explanation formats and quality can vary; effects depend on how explanations are generated and presented.
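
As a rough illustration of how the outcome measures above might be computed from per-trial logs, here is a minimal Python sketch. The `Trial` record, its field names, and the choice to operationalize reliance as the share of incorrect extractions a participant accepted are all assumptions made for illustration; the summary does not say how the study computed these measures. Trust is left out because it is described as self-reported or behaviorally inferred.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One veracity judgment by one participant (hypothetical log schema)."""
    condition: str       # e.g. "full_text", "passage", "llm_explanation"
    ai_correct: bool     # ground truth: was the AI extraction correct?
    user_accepted: bool  # participant judged the extraction to be correct
    seconds: float       # time taken to reach the judgment


def summarize(trials, condition):
    """Return efficiency, effectiveness, and reliance for one condition."""
    rows = [t for t in trials if t.condition == condition]
    if not rows:
        raise ValueError(f"no trials for condition {condition!r}")
    # Trials where the AI was wrong are the ones where acceptance is an error.
    wrong = [t for t in rows if not t.ai_correct]
    return {
        # Efficiency: mean time to reach a judgment.
        "mean_seconds": sum(t.seconds for t in rows) / len(rows),
        # Effectiveness: share of judgments that match the ground truth.
        "accuracy": sum(t.user_accepted == t.ai_correct for t in rows) / len(rows),
        # Reliance (assumed definition): share of wrong extractions accepted.
        "over_reliance": (
            sum(t.user_accepted for t in wrong) / len(wrong) if wrong else None
        ),
    }


# Made-up records for illustration only; these are not data from the study.
example_trials = [
    Trial("passage", True, True, 30.0),
    Trial("passage", False, False, 50.0),
    Trial("llm_explanation", True, True, 20.0),
    Trial("llm_explanation", False, True, 10.0),
]

for name in ("passage", "llm_explanation"):
    print(name, summarize(example_trials, name))
```

Separating `accuracy` from `over_reliance` matters because a condition can look accurate overall while still hiding a high acceptance rate on the minority of trials where the AI is wrong.
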
Implications for AI Economics
- Productivity vs. Verification Trade-off
  - Passage retrieval reduces verification time while preserving accuracy, improving effective worker productivity where verification is required (e.g., legal review, literature synthesis).
  - LLM explanations may increase apparent short‑run productivity but raise error risks that can be costly in high‑stakes settings.
- Risk Externalities and Liability
  - Increased inappropriate reliance shifts error risk onto downstream stakeholders (clients, firms), creating negative externalities and potential liability exposure for AI providers or deploying organizations.
  - Markets may price AI services to reflect required verification effort or allocate liability via contracts/insurance; empirical estimates of verification-related costs will be needed.
- Design and Productization Decisions
  - Firms should prioritize retrieval-style evidence interfaces in products aimed at professional or regulated users to reduce verification costs and error rates.
  - Offering LLM explanations without strong provenance may create moral hazard (users overtrust); vendors could face reputational and regulatory costs.
- Labor and Skill Composition
  - Demand for workers with verification skills (critical reading, source-validation) may rise; some routine verification tasks can be sped up via passage retrieval, shifting labor toward higher‑value oversight.
  - Conversely, poorly designed explanation features could deskill users and increase downstream correction costs.
- Market for Verification Tools and Third‑Party Audits
  - Opportunity for third‑party verification services, provenance-aware retrieval systems, and audit tools that quantify uncertainty and source fidelity.
  - Pricing models might bundle provenance features or charge for different verification tiers (e.g., full text vs. passage vs. flagged explanations).
- Regulation and Standards
  - Regulators and standards bodies may favor provenance and evidence-displaying interfaces over opaque model justifications for high‑risk domains.
  - Standards for veracity-support affordances could become part of compliance regimes, affecting market entry and product design.
- Research & Investment Priorities
  - Economic evaluations should quantify cost of false positives/negatives from different UI designs to inform optimal investments.
  - Further studies needed on heterogeneous users (experts vs. novices), domain specificity, and long‑run behavioral adaptation (does overreliance persist or attenuate with experience?).
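
One way to make the verification trade-off and the cost of false positives and negatives concrete is a simple expected-cost calculation per checked item. The sketch below is only an illustration: the function name, its parameters, and the additive cost model are assumptions, not something taken from the study, and any numbers passed to it would have to come from real measurements.

```python
def expected_cost_per_item(
    seconds,               # verification time per item
    hourly_wage,           # cost of the verifier's time
    error_rate,            # share of AI extractions that are wrong
    miss_rate,             # share of wrong extractions the user accepts
    cost_per_miss,         # downstream cost of an undetected error
    false_alarm_rate=0.0,  # share of correct extractions the user rejects
    cost_per_false_alarm=0.0,  # rework cost when a correct item is rejected
):
    """Expected cost of checking one extraction (assumed additive model)."""
    for rate in (error_rate, miss_rate, false_alarm_rate):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rates must lie between 0 and 1")
    # Labor cost of the time spent verifying.
    labor = seconds / 3600 * hourly_wage
    # False negatives: the AI is wrong and the user does not notice.
    missed = error_rate * miss_rate * cost_per_miss
    # False positives: the AI is right but the user rejects the answer.
    false_alarms = (1 - error_rate) * false_alarm_rate * cost_per_false_alarm
    return labor + missed + false_alarms
```

Under this model, an aid that shortens `seconds` but raises `miss_rate` lowers cost only when missed errors are cheap; comparing the function's output across interface designs, using measured inputs, would show where that break-even point lies.
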
Suggested actions for stakeholders:
- Product teams: default to passage-level provenance for professional workflows; make LLM explanations inspectable and clearly labeled as model interpretations, not authoritative evidence.
- Firms & regulators: require or incentivize provenance display in regulated applications; develop liability frameworks that account for verification aids.
- Economists and policymakers: measure verification costs and error externalities to inform pricing, liability, and subsidy/standardization decisions.
Assessment
Claims (6)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| Passage retrieval offers a reasonable compromise between accuracy and speed, with judgments of veracity comparable to using the full source text. (Decision Quality) | positive | medium | veracity-judgment accuracy; time-to-assessment (efficiency) | 0.36 |
| LLM explanations enable rapid veracity assessments. (Decision Quality) | positive | medium | time-to-assessment (efficiency) | 0.36 |
| LLM explanations foster inappropriate reliance and trust on the data-extraction AI: participants were less likely to detect errors when provided with LLM explanations. (Error Rate) | negative | medium | error-detection rate; measures of reliance/trust | 0.36 |
| Preliminary evidence that inappropriate reliance on AI outputs is worse for complex information needs (complex answers). (Error Rate) | negative | low | error-detection rate and reliance stratified by complexity of question/answer | preliminary/stratified; 0.18 |
| User responsibility for assessing veracity is particularly critical in sectors that rely on on-demand, AI-generated data extraction, such as biomedical research and the legal sector. (Governance And Regulation) | positive | medium | not_measured/background | 0.36 |
| Rigorous user evaluation can help develop systems that allow for effective and responsible human agency in veracity-assessment processes. (Organizational Efficiency) | positive | medium | system design effectiveness for supporting human veracity assessment (inferred, not directly operationalized in abstract) | 0.36 |