AI-written explanations don’t make people more accurate: narrative explanations from large language models fail to raise classification accuracy over raw AI predictions, while making people more likely to follow the AI (even when it’s wrong) and slowing decisions.

Human Decision-Making with Persuasive and Narrative LLM Explanations

Laura R. Marusich, Mary Grace Kozuch Dhooghe, Jonathan Z. Bakdash, Murat Kantarcioglu · May 22, 2026

arXiv · RCT · medium evidence · 7/10 relevance

A randomized experiment shows that LLM-generated narrative explanations—regardless of persuasiveness—do not meaningfully change classification accuracy compared with AI predictions alone, but they increase reliance on AI (including when incorrect) and can slow responses and reduce discrimination between correct and incorrect predictions.

Large language models (LLMs) have the potential to aid and improve human decision-making in classification tasks, not only by providing fairly accurate predictions, but also in their ability to generate cogent narrative explanations of those predictions. Prior work has demonstrated that people generally find AI narrative explanations to be understandable, trustworthy, and convincing for changing beliefs and opinions; however, less is known about the impact of narrative explanations on objective human decision-making performance. Here we conduct a large-scale human behavioral experiment to evaluate decision-making performance with LLM-generated narrative explanations of varying persuasiveness. We found the degree of persuasiveness, or lack thereof, for LLM-based explanations did not meaningfully impact decision accuracy over a simple AI prediction alone, in agreement with typical results with explainable AI based on feature importance. We found evidence that narratives increased reliance on AI, but both when the AI prediction was correct and incorrect. Exploratory analyses also indicated that the more persuasive narratives may have had a detrimental effect on decision response times and the ability to discriminate between a correct and incorrect AI prediction. Overall, this work indicates that including narrative explanations with AI predictions may involve tradeoffs for decision-making performance, and more work is needed to determine how and when narrative explanations impact human decision-making.

Summary

Main Finding

LLM-generated narrative explanations — whether neutral, somewhat persuasive, or extremely persuasive — did not meaningfully change objective human decision accuracy compared with presenting an AI prediction alone. However, narrative explanations increased human reliance on the AI (people followed the AI prediction more often), and exploratory analyses suggested persuasive narratives may harm response-time efficiency and the ability to discriminate between correct and incorrect AI predictions.

Key Points

Experiment was pre-registered and large-scale (320 participants; 8 experimental conditions × 40 participants each).
Three explanation conditions were compared to a prediction-alone baseline: Neutral, Low Persuasion, and Extreme Persuasion.
Primary outcome (decision accuracy) showed no significant effect of explanation condition (ANOVA: F(3,312)=0.91, p=0.44). Dataset differences (Census vs Student) did drive accuracy differences (F(1,312)=33.42, p<0.001).
Reliance on AI (proportion of trials where participant followed the AI prediction) was affected by explanation condition (F(3,312)=3.15, p=0.03). Post-hoc tests: Neutral explanations produced significantly higher reliance than Prediction Alone (t(312)=2.83, p=0.03); Low Persuasion was marginal vs Prediction Alone.
Increased reliance occurred both when the AI prediction was correct and when it was incorrect — i.e., narratives raised following of AI regardless of its ground-truth accuracy.
Exploratory results indicate persuasive narratives may (a) increase response times and (b) reduce participants’ ability to discriminate correct from incorrect AI predictions (i.e., reduce effective decision quality even if aggregate accuracy is unchanged).
Pre-registered hypotheses expecting reduced accuracy with higher persuasion were not supported.
Limitations noted by authors: persuasive prompts sometimes introduce confounds (exaggerated/manipulative language), stimuli were a subset of two UCI datasets, and participants were general online subjects (Prolific).
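The reliance pattern described in the points above can be made concrete with a small sketch. It computes accuracy, overall reliance (proportion of trials where the participant followed the AI), and reliance split by whether the AI was correct. The `Trial` record, the field names, and the `discrimination_gap` index (reliance when the AI is correct minus reliance when it is wrong) are illustrative assumptions; the summary does not state which discrimination measure the authors used.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    ai_prediction: str  # label shown by the AI, e.g. "Pass" or "Fail"
    ground_truth: str   # true label for the instance
    choice: str         # label the participant chose


def reliance_summary(trials):
    """Summarize accuracy and reliance for a non-empty list of trials."""

    def follow_rate(subset):
        # Proportion of trials in which the participant followed the AI.
        if not subset:
            return None
        return sum(t.choice == t.ai_prediction for t in subset) / len(subset)

    ai_correct = [t for t in trials if t.ai_prediction == t.ground_truth]
    ai_wrong = [t for t in trials if t.ai_prediction != t.ground_truth]
    follow_correct = follow_rate(ai_correct)
    follow_wrong = follow_rate(ai_wrong)

    # Hypothetical index: a larger gap means the participant follows the AI
    # more selectively (more when it is right than when it is wrong).
    gap = None
    if follow_correct is not None and follow_wrong is not None:
        gap = follow_correct - follow_wrong

    return {
        "accuracy": sum(t.choice == t.ground_truth for t in trials) / len(trials),
        "reliance": follow_rate(trials),
        "reliance_ai_correct": follow_correct,
        "reliance_ai_wrong": follow_wrong,
        "discrimination_gap": gap,
    }
```

Under this framing, a condition that raises `reliance_ai_correct` and `reliance_ai_wrong` together increases overall reliance without making participants more selective, which is the pattern the key points describe.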

Data & Methods

Datasets: subsets from the UCI Census Income and Student Performance datasets. For each dataset, 58 raw instances were used to prompt the LLM; 52 instances (6 practice, 46 test) were selected for stimuli; each participant completed 20 test trials (balanced classes).
LLM & prompting: OpenAI GPT-4o (Omni) via API, temperature = 0.1, zero-shot prompting. Three final-instruction variants produced Neutral, Low Persuasion, and Extreme Persuasion narrative explanations for each instance. Text analyses confirmed differences on persuasion-related metrics, though Extreme invoked some exaggerated language (a confound).
Participants & procedure: 320 participants recruited on Prolific (40 per condition). Trial presentation: tabular individual features, AI prediction (Pass/Fail), and optionally a narrative explanation shown with a typewriter effect; participants then made a binary choice, reported confidence (1–7), and received feedback. Response times and confidence calibration were recorded.
Measures and analysis: primary measures were decision accuracy (vs ground truth), AI reliance (followed AI prediction), response time, and confidence calibration (multilevel modeling). Outlier exclusion: trial RTs > 3 SD above participant mean removed (2.3% of trials).
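The response-time exclusion rule above (drop trials more than 3 SD above the participant's own mean) can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function name and the input layout are hypothetical, and the use of the sample standard deviation is an assumption, since the summary does not say which estimator was used.

```python
from statistics import mean, stdev


def exclude_slow_trials(rts_by_participant, n_sd=3.0):
    """Drop response times more than n_sd SDs above each participant's mean.

    rts_by_participant maps a participant id to a list of response times.
    """
    kept = {}
    for pid, rts in rts_by_participant.items():
        if len(rts) < 2:
            # stdev needs at least two values; keep the data unchanged.
            kept[pid] = list(rts)
            continue
        # Assumption: sample SD, computed over that participant's trials.
        cutoff = mean(rts) + n_sd * stdev(rts)
        kept[pid] = [rt for rt in rts if rt <= cutoff]
    return kept
```

Only the upper tail is trimmed, matching the "above participant mean" wording; fast responses are left in place.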

Implications for AI Economics

Tradeoffs between uptake and accuracy: Narrative explanations can increase user reliance and therefore adoption of AI recommendations without improving (and potentially impairing) net decision quality. In economic settings (credit approvals, hiring, medical triage, loan underwriting), greater uptake of AI recommendations could increase operational efficiency but also amplify systematic errors when the AI is wrong, producing social and financial costs.
Externalities of persuasive LLMs: Persuasive narratives that do not correlate with model correctness can create negative externalities — e.g., firms deploying persuasive explanations may see short-term metrics (conversion, compliance) improve while downstream welfare (defaults, misallocated labor) deteriorates.
Evaluation metrics for deployed systems: Beyond predictive accuracy, deployers and regulators should measure (a) human-AI reliance patterns, (b) discrimination of correct vs incorrect model outputs by human overseers, and (c) downstream economic impacts of false positives/negatives. Cost-sensitive metrics and counterfactual welfare simulations are important.
Policy and design interventions:
- Require or incentivize disclosure and uncertainty/confidence calibration in explanations; prior work suggests some disclosures reduce trust but can improve appropriate reliance.
- Use calibrated uncertainty and concise, accuracy-linked explanations rather than purely persuasive narratives.
- Consider adaptive explanation policies: enable stronger narrative assistance only when model confidence and reliability in that subdomain are high.
- Audit persuasive properties of explanations and test the economic consequences of increased reliance on incorrect predictions (e.g., expected loss calculations).
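One way to picture the adaptive explanation policy in the list above is a simple gate that shows a narrative only when both the model's confidence on the instance and its measured reliability in the relevant subdomain clear a threshold. The function name, the mode labels, and the threshold values below are hypothetical and would need to be set from pilot data; this is a sketch of the idea, not a tested design.

```python
def choose_explanation_mode(model_confidence, subdomain_accuracy,
                            min_confidence=0.9, min_accuracy=0.85):
    """Pick how to present an AI prediction for one instance.

    Both inputs are proportions in [0, 1]. The default thresholds are
    placeholder values, not recommendations from the study.
    """
    for value in (model_confidence, subdomain_accuracy):
        if not 0.0 <= value <= 1.0:
            raise ValueError("inputs must be proportions in [0, 1]")

    if model_confidence >= min_confidence and subdomain_accuracy >= min_accuracy:
        # Model is confident and historically reliable here: allow a narrative.
        return "narrative"
    # Otherwise fall back to the prediction plus an explicit uncertainty cue.
    return "prediction_with_uncertainty"
```

The fallback deliberately avoids persuasive text, on the reasoning that narratives raised reliance even when the AI was wrong.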
Research agenda for AI economics:
- Quantify economic costs of false reliance induced by narrative explanations (simulate monetary loss under different reliance rates and error structures).
- Study task- and domain-dependence: when do narratives actually improve joint human-AI welfare (e.g., low-stakes training vs high-stakes allocation)?
- Evaluate personalization effects: does tailoring narrative tone/content to users improve calibration or simply increase blind trust? How does that affect distributional outcomes across socioeconomic groups?
- Test interventions (uncertainty cues, persistent disclaimers, expert citations) that might preserve the adoption benefits of narratives while reducing harmful overreliance.
Practical recommendation for practitioners: If deploying narrative LLM explanations in economically consequential decisions, run controlled pilots that measure both reliance and downstream economic metrics (losses, inequities). Prefer explanations tied to model confidence and domain evidence; instrument and monitor for increased following of incorrect recommendations and intervene (e.g., escalate to human review) when model uncertainty is high.
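The expected-loss calculations mentioned above can be sketched for a binary task. The model assumes that a participant who does not follow the AI picks the other label, so an error occurs either when a correct prediction is rejected or when an incorrect one is followed. All parameter names and the numbers in the usage note are hypothetical illustrations, not values from the study.

```python
def decision_error_rate(ai_accuracy, follow_when_correct, follow_when_wrong):
    """Error rate of the final human decision in a binary task."""
    reject_correct = ai_accuracy * (1.0 - follow_when_correct)
    follow_wrong = (1.0 - ai_accuracy) * follow_when_wrong
    return reject_correct + follow_wrong


def expected_loss_per_decision(ai_accuracy, follow_when_correct,
                               follow_when_wrong, cost_reject_correct,
                               cost_follow_wrong):
    """Expected cost per decision, with separate costs for the two error types."""
    reject_correct = ai_accuracy * (1.0 - follow_when_correct)
    follow_wrong = (1.0 - ai_accuracy) * follow_when_wrong
    return reject_correct * cost_reject_correct + follow_wrong * cost_follow_wrong
```

As a hypothetical illustration, take an AI that is right 80% of the time, with a cost of 100 for rejecting a correct prediction and 300 for following an incorrect one. Moving from reliance rates of 0.80 (AI correct) and 0.50 (AI wrong) to 0.85 and 0.70 leaves the error rate at about 0.26 in both cases, yet the expected loss per decision rises from about 46 to about 54. Aggregate accuracy can therefore look unchanged while the cost profile worsens, which is why reliance on incorrect predictions is worth monitoring separately.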

Limitations worth remembering for interpreting the findings: non-expert online sample, limited datasets, possible confounds in persuasion manipulation (style, length), and use of a single LLM/version (GPT-4o, March 2025). Results indicate important tradeoffs but not a universal rule — impact of narrative explanations is likely task- and context-dependent.

Assessment

Paper Type: rct
Evidence Strength: medium. Random assignment provides credible internal identification of the effect of narrative explanations on decision performance, but external validity is limited (online/lab tasks, a small set of tasks and prompts, a single LLM), and some analyses were exploratory rather than pre-registered.
Methods Rigor: medium. The large-scale, pre-registered randomized design and direct behavioral outcomes are strengths; potential issues include limited ecological validity, reliance on a convenience online sample (Prolific), a possible lack of robustness checks reported here, and dependence on specific prompts and a single LLM, which may limit robustness across contexts.
Sample: 320 participants recruited on Prolific (40 per condition) who completed controlled classification tasks where they saw AI predictions and, in some arms, LLM-generated narrative explanations of varying persuasiveness; outcomes measured included classification accuracy, reliance on AI (choice behavior), response times, and discrimination between correct and incorrect AI predictions.
Themes: human_ai_collab, productivity
Identification: Randomized controlled experiment. Participants were randomly assigned to see an AI prediction alone or an AI prediction plus an LLM-generated narrative explanation whose persuasiveness was experimentally varied; causal effects are identified by between-condition comparisons of decision accuracy, reliance, response time, and discrimination.
Generalizability:
- Online experimental participants may not represent professional/real-world decision-makers.
- Tasks were specific classification tasks and may not reflect other decision contexts (e.g., high-stakes, multi-step, or domain-expert settings).
- A single LLM and a small set of prompts and persuasion manipulations limit generalization across models and explanation styles.
- Short-term, one-off decisions; does not capture learning, long-run behavior, or organizational adoption dynamics.
- Likely limited geographic/language/demographic diversity (typical of online panels).

Claims (6)

Claim	Category	Direction	Confidence	Outcome	Details
The degree of persuasiveness for LLM-based narrative explanations did not meaningfully impact decision accuracy over a simple AI prediction alone.	Decision Quality	null_result	high	decision accuracy	0.6
Narrative explanations increased reliance on the AI, both when the AI prediction was correct and when it was incorrect.	Task Allocation	positive	high	reliance on AI	0.6
More persuasive narratives may have had a detrimental effect on decision response times.	Task Completion Time	negative	high	decision response time	0.3
More persuasive narratives may have had a detrimental effect on the ability to discriminate between a correct and incorrect AI prediction.	Decision Quality	negative	high	ability to discriminate correct vs. incorrect AI predictions	0.3
Prior work has demonstrated that people generally find AI narrative explanations to be understandable, trustworthy, and convincing for changing beliefs and opinions.	Worker Satisfaction	positive	high	perceived understandability/trustworthiness/convincingness of narrative explanations	0.6
Including narrative explanations with AI predictions may involve tradeoffs for decision-making performance.	Decision Quality	mixed	high	overall decision-making performance (tradeoffs across accuracy, reliance, response time, discrimination)	0.6