Shapley explanation variants raise analyst confidence without improving accuracy: in 3,735 fraud-case reviews, popular quantitative explanation proxies (sparsity, faithfulness) did not predict human-perceived clarity or decision utility, creating a risk of automation bias in operational decision systems.

Rethinking XAI Evaluation: A Human-Centered Audit of Shapley Benchmarks in High-Stakes Settings

Inês Oliveira e Silva, Sérgio Jesus, Iker Perez, Rita P. Ribeiro, Carlos Soares, Hugo Ferreira, Pedro Bizarro · April 24, 2026

Source: arXiv · Type: quasi_experimental · Evidence: medium · Relevance: 7/10

Across four risk datasets and 3,735 professional analyst reviews, eight Shapley explanation variants failed to improve objective analyst performance despite increasing decision confidence, and standard quantitative explanation metrics were poorly aligned with human-perceived clarity and decision utility.

Shapley values are a cornerstone of explainable AI, yet their proliferation into competing formulations has created a fragmented landscape with little consensus on practical deployment. While theoretical differences are well-documented, evaluation remains reliant on quantitative proxies whose alignment with human utility is unverified. In this work, we use a unified amortized framework to isolate semantic differences between eight Shapley variants under the low-latency constraints of operational risk workflows. We conduct a large-scale empirical evaluation across four risk datasets and a realistic fraud-detection environment involving professional analysts and 3,735 case reviews. Our results reveal a fundamental misalignment: standard quantitative metrics, such as sparsity and faithfulness, are decoupled from human-perceived clarity and decision utility. Furthermore, while no formulation improved objective analyst performance, explanations consistently increased decision confidence, signaling a critical risk of automation bias in high-stakes settings. These findings suggest that current evaluation proxies are insufficient for predicting downstream human impact, and we provide evidence-based guidance for selecting formulations and metrics in operational decision systems.

Summary

Main Finding

Standard quantitative XAI benchmarks for Shapley-based attributions (e.g., sparsity, faithfulness, deletion/insertion AUC) are poor predictors of human utility in high‑stakes, low‑latency decision workflows. Across 3,735 professional case reviews, no Shapley formulation improved objective analyst accuracy or review time, yet explanations consistently increased analysts’ confidence — revealing a pronounced automation‑bias risk. The study shows a fundamental decoupling between algorithmic proxies and downstream human outcomes, and provides guidance on which Shapley semantics behave more favorably in production settings.

Key Points

The paper audits eight Shapley formulations (fixed zero and mean baselines, uniform, marginal, joint‑marginal, conditional, counterfactual, filtered conditional) under a unified amortized estimation framework to remove implementation confounders.
Quantitative metrics exhibit structured trade‑offs:
- Fixed zero baseline yields strong deletion AUC and top‑feature recall but is semantically misaligned with empirical background sampling.
- Marginal and joint‑marginal variants give balanced performance (moderate sparsity, low sensitivity).
- Filtered conditional and some counterfactual variants maximize sparsity/contrastivity and insertion AUC but are unstable (sensitive to perturbations).
Amortization (a learned universal approximator) reproduced KernelSHAP ground truth with high fidelity, enabling millisecond inference suitable for production SLAs.
The human study (37 analysts, 3,735 reviews across five risk datasets, one of them a production fraud dataset) used a blinded within‑subjects design with an identical UI and interaction flow in every condition. It measured decision accuracy, decision time, and self‑reported confidence and clarity.
Main behavioral results:
- No Shapley formulation produced reliable accuracy gains or faster decision times.
- Explanations systematically increased analyst confidence (odds ratios > 1 for several variants) without corresponding accuracy improvements.
- Perceived clarity varied by formulation (joint‑marginal often rated clearer; zero baseline, uniform, and filtered conditional often rated confusing).
The authors release the interaction dataset and code to support reproducibility and further behavioral XAI benchmarking.
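Two of the offline metrics named in the key points, deletion AUC and L1/L2 sparsity, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: "deleting" a feature is taken to mean overwriting it with a baseline value, the curve is integrated with the trapezoidal rule over the fraction of features removed, and the toy linear scorer and all names (`l1_l2_ratio`, `deletion_auc`, `toy_model`) are hypothetical.

```python
import numpy as np


def l1_l2_ratio(phi):
    """L1/L2 norm ratio: 1.0 for a one-hot attribution, sqrt(d) for a uniform one."""
    phi = np.asarray(phi, dtype=float)
    l2 = np.linalg.norm(phi)
    return float(np.abs(phi).sum() / l2) if l2 > 0 else 0.0


def deletion_auc(model, x, phi, baseline):
    """Area under the model score as features are deleted, most important first.

    Assumption: a deleted feature is overwritten with its baseline value.
    """
    x_cur = np.array(x, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    order = np.argsort(-np.abs(np.asarray(phi, dtype=float)))
    scores = [model(x_cur)]
    for j in order:
        x_cur[j] = baseline[j]
        scores.append(model(x_cur))
    scores = np.asarray(scores, dtype=float)
    step = 1.0 / (len(scores) - 1)
    # Trapezoidal rule on an evenly spaced grid over [0, 1].
    return float(step * np.sum((scores[1:] + scores[:-1]) / 2.0))


# Toy linear scorer (hypothetical, for illustration only).
w = np.array([3.0, 1.0, 0.5])
toy_model = lambda v: float(w @ v)
x = np.ones(3)
zero = np.zeros(3)
good_phi = w * (x - zero)   # exact attributions for a linear model
bad_phi = good_phi[::-1]    # deliberately mis-ranked attributions

auc_good = deletion_auc(toy_model, x, good_phi, zero)
auc_bad = deletion_auc(toy_model, x, bad_phi, zero)
```

Under this convention a lower deletion AUC means the attribution ranks the most influential features first, so `auc_good` comes out below `auc_bad`. Neither number says anything about whether an analyst finds the explanation clear, which is the gap the study documents.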

Data & Methods

Datasets: Five risk/tabular datasets — Maternal, Credit, HELOC, Adult, plus a proprietary real‑world fraud dataset used inside a sandboxed production review pipeline.
Models: Logistic Regression and LightGBM (industry‑standard, low‑latency models).
Shapley formulations: Eight semantic variants differentiated by the background distribution for "feature absence" (zero/mean baselines, uniform hyperbox, empirical marginal, joint‐marginal, conditional, counterfactual, filtered conditional).
Computational control: Amortized attribution model trained to minimize a weighted least squares loss across coalitions; validated against high‑sample KernelSHAP per‑formulation references (measured MSE, Recall@k).
Quantitative metrics: Deletion AUC, Insertion AUC, perturbation sensitivity, counterfactual contrastivity, sparsity (L1/L2 ratio), cross‑formulation rank agreement.
Human study design: Blinded randomized within‑subjects design; identical interface showing model score, Shapley bar chart, and natural‑language reason codes; participants made binary risk/no‑risk decisions and reported confidence/clarity. Mixed‑effects models controlled for analyst experience, case difficulty, model entropy and error; aggregated across datasets and models.
Scale/outputs: 37 participants, 3,735 logged reviews; results reported with bootstrapped errors and mixed‑effects inference (odds ratios, multiplicative effects).
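The weighted least squares loss over coalitions mentioned under "Computational control" corresponds, in the classical KernelSHAP form, to regressing coalition values on membership indicators with Shapley kernel weights. The sketch below solves that regression exactly by enumerating every coalition for a small feature count. It is not the paper's amortized model, which would be trained to predict such solutions directly. The `background` argument is an assumption made for illustration: a single row behaves like a fixed baseline, several rows behave like an empirical marginal background. A large finite weight stands in for the infinite weight on the empty and full coalitions.

```python
import itertools
import math

import numpy as np


def shapley_kernel_weight(d, s):
    """Shapley kernel weight for a coalition of size s out of d features."""
    if s == 0 or s == d:
        return 1e6  # finite stand-in for the infinite weight (assumption)
    return (d - 1) / (math.comb(d, s) * s * (d - s))


def coalition_value(model, x, background, mask):
    """v(S): mean model output with features outside S taken from background rows."""
    hybrid = np.where(mask, x, background)  # broadcasts x across background rows
    return float(np.mean([model(row) for row in hybrid]))


def kernel_shap_exact(model, x, background):
    """Solve the Shapley-kernel weighted least squares over all 2^d coalitions."""
    x = np.asarray(x, dtype=float)
    background = np.atleast_2d(np.asarray(background, dtype=float))
    d = x.size
    masks = np.array(list(itertools.product([0, 1], repeat=d)), dtype=bool)
    weights = np.array([shapley_kernel_weight(d, int(m.sum())) for m in masks])
    targets = np.array([coalition_value(model, x, background, m) for m in masks])
    design = np.column_stack([np.ones(len(masks)), masks.astype(float)])
    sw = np.sqrt(weights)
    coef, *_ = np.linalg.lstsq(design * sw[:, None], targets * sw, rcond=None)
    return float(coef[0]), coef[1:]  # (base value, per-feature attributions)


# Hypothetical linear scorer: attributions should equal w * (x - background mean).
w = np.array([2.0, -1.0, 0.5])
linear = lambda v: float(w @ v) + 1.0
x = np.array([1.0, 2.0, 3.0])
base_fixed, phi_fixed = kernel_shap_exact(linear, x, np.zeros(3))
bg = np.array([[0.0, 0.0, 0.0], [2.0, 2.0, 2.0]])
base_marg, phi_marg = kernel_shap_exact(linear, x, bg)
```

Swapping only the `background` changes the attributions while the estimator stays fixed, which is the kind of like-for-like comparison across formulations that the unified framework is meant to enable. Exhaustive enumeration scales as 2^d, so this is only practical for a handful of features.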

Implications for AI Economics

Evaluation economics: Common offline proxies (sparsity, deletion/insertion AUC, "faithfulness") do not reliably predict how explanations affect human decisions or welfare. Economic evaluations of XAI investments should not assume metric improvements translate into improved decision outcomes.
Automation bias as an economic externality: Explanations that increase user confidence without improving accuracy can lead to overreliance on automated outputs, raising operational risk, expected losses, and potential regulatory liability in finance, healthcare, and other high‑stakes domains.
Product and deployment choices:
- Empirical marginal and joint‑marginal Shapley variants offer a balanced trade‑off and may be preferable when stability and interpretability matter.
- Sparse/contrastive formulations (filtered conditional, counterfactual) can look more informative but may be unstable and encourage misplaced confidence; use with caution in high‑stakes pipelines.
- Fixed baselines (e.g., zero) can artificially inflate "faithfulness" metrics and top‑feature recall but may be semantically misleading for users.
- Amortized attribution is viable for production (low latency) and enables like‑for‑like comparisons across formulations.
Cost–benefit and governance: Firms should factor in the cost of additional risk (automation bias) when selecting XAI solutions. Human‑centered validation (measuring confidence–accuracy gaps, decision outcomes) is essential for estimating the true value of explanation investments and for compliance evidence.
Market and policy effects: Regulators and standard setters should consider behavioral metrics (confidence, clarity, automation bias) as part of XAI assessment frameworks. Mandating only model‑centric proxies will misalign incentives.
Research & measurement agenda for AI economics:
- Prioritize human‑in‑the‑loop, behaviorally‑grounded benchmarks when valuing XAI technologies.
- Develop economic models that incorporate the confidence–accuracy gap and downstream loss functions (e.g., false positives/negatives cost asymmetries).
- Study long‑run effects (learning, calibration, strategic adaptation) and cross‑firm externalities from deploying explanations that alter human trust dynamics.
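The cost asymmetry in the second agenda item can be made concrete with a small expected-loss calculation. The sketch below is illustrative only: the function name, the error counts, and the unit costs are hypothetical and are not taken from the study.

```python
def expected_cost_per_review(false_pos, false_neg, total, cost_fp, cost_fn):
    """Average cost per reviewed case given error counts and per-error costs."""
    if total <= 0:
        raise ValueError("total must be positive")
    return (false_pos * cost_fp + false_neg * cost_fn) / total


# Hypothetical conditions with identical accuracy (85 of 100 cases correct)
# but a different mix of errors. A missed fraud case (false negative) is
# assumed to cost 20 times as much as a needless escalation (false positive).
cost_a = expected_cost_per_review(10, 5, 100, cost_fp=1.0, cost_fn=20.0)
cost_b = expected_cost_per_review(5, 10, 100, cost_fp=1.0, cost_fn=20.0)
```

Two conditions with the same accuracy can differ substantially in expected loss once the error mix is priced, which is why an accuracy comparison alone may understate the cost of misplaced confidence.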

Actionable takeaway for practitioners: do not rely solely on offline explainability metrics to choose a Shapley formulation for operational decision systems. Instead, run targeted human evaluations (measure accuracy, time, and the confidence–accuracy gap) under production constraints; favor empirical marginal/joint‑marginal variants for balanced behavior; and use amortized estimators to meet latency requirements while enabling fair comparisons.
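One way to track the confidence–accuracy gap from logged reviews is sketched below. It assumes each review is a record with a condition label, a correctness flag, and a confidence score rescaled to the 0 to 1 range. These field names, the scale, and the toy records are assumptions for illustration; the study's own logging schema and confidence scale are not described here.

```python
from collections import defaultdict


def confidence_accuracy_gap(reviews):
    """Per-condition mean confidence, accuracy, and their difference (gap)."""
    totals = defaultdict(lambda: [0.0, 0.0, 0])  # [confidence sum, correct count, n]
    for r in reviews:
        acc = totals[r["condition"]]
        acc[0] += float(r["confidence"])
        acc[1] += 1.0 if r["correct"] else 0.0
        acc[2] += 1
    return {
        cond: {
            "mean_confidence": s[0] / s[2],
            "accuracy": s[1] / s[2],
            "gap": s[0] / s[2] - s[1] / s[2],
        }
        for cond, s in totals.items()
    }


# Toy records (made up): same accuracy in both conditions, higher confidence
# when an explanation is shown.
toy_reviews = [
    {"condition": "no_explanation", "correct": True, "confidence": 0.6},
    {"condition": "no_explanation", "correct": False, "confidence": 0.6},
    {"condition": "with_explanation", "correct": True, "confidence": 0.9},
    {"condition": "with_explanation", "correct": False, "confidence": 0.9},
]
gaps = confidence_accuracy_gap(toy_reviews)
```

A gap that widens when explanations are shown while accuracy stays flat is the pattern the study flags as an automation-bias risk. A production analysis would additionally control for analyst and case effects, as the mixed-effects models in the study do.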

Assessment

Paper Type: quasi_experimental
Evidence Strength: medium. Large-scale, real-world evaluation with professional analysts (3,735 case reviews) and multiple datasets provides strong empirical signal about human impacts, but causal claims are weakened by limited information about randomization/blinding, potential selection and learning effects, domain restriction to risk/fraud workflows, and reliance on subjective measures alongside objective outcomes.
Methods Rigor: medium. The study uses a unified amortized implementation and multiple datasets, and measures both objective and subjective outcomes, which is methodologically robust; however, the description lacks explicit statements about random assignment, pre-registration, blinding, counterbalancing, and how analysts/cases were sampled or aggregated, leaving open possible confounds and experimenter-demand or ordering effects.
Sample: Empirical evaluation across four risk datasets and a realistic fraud-detection environment involving professional analysts who reviewed 3,735 cases; compares eight Shapley-value explanation formulations under operational low-latency constraints, measuring objective analyst performance (decisions), subjective clarity and confidence, and standard quantitative explanation metrics (e.g., sparsity, faithfulness).
Themes: human_ai_collab, adoption, org_design
Identification: Controlled comparison of eight Shapley explanation variants in a unified amortized framework: analysts reviewed cases under different explanation formulations and outcomes (objective performance, confidence, clarity ratings) were compared across variants while accounting for case characteristics and analyst-level effects to isolate formulation-specific differences.
Generalizability:
- Single domain: fraud/risk detection only; results may not transfer to other decision tasks (medical, hiring, creative, coding).
- Professional analysts from specific organisations; the sample may not represent other user populations (lay users, different expertise levels).
- Specific model(s), dataset(s), and implementation/latency constraints used; alternate models or explanation implementations could behave differently.
- Cultural, regulatory and workflow differences across organizations may change decision dynamics and automation bias.
- Findings about subjective confidence vs. accuracy may depend on case mix, stakes, and training that vary across settings.

Claims (8)

Claim	Category	Direction	Confidence	Outcome	Details
We conduct a large-scale empirical evaluation across four risk datasets and a realistic fraud-detection environment involving professional analysts and 3,735 case reviews.	Other	null_result	high	number of case reviews / scale of empirical evaluation	n=3735 0.8
Standard quantitative metrics, such as sparsity and faithfulness, are decoupled from human-perceived clarity and decision utility.	Decision Quality	null_result	high	correlation/alignment between quantitative explanation metrics (sparsity, faithfulness) and human-perceived clarity/decision utility	n=3735 0.48
No formulation improved objective analyst performance.	Decision Quality	null_result	high	objective analyst performance (e.g., accuracy on case reviews)	n=3735 0.48
Explanations consistently increased decision confidence, signaling a critical risk of automation bias in high-stakes settings.	Worker Satisfaction	positive	high	decision confidence (self-reported)	n=3735 0.48
We use a unified amortized framework to isolate semantic differences between eight Shapley variants under the low-latency constraints of operational risk workflows.	Other	null_result	high	ability to isolate semantic differences among Shapley variants under low-latency constraints	0.48
The proliferation into competing Shapley formulations has created a fragmented landscape with little consensus on practical deployment.	Other	negative	medium	degree of consensus on practical deployment of Shapley formulations	0.05
Current evaluation proxies are insufficient for predicting downstream human impact.	Decision Quality	negative	high	predictive validity of quantitative evaluation proxies for human impact	n=3735 0.48
We provide evidence-based guidance for selecting formulations and metrics in operational decision systems.	Organizational Efficiency	positive	high	availability of practical guidance for selection of explanation formulations and metrics	0.08