Shapley explanation variants raise analyst confidence without improving accuracy: in 3,735 fraud-case reviews, popular quantitative explanation proxies (sparsity, faithfulness) did not predict human-perceived clarity or decision utility, creating a risk of automation bias in operational decision systems.
Shapley values are a cornerstone of explainable AI, yet their proliferation into competing formulations has created a fragmented landscape with little consensus on practical deployment. While theoretical differences are well-documented, evaluation remains reliant on quantitative proxies whose alignment with human utility is unverified. In this work, we use a unified amortized framework to isolate semantic differences between eight Shapley variants under the low-latency constraints of operational risk workflows. We conduct a large-scale empirical evaluation across four risk datasets and a realistic fraud-detection environment involving professional analysts and 3,735 case reviews. Our results reveal a fundamental misalignment: standard quantitative metrics, such as sparsity and faithfulness, are decoupled from human-perceived clarity and decision utility. Furthermore, while no formulation improved objective analyst performance, explanations consistently increased decision confidence, signaling a critical risk of automation bias in high-stakes settings. These findings suggest that current evaluation proxies are insufficient for predicting downstream human impact, and we provide evidence-based guidance for selecting formulations and metrics in operational decision systems.
Summary
Main Finding
Standard quantitative XAI benchmarks for Shapley-based attributions (e.g., sparsity, faithfulness, deletion/insertion AUC) are poor predictors of human utility in high‑stakes, low‑latency decision workflows. Across 3,735 professional case reviews, no Shapley formulation improved objective analyst accuracy or review time, yet explanations consistently increased analysts’ confidence — revealing a pronounced automation‑bias risk. The study shows a fundamental decoupling between algorithmic proxies and downstream human outcomes, and provides guidance on which Shapley semantics behave more favorably in production settings.
Key Points
- The paper audits eight Shapley formulations (fixed zero and mean baselines, uniform, marginal, joint‑marginal, conditional, counterfactual, filtered conditional) under a unified amortized estimation framework to remove implementation confounders.
- Quantitative metrics exhibit structured trade‑offs:
  - Fixed zero baseline yields strong deletion AUC and top‑feature recall but is semantically misaligned with empirical background sampling.
  - Marginal and joint‑marginal variants give balanced performance (moderate sparsity, low sensitivity).
  - Filtered conditional and some counterfactual variants maximize sparsity/contrastivity and insertion AUC but are unstable (sensitive to perturbations).
- Amortization (a learned universal approximator) reproduced KernelSHAP ground truth with high fidelity, enabling millisecond inference suitable for production SLAs.
- Human study (37 analysts, 3,735 reviews across 5 risk datasets including a production fraud dataset) used a blinded within‑subjects design with identical UI/interaction; measured decision accuracy, time, self‑reported confidence and clarity.
- Main behavioral results:
  - No Shapley formulation produced reliable accuracy gains or faster decision times.
  - Explanations systematically increased analyst confidence (odds ratios > 1 for several variants) without corresponding accuracy improvements.
  - Perceived clarity varied by formulation (joint‑marginal often rated clearer; zero baseline, uniform, and filtered conditional often rated confusing).
- The authors release the interaction dataset and code to support reproducibility and further behavioral XAI benchmarking.
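The point that formulations differ mainly in how "feature absence" is filled in can be illustrated with a small sketch. It computes exact Shapley values by brute-force coalition enumeration for a toy model, once with a fixed zero baseline and once with a small empirical background set. Everything here is an assumption made for illustration: `toy_model`, the input `x`, and both background sets are hypothetical, and the code is neither the paper's amortized estimator nor its exact definition of any of the eight variants.

```python
from itertools import combinations
from math import factorial


def coalition_value(model, x, coalition, background):
    """Mean model output when features in `coalition` keep x's values
    and every other feature is filled in from each background row."""
    total = 0.0
    for row in background:
        mixed = [x[i] if i in coalition else row[i] for i in range(len(x))]
        total += model(mixed)
    return total / len(background)


def exact_shapley(model, x, background):
    """Exact Shapley values by enumerating all coalitions (small n only)."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Classic Shapley weight |S|! (n - |S| - 1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                with_i = coalition_value(model, x, s | {i}, background)
                without_i = coalition_value(model, x, s, background)
                phi[i] += weight * (with_i - without_i)
    return phi


def toy_model(z):
    # Hypothetical model: one additive term plus one interaction.
    return 2.0 * z[0] + z[1] * z[2]


x = [1.0, 2.0, 3.0]
zero_background = [[0.0, 0.0, 0.0]]                         # fixed zero baseline
empirical_background = [[0.0, 0.0, 0.0], [2.0, 2.0, 2.0]]   # made-up data rows

phi_zero = exact_shapley(toy_model, x, zero_background)
phi_empirical = exact_shapley(toy_model, x, empirical_background)
```

With these made-up inputs the zero baseline attributes roughly `[2, 3, 3]`, while the empirical background attributes roughly `[0, 1.5, 2.5]`: the same model and the same instance produce a different ranking of the first feature purely because of the background choice.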
Data & Methods
- Datasets: Five risk/tabular datasets — Maternal, Credit, HELOC, Adult, plus a proprietary real‑world fraud dataset used inside a sandboxed production review pipeline.
- Models: Logistic Regression and LightGBM (industry‑standard, low‑latency models).
- Shapley formulations: Eight semantic variants differentiated by the background distribution for "feature absence" (zero/mean baselines, uniform hyperbox, empirical marginal, joint‐marginal, conditional, counterfactual, filtered conditional).
- Computational control: Amortized attribution model trained to minimize a weighted least squares loss across coalitions; validated against high‑sample KernelSHAP per‑formulation references (measured MSE, Recall@k).
- Quantitative metrics: Deletion AUC, Insertion AUC, perturbation sensitivity, counterfactual contrastivity, sparsity (L1/L2 ratio), cross‑formulation rank agreement.
- Human study design: Blinded randomized within‑subjects design; identical interface showing model score, Shapley bar chart, and natural‑language reason codes; participants made binary risk/no‑risk decisions and reported confidence/clarity. Mixed‑effects models controlled for analyst experience, case difficulty, model entropy and error; aggregated across datasets and models.
- Scale/outputs: 37 participants, 3,735 logged reviews; results reported with bootstrapped errors and mixed‑effects inference (odds ratios, multiplicative effects).
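Two of the offline metrics listed above can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: it assumes sparsity is the ratio of the L1 norm to the L2 norm of the attribution vector (lower means more concentrated) and that deletion AUC is the area under the model score as features are replaced by a baseline in descending order of absolute attribution, with the x-axis normalized to [0, 1]. The helper names, the linear model, and the inputs are all hypothetical.

```python
from math import sqrt


def l1_l2_ratio(phi):
    """Sparsity proxy: ||phi||_1 / ||phi||_2 (1.0 for a one-hot vector)."""
    l1 = sum(abs(p) for p in phi)
    l2 = sqrt(sum(p * p for p in phi))
    return l1 / l2 if l2 > 0 else 0.0


def deletion_curve(model, x, phi, baseline):
    """Model scores as features are replaced by the baseline,
    most important (largest |phi|) first."""
    order = sorted(range(len(x)), key=lambda i: abs(phi[i]), reverse=True)
    current = list(x)
    scores = [model(current)]
    for i in order:
        current[i] = baseline[i]
        scores.append(model(current))
    return scores


def trapezoid_auc(scores):
    """Trapezoidal area under the curve with the x-axis scaled to [0, 1]."""
    steps = len(scores) - 1
    return sum((scores[k] + scores[k + 1]) / 2 for k in range(steps)) / steps


def linear_model(z):
    # Hypothetical model used only for this illustration.
    return 3.0 * z[0] + 1.0 * z[1] + 0.0 * z[2]


x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
faithful_phi = [3.0, 1.0, 0.0]    # matches the true feature contributions
reversed_phi = [0.0, 1.0, 3.0]    # deliberately wrong ranking

auc_faithful = trapezoid_auc(deletion_curve(linear_model, x, faithful_phi, baseline))
auc_reversed = trapezoid_auc(deletion_curve(linear_model, x, reversed_phi, baseline))
```

Under these assumptions the correctly ranked attribution yields the lower deletion AUC, because the score collapses as soon as the most important feature is removed. Note that the metric depends on the chosen baseline, which is one way a fixed-baseline formulation can score well on it by construction.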
Implications for AI Economics
- Evaluation economics: Common offline proxies (sparsity, deletion/insertion AUC, "faithfulness") do not reliably predict how explanations affect human decisions or welfare. Economic evaluations of XAI investments should not assume metric improvements translate into improved decision outcomes.
- Automation bias as an economic externality: Explanations that increase user confidence without improving accuracy can lead to overreliance on automated outputs, raising operational risk, expected losses, and potential regulatory liabilities in finance, healthcare, and other high‑stakes domains.
- Product and deployment choices:
  - Empirical marginal and joint‑marginal Shapley variants offer a balanced trade‑off and may be preferable when stability and interpretability matter.
  - Sparse/contrastive formulations (filtered conditional, counterfactual) can look more informative but may be unstable and encourage misplaced confidence; use with caution in high‑stakes pipelines.
  - Fixed baselines (e.g., zero) can artificially inflate "faithfulness" metrics and top‑feature recall but may be semantically misleading for users.
  - Amortized attribution is viable for production (low latency) and enables like‑for‑like comparisons across formulations.
- Cost–benefit and governance: Firms should factor in the cost of additional risk (automation bias) when selecting XAI solutions. Human‑centered validation (measuring confidence–accuracy gaps, decision outcomes) is essential for estimating the true value of explanation investments and for compliance evidence.
- Market and policy effects: Regulators and standard setters should consider behavioral metrics (confidence, clarity, automation bias) as part of XAI assessment frameworks. Mandating only model‑centric proxies will misalign incentives.
- Research & measurement agenda for AI economics:
  - Prioritize human‑in‑the‑loop, behaviorally‑grounded benchmarks when valuing XAI technologies.
  - Develop economic models that incorporate the confidence–accuracy gap and downstream loss functions (e.g., false positives/negatives cost asymmetries).
  - Study long‑run effects (learning, calibration, strategic adaptation) and cross‑firm externalities from deploying explanations that alter human trust dynamics.
Actionable takeaway for practitioners: do not rely solely on offline explainability metrics to choose a Shapley formulation for operational decision systems. Instead, run targeted human evaluations (measure accuracy, time, and the confidence–accuracy gap) under production constraints; favor empirical marginal/joint‑marginal variants for balanced behavior; and use amortized estimators to meet latency requirements while enabling fair comparisons.
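As a rough sketch of what such a human evaluation could compute, the snippet below derives a confidence–accuracy gap and a cost-weighted error rate from a review log. The record layout (`decision`, `label`, `confidence`), the cost values, and the eight sample reviews are invented for illustration; they are not the study's data or its mixed-effects analysis.

```python
def accuracy(reviews):
    """Share of reviews where the analyst's decision matches the label."""
    return sum(r["decision"] == r["label"] for r in reviews) / len(reviews)


def confidence_accuracy_gap(reviews):
    """Mean self-reported confidence (0 to 1) minus accuracy.
    A positive gap suggests overconfidence."""
    mean_conf = sum(r["confidence"] for r in reviews) / len(reviews)
    return mean_conf - accuracy(reviews)


def mean_cost(reviews, cost_fp, cost_fn):
    """Average cost per review under asymmetric error costs."""
    total = 0.0
    for r in reviews:
        if r["decision"] == 1 and r["label"] == 0:
            total += cost_fp      # false positive: legitimate case flagged
        elif r["decision"] == 0 and r["label"] == 1:
            total += cost_fn      # false negative: risky case missed
    return total / len(reviews)


# Hypothetical logs: identical decisions, higher confidence with explanations.
control = [
    {"decision": 1, "label": 1, "confidence": 0.7},
    {"decision": 0, "label": 0, "confidence": 0.8},
    {"decision": 0, "label": 1, "confidence": 0.7},
    {"decision": 1, "label": 1, "confidence": 0.8},
]
explained = [dict(r, confidence=0.9) for r in control]

gap_control = confidence_accuracy_gap(control)
gap_explained = confidence_accuracy_gap(explained)
cost_control = mean_cost(control, cost_fp=1.0, cost_fn=10.0)
cost_explained = mean_cost(explained, cost_fp=1.0, cost_fn=10.0)
```

In this made-up log the explained condition widens the gap from about 0 to about 0.15 while accuracy and mean cost stay unchanged, which is the pattern the summary describes as an automation-bias risk. A real analysis would also need to control for analyst and case effects, as the study does with mixed-effects models.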
Assessment
Claims (8)
| Claim | Category | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|---|
| We conduct a large-scale empirical evaluation across four risk datasets and a realistic fraud-detection environment involving professional analysts and 3,735 case reviews. | Other | null_result | high | number of case reviews / scale of empirical evaluation | n=3735; 0.8 |
| Standard quantitative metrics, such as sparsity and faithfulness, are decoupled from human-perceived clarity and decision utility. | Decision Quality | null_result | high | correlation/alignment between quantitative explanation metrics (sparsity, faithfulness) and human-perceived clarity/decision utility | n=3735; 0.48 |
| No formulation improved objective analyst performance. | Decision Quality | null_result | high | objective analyst performance (e.g., accuracy on case reviews) | n=3735; 0.48 |
| Explanations consistently increased decision confidence, signaling a critical risk of automation bias in high-stakes settings. | Worker Satisfaction | positive | high | decision confidence (self-reported) | n=3735; 0.48 |
| We use a unified amortized framework to isolate semantic differences between eight Shapley variants under the low-latency constraints of operational risk workflows. | Other | null_result | high | ability to isolate semantic differences among Shapley variants under low-latency constraints | 0.48 |
| The proliferation into competing Shapley formulations has created a fragmented landscape with little consensus on practical deployment. | Other | negative | medium | degree of consensus on practical deployment of Shapley formulations | 0.05 |
| Current evaluation proxies are insufficient for predicting downstream human impact. | Decision Quality | negative | high | predictive validity of quantitative evaluation proxies for human impact | n=3735; 0.48 |
| We provide evidence-based guidance for selecting formulations and metrics in operational decision systems. | Organizational Efficiency | positive | high | availability of practical guidance for selection of explanation formulations and metrics | 0.08 |