Agentic AI trims customer-service handling time but chips away at satisfaction: Taobao's experiment shows faster chats and improved attention to non-AI cases, yet customer ratings fall for AI-eligible chats. Human interventions patch technical failures effectively, but struggle to assuage emotionally escalated customers unless workers step in early and engage intensively.
Agentic AI systems that autonomously perform service tasks are entering customer service operations. However, limited evidence exists on how human interventions shape service outcomes when agentic AI failures create both cognitive and emotional consequences. We study this issue through a randomized field experiment on Alibaba's Taobao platform. Workers in the treatment condition supervised an agentic AI system that resolved AI-eligible chats while continuing to handle AI-ineligible chats, whereas control workers resolved all chats without agentic AI. The findings show that AI deployment reduces average chat duration and has limited effects on retrial rates, but substantially lowers ratings for AI-eligible chats. Moreover, human intervention effectiveness in AI-eligible chats depends on the nature of AI failure, post-escalation intervention effort, and intervention timing. Human intervention preserves service quality in algorithm-triggered technical escalations, i.e., unresolved customer issues beyond the AI's capability, but is less effective in algorithm-triggered emotional escalations, i.e., where customers express frustration or dissatisfaction. These differences are partly explained by variation in workers' post-escalation intervention effort across escalation types. In algorithm-triggered emotional escalations, workers showed lower engagement: they sent fewer messages, contributed a smaller share of total chat rounds, and showed less proactivity in information seeking and solution provision. We further find that early intervention is essential for sustaining high post-escalation intervention effort. Finally, we document a positive spillover effect on AI-ineligible chats, as treated workers adapted their multitasking workflow to devote greater attention to these chats. These findings offer implications for human-in-the-loop process design in human-AI collaboration systems.
Summary
Main Finding
Deploying agentic AI to autonomously handle a subset of standardized chats speeds up service (shorter chat duration) and produces positive workload spillovers onto human-handled (AI-ineligible) chats, but substantially reduces customer ratings for AI-eligible chats. Human-in-the-loop interventions salvage quality when AI failures are technical (capability gaps) but are far less effective when AI failures produce emotional escalation; timely intervention and higher post-escalation human effort are crucial to recovery.
Key Points
- Experiment: randomized field experiment at Alibaba (Taobao), Aug 2024.
- Subjects: 647 customer-service workers (302 treatment, 345 control).
- Data: 680,676 chats (115,243 rated).
- Treatment: workers supervised an agentic AI for AI-eligible chats (<10% of volume) and continued to handle AI-ineligible chats themselves. Control: fully human handling.
- Primary outcomes:
  - Service speed: average chat duration falls under AI deployment.
  - Service quality: large decline in customer ratings for AI-eligible chats; limited change in retrial rates.
- Heterogeneity by escalation type:
  - Algorithm-triggered technical escalations (AI cannot resolve issue): human intervention largely preserves service quality.
  - Algorithm-triggered emotional escalations (customer frustration/dissatisfaction): human intervention is much less effective; ratings remain low.
- Mechanisms:
  - Post-escalation human effort varies by escalation type. In emotional escalations, workers show lower engagement (fewer messages, lower share of rounds, less proactive information seeking/solutioning).
  - Timing matters: earlier human takeover (before negative sentiment becomes entrenched) preserves worker effort and improves recovery chances.
- Spillovers and multitasking:
  - Positive spillover on AI-ineligible chats: treated workers reallocate attention toward these chats and spend less time away from them while multitasking, improving speed without materially reducing overall quality.
- Process-design tradeoffs:
  - Specialized supervisory roles vs integrated roles: specialization can improve handling of AI-eligible chats but risks emotional-resource depletion and skill erosion; integrated roles generate beneficial workload reallocation but expose firms to lower ratings when interventions are late or in emotional cases.
Data & Methods
- Data sources: worker demographics, session logs (timestamps, IDs, issue categories), full chat transcripts, and worker activity logs (clicks, typing, viewing history).
- Key metrics:
  - Service speed: chat duration (mean ≈ 462.36 s).
  - Service quality: customer rating (5‑point scale; mean ≈ 3.57; 17% of chats rated) and retrial (contact again within 7 days; 43%).
  - Process measures: total chat rounds (mean 7.24), message count (mean 13.82), word count (mean 202.35), response delays (avg reply lag ≈ 14.76 s, cumulative ≈ 107.19 s).
  - Multitasking: concurrency (simultaneous chats), total active time on focal chat, away time from focal chat.
- Experimental design:
  - Worker-level random assignment to treatment vs control; 14-day pre-treatment baseline, 17-day treatment window.
  - Human-in-the-loop deployment: AI handled AI-eligible chats with algorithmic and human triggers for escalation; supervisors could also manually escalate.
  - Compensation: piece-rate for chats; AI-eligible chats contributed equivalently to workload and pay calculations.
- Empirical strategy:
  - Worker-level difference-in-differences comparing pre/post within workers across treatment and control.
  - Subsample analyses by escalation type (algorithm-triggered technical vs emotional; human-initiated escalations), by timing of takeover, and by post-escalation engagement measures derived from transcripts and activity logs.
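The worker-level difference-in-differences comparison can be illustrated with a minimal Python sketch. This is not the authors' specification, which is not given here and likely includes fixed effects, controls, and clustered standard errors. It shows only the basic 2x2 logic: average each worker's outcome before and after deployment, take the within-worker change, and compare the mean change across arms. The function name `worker_did`, the tuple layout, and all numbers are hypothetical and chosen purely for illustration.

```python
from collections import defaultdict
from statistics import mean


def worker_did(chats):
    """Basic worker-level difference-in-differences estimate.

    `chats` is an iterable of (worker_id, treated, post, outcome) tuples,
    one per chat. This layout is an assumption made for this sketch.
    """
    outcomes = defaultdict(list)  # (worker_id, post) -> list of chat outcomes
    arm = {}                      # worker_id -> True if in the treatment arm
    for worker_id, treated, post, outcome in chats:
        outcomes[(worker_id, post)].append(outcome)
        arm[worker_id] = treated

    # Within-worker change (post mean minus pre mean), grouped by arm.
    changes = {True: [], False: []}
    for worker_id, treated in arm.items():
        pre = outcomes.get((worker_id, False))
        post = outcomes.get((worker_id, True))
        if pre and post:  # skip workers observed in only one period
            changes[treated].append(mean(post) - mean(pre))

    # Mean change among treated workers minus mean change among controls.
    return mean(changes[True]) - mean(changes[False])


# Made-up chat durations in seconds; NOT data from the study.
chats = [
    ("w1", True, False, 480.0), ("w1", True, False, 460.0),
    ("w1", True, True, 430.0), ("w1", True, True, 410.0),
    ("w2", True, False, 500.0), ("w2", True, True, 460.0),
    ("w3", False, False, 470.0), ("w3", False, True, 465.0),
    ("w4", False, False, 450.0), ("w4", False, True, 455.0),
]

effect = worker_did(chats)  # -45.0 with these made-up numbers
```

A negative value here means the outcome (chat duration in this toy example) fell more for treated workers than for control workers over the same period. Averaging within worker first gives each worker equal weight regardless of chat volume, which is one reasonable choice among several.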
Implications for AI Economics
- Productivity vs quality trade-offs:
  - Agentic AI can increase throughput and reduce time per interaction, but may impose hidden quality externalities (lower customer satisfaction) that standard productivity metrics miss. Economic evaluations of AI adoption must account for both speed gains and quality losses.
- Non-cognitive failure costs matter:
  - AI failures can produce emotional spillovers that raise the cost of later human remediation. Models of AI deployment should incorporate emotional-state dynamics and repair costs, not only error-correction probabilities.
- Human-in-the-loop design is endogenous:
  - The value of human supervision depends on failure type and intervention timing. Firms should optimize monitoring algorithms (earlier detection), escalation triggers, and front-line workflows to maximize the recoverable portion of AI failures.
- Incentives and effort allocation:
  - Workers reduce effort after emotionally charged AI failures. Compensation and task design must incentivize sustained post-escalation effort (e.g., bonuses for recovered ratings, response-delay targets, training for emotional recovery).
- Organizational structure and labor demand:
  - Hybrid roles (supervise AI + handle non-AI cases) can produce beneficial reallocation effects but introduce tradeoffs (emotional load, possible skill erosion). Predictions of labor demand and job redesign from AI should include these dynamic effects on task composition and worker skill accumulation.
- Measurement and welfare:
  - Studies and cost–benefit analyses should measure downstream effects (retrial, repeat purchases, long-run customer lifetime value) and the distributional impacts across customers and workers. Customer welfare losses from lower ratings may translate into revenue effects beyond immediate service metrics.
- Future empirical priorities for AI economics:
  - Longer-run experiments to detect skill dynamics and emotional-depletion effects.
  - Heterogeneity across communication modalities (voice vs text), customer types, payment schemes, and AI-eligible-task shares.
  - Designing incentive schemes and monitoring policies that mitigate emotional-escalation failures and preserve worker engagement.
- Policy relevance:
  - Consumer-protection standards and disclosure rules may need to consider emotional harms from AI interactions and set norms for escalation latency, transparency, and redress mechanisms.
Limitations to keep in mind: AI-eligible chats comprised under 10% of volume in this setting; treatment window was short (17 days); setting is specific to text-based e-commerce chat and Alibaba’s compensation and monitoring systems. Generalization requires testing across other firms, modalities, and longer horizons.
Assessment
Claims (9)
| Claim | Category | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|---|
| AI deployment reduces average chat duration. | Task Completion Time | negative | high | average chat duration | 1.0 |
| AI deployment has limited effects on retrial rates. | Error Rate | null_result | high | retrial rates (repeat contact rate) | 1.0 |
| AI deployment substantially lowers ratings for AI-eligible chats. | Output Quality | negative | high | customer ratings for AI-eligible chats | 1.0 |
| Human intervention preserves service quality in algorithm-triggered technical escalations (unresolved customer issues beyond the AI's capability). | Output Quality | positive | high | service quality after technical escalations | 0.6 |
| Human intervention is less effective in algorithm-triggered emotional escalations (where customers express frustration or dissatisfaction). | Output Quality | negative | high | service quality after emotional escalations | 0.6 |
| Differences in human intervention effectiveness across escalation types are partly explained by variation in workers' post-escalation intervention effort. | Task Allocation | mixed | high | post-escalation intervention effort and its mediating role on service outcomes | 0.6 |
| In algorithm-triggered emotional escalations, workers showed lower engagement: they sent fewer messages, contributed a smaller share of total chat rounds, and showed less proactivity in information seeking and solution provision. | Task Allocation | negative | high | worker engagement measures (message count, share of chat rounds, proactivity indicators) | 0.6 |
| Early intervention is essential for sustaining high post-escalation intervention effort. | Task Allocation | positive | high | post-escalation intervention effort as a function of intervention timing | 0.6 |
| There is a positive spillover effect on AI-ineligible chats: treated workers adapted their multitasking workflow to devote greater attention to these chats. | Task Allocation | positive | high | attention/effort devoted to AI-ineligible chats (spillover effect) | 0.6 |