High-quality LLM suggestions improve caseworkers' accuracy on SNAP eligibility questions by about 27 percentage points, while incorrect suggestions substantially reduce it; gains level off above roughly 80% chatbot accuracy, revealing an "AI underreliance plateau" that complicates real-world deployment.
Social service programs like the Supplemental Nutrition Assistance Program (SNAP, or food stamps) have eligibility rules that can be challenging to understand. For nonprofit caseworkers who often support clients in navigating a dozen or more complex programs, LLM-based chatbots may offer a means to provide better, faster help to clients whose situations may be less common. In this paper, we measure the potential effects of LLM-based chatbot suggestions on caseworkers' ability to provide accurate guidance. We first created a 770-question multiple-choice benchmark dataset of difficult, but realistic questions that a caseworker might receive. Next, using these benchmark questions and corresponding expert-verified answers, we conducted a randomized experiment with caseworkers recruited from nonprofit outreach organizations in Los Angeles. Caseworkers in the control condition did not see chatbot suggestions and had a mean accuracy of 49%. Caseworkers in the treatment condition saw chatbot suggestions that we artificially varied to range in aggregate accuracy from low (53%) to high (100%). Caseworker performance significantly improves as chatbot quality improves: high-quality chatbots (96-100% accurate) improved caseworker accuracy by 27 percentage points. At the question-level, incorrect chatbot suggestions substantially reduce caseworker accuracy, with a two-thirds reduction on easy questions where the control group performed best (without chatbot suggestions). Finally, improvements in caseworker accuracy level off as chatbot accuracy increases, a phenomenon that we call the "AI underreliance plateau," which is a concern for real-world deployment and highlights the importance of evaluating human-in-the-loop tools with their users.
Summary
Main Finding
LLM-based chatbot suggestions can substantially improve nonprofit caseworkers' accuracy in giving benefits eligibility guidance, but gains depend strongly on chatbot quality. High-quality chatbots (≈96–100% accurate) raised caseworker accuracy by about 27 percentage points, while incorrect chatbot suggestions can substantially harm performance. As chatbot accuracy rises, additional caseworker gains diminish — an "AI underreliance plateau" that limits returns from further model improvements unless human-AI interaction is addressed.
Key Points
- Benchmark: The authors built a 770-question multiple-choice benchmark of realistic, difficult questions a caseworker might receive about programs like SNAP.
- Experiment design: Randomized controlled trial with nonprofit caseworkers in Los Angeles. Ground truth answers were expert-verified.
- Baseline: Control (no chatbot suggestions) mean accuracy = 49%.
- Treatment: Chatbot suggestions were shown to caseworkers; the suggestions were artificially varied to have aggregate accuracies between 53% (low) and 100% (high).
- Positive effect: High-quality chatbots (96–100% accurate) increased caseworker accuracy by ~27 percentage points relative to control.
- Negative effect: Incorrect chatbot suggestions reduced caseworker accuracy substantially. On "easy" questions (where control performance was best), incorrect suggestions caused about a two-thirds reduction in accuracy, a pattern consistent with anchoring or automation bias.
- Diminishing returns: Improvements in caseworker accuracy level off as chatbot accuracy increases ("AI underreliance plateau"), indicating that even very accurate models may not produce proportional human performance gains.
- Implicit tradeoff: Both the probability of model error and human susceptibility to wrong suggestions matter; average model accuracy alone is not sufficient to predict net benefit.
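The tradeoff in the last point can be made concrete with a small sketch. By the law of total probability, expected caseworker accuracy is a weighted average of their accuracy when the suggestion is correct and their accuracy when it is incorrect. The function name and the two conditional accuracies below (0.80 and 0.20) are illustrative assumptions, not figures from the study; only the 49% control baseline comes from the source.

```python
def expected_accuracy(chatbot_acc: float,
                      acc_if_correct: float,
                      acc_if_incorrect: float) -> float:
    """Expected caseworker accuracy via the law of total probability.

    chatbot_acc      -- probability that the shown suggestion is correct
    acc_if_correct   -- caseworker accuracy when the suggestion is correct
    acc_if_incorrect -- caseworker accuracy when the suggestion is incorrect
    """
    if not 0.0 <= chatbot_acc <= 1.0:
        raise ValueError("chatbot_acc must be in [0, 1]")
    return chatbot_acc * acc_if_correct + (1.0 - chatbot_acc) * acc_if_incorrect


CONTROL_ACC = 0.49       # reported control-group mean accuracy
ACC_IF_CORRECT = 0.80    # hypothetical, for illustration only
ACC_IF_INCORRECT = 0.20  # hypothetical, for illustration only

for p in (0.53, 0.75, 0.96, 1.00):
    acc = expected_accuracy(p, ACC_IF_CORRECT, ACC_IF_INCORRECT)
    gain_pp = (acc - CONTROL_ACC) * 100
    print(f"chatbot {p:.0%} -> caseworker {acc:.1%} ({gain_pp:+.1f} pp vs control)")
```

Under these assumed values, even a perfect chatbot leaves caseworkers at 80%, because the ceiling is set by how often workers accept correct suggestions rather than by the model. That is one simple way to picture underreliance, though the study's observed curve need not follow this linear form.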
Data & Methods
- Dataset: 770 multiple-choice items designed to reflect difficult, realistic eligibility questions for social safety-net programs (e.g., SNAP).
- Ground truth: Answers verified by subject-matter experts.
- Participants: Caseworkers recruited from nonprofit outreach organizations in Los Angeles (field-relevant users).
- Intervention: Randomized assignment to control (no suggestions) or treatment (saw chatbot suggestions). Treatment chatbot outputs were simulated and systematically varied to produce different aggregate accuracies (53% up to 100%) so causal effects of suggestion quality could be measured.
- Outcomes: Caseworker answer accuracy overall and at the question level; analysis of how correct vs incorrect suggestions affected performance and how gains scaled with chatbot accuracy.
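One way to picture the intervention is as a procedure that fixes the share of correct suggestions in advance. The sketch below is an assumption about how such a suggestion set could be built, not the authors' procedure, which this summary does not describe; `make_suggestions` and its parameters are hypothetical names.

```python
import random


def make_suggestions(correct_answers, options, target_acc, seed=0):
    """Build one suggestion per question with a fixed aggregate accuracy.

    Exactly round(target_acc * n) questions receive the correct answer;
    the rest receive a randomly chosen wrong option.
    """
    rng = random.Random(seed)
    n = len(correct_answers)
    n_correct = round(target_acc * n)
    # Choose which questions will be shown a correct suggestion.
    correct_idx = set(rng.sample(range(n), n_correct))

    suggestions = []
    for i, answer in enumerate(correct_answers):
        if i in correct_idx:
            suggestions.append(answer)
        else:
            wrong_options = [o for o in options if o != answer]
            suggestions.append(rng.choice(wrong_options))
    return suggestions


# Toy answer key (hypothetical): 100 questions with options A-D.
answer_key = ["A", "B", "C", "D"] * 25
shown = make_suggestions(answer_key, ["A", "B", "C", "D"], target_acc=0.53)
share_correct = sum(s == a for s, a in zip(shown, answer_key)) / len(answer_key)
print(f"aggregate suggestion accuracy: {share_correct:.0%}")
```

Fixing the number of correct suggestions, rather than flipping an independent coin per question, keeps the aggregate accuracy of each treatment arm at its target value.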
Implications for AI Economics
- Value of accuracy is nonlinear: Returns to investing in model accuracy are large up to a point (substantial gains from low→high accuracy) but exhibit diminishing marginal benefits because humans underuse even highly accurate suggestions. Economic models of AI value should incorporate human adoption/underreliance dynamics, not just model performance.
- Externalities of errors: Incorrect suggestions can impose large negative externalities (reduced worker accuracy, potential harm to vulnerable clients). Cost–benefit analyses must weigh the asymmetric harm of errors, particularly in high-stakes public-service settings.
- Design and deployment matter economically: Interfaces, uncertainty communication, worker training, and trust calibration (e.g., show confidences, provide explanations, allow easy verification) could increase effective uptake and shift the underreliance plateau, raising realized returns to model improvements.
- Incentives & regulation: Procurement and oversight of assistive AI in social services should set accuracy thresholds, require evaluation with real users, and monitor worst-case error modes. Contracts or subsidies to improve UI/education may be as important as spending to improve base model accuracy.
- Labor productivity and distributional effects: High-quality assistive AI can raise caseworker productivity and expand access to accurate guidance, especially for uncommon client situations. But intermediate-accuracy systems risk making outcomes worse for clients of less-experienced workers or on straightforward cases if errors anchor decisions.
- Measurement for policy: Evaluations for public-sector AI should use user-in-the-loop randomized trials (like this study) and question-level analysis, rather than model-only benchmarks, to estimate true social value and risks.
- Research priorities: Study interventions to overcome underreliance (training, calibrated confidence scores, explanation quality), measure downstream client outcomes and welfare, and quantify the economic threshold where additional model accuracy yields negligible incremental social benefit.
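The risk that intermediate-accuracy systems make outcomes worse can be framed as a break-even calculation. If expected accuracy is `p * acc_if_correct + (1 - p) * acc_if_incorrect` for chatbot accuracy `p`, setting it equal to the unassisted baseline gives `p = (baseline - acc_if_incorrect) / (acc_if_correct - acc_if_incorrect)`. This is a minimal sketch under that linear assumption. Apart from the 49% control mean and the two-thirds reduction on easy questions, every number below is hypothetical.

```python
def break_even_chatbot_accuracy(baseline, acc_if_correct, acc_if_incorrect):
    """Chatbot accuracy at which assisted accuracy equals the baseline.

    Solves p*acc_if_correct + (1-p)*acc_if_incorrect == baseline for p.
    A result above 1 means no chatbot accuracy reaches the baseline;
    a result below 0 means any chatbot accuracy beats it.
    """
    spread = acc_if_correct - acc_if_incorrect
    if spread <= 0:
        raise ValueError("acc_if_correct must exceed acc_if_incorrect")
    return (baseline - acc_if_incorrect) / spread


# Overall: 49% control mean (from the study); conditionals are hypothetical.
overall = break_even_chatbot_accuracy(0.49, acc_if_correct=0.80, acc_if_incorrect=0.20)

# Easy questions: a hypothetical 90% baseline, cut by two-thirds (from the
# study) when the suggestion is wrong; 95% when it is right is hypothetical.
easy_baseline = 0.90
easy = break_even_chatbot_accuracy(
    easy_baseline,
    acc_if_correct=0.95,
    acc_if_incorrect=easy_baseline * (1 - 2 / 3),
)

print(f"break-even chatbot accuracy, overall: {overall:.0%}")
print(f"break-even chatbot accuracy, easy questions: {easy:.0%}")
```

With these assumed inputs the overall break-even is about 48%, while on easy questions it is about 92%. The gap illustrates why a system that helps on average could still hurt on straightforward cases.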
Assessment
Claims (10)
| Claim | Category | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|---|
| We created a 770-question multiple-choice benchmark dataset of difficult, but realistic questions that a caseworker might receive. | Other | null_result | high | benchmark dataset size and content (770 multiple-choice questions) | n=770; 770 multiple-choice questions (benchmark size); 1.0 |
| The benchmark questions have corresponding expert-verified answers. | Other | null_result | high | availability of expert-verified reference answers for benchmark questions | n=770; expert-verified answers available for benchmark; 1.0 |
| We conducted a randomized experiment with caseworkers recruited from nonprofit outreach organizations in Los Angeles. | Other | null_result | high | execution of a randomized experiment with nonprofit caseworker participants (location: Los Angeles) | randomized experiment conducted (sample unspecified); 1.0 |
| Caseworkers in the control condition (no chatbot suggestions) had a mean accuracy of 49%. | Output Quality | null_result | high | caseworker accuracy (mean percent correct in control condition = 49%) | mean accuracy = 49% (control); 1.0 |
| Chatbot suggestions were artificially varied in aggregate accuracy across treatment conditions from low (53%) to high (100%). | Output Quality | null_result | high | manipulated chatbot suggestion accuracy (range 53%–100%) | chatbot suggestion accuracy manipulated across 53%–100%; 1.0 |
| Caseworker performance significantly improves as chatbot quality improves. | Output Quality | positive | high | caseworker accuracy as a function of chatbot suggestion quality | performance improves with chatbot quality (statistically significant); 1.0 |
| High-quality chatbots (96–100% accurate) improved caseworker accuracy by 27 percentage points. | Output Quality | positive | high | change in caseworker accuracy (percentage-point increase) when assisted by 96–100% accurate chatbot | 27 percentage-point improvement (96–100% accurate chatbot); 1.0 |
| At the question level, incorrect chatbot suggestions substantially reduce caseworker accuracy, with a two-thirds reduction on easy questions where the control group performed best. | Output Quality | negative | high | caseworker accuracy on easy questions when presented with incorrect chatbot suggestions (two-thirds reduction) | ≈66% reduction in accuracy on easy questions when chatbot suggestion incorrect; 1.0 |
| Improvements in caseworker accuracy level off as chatbot accuracy increases (an "AI underreliance plateau"). | Output Quality | mixed | medium | marginal improvement in caseworker accuracy as chatbot accuracy increases (diminishing returns / plateau) | diminishing marginal gains ("underreliance plateau"); 0.6 |
| LLM-based chatbots may offer a means to provide better, faster help to nonprofit caseworkers assisting clients with complex program eligibility. | Organizational Efficiency | positive | speculative | potential for improved/faster assistance (hypothesized benefit; not directly measured in this excerpt) | potential for better/faster assistance (hypothesized, not directly measured); 0.1 |