High-quality LLM suggestions raise caseworkers' accuracy on SNAP eligibility questions by about 27 percentage points, while incorrect suggestions substantially reduce it; gains level off above roughly 80% chatbot accuracy, revealing an "AI underreliance plateau" that complicates real-world deployment.

LLMs in social services: How does chatbot accuracy affect human accuracy?

Jennah Gosciak, Eric Giannella, Zhaowen Guo, Michael Chen, Allison Koenecke · March 11, 2026

arXiv · RCT · High evidence · 9/10 relevance

A randomized experiment with nonprofit SNAP caseworkers shows that high-quality LLM chatbot suggestions can raise caseworker accuracy on eligibility questions by ~27 percentage points, but incorrect suggestions can substantially harm performance and human gains plateau as chatbot accuracy increases.

Social service programs like the Supplemental Nutrition Assistance Program (SNAP, or food stamps) have eligibility rules that can be challenging to understand. For nonprofit caseworkers who often support clients in navigating a dozen or more complex programs, LLM-based chatbots may offer a means to provide better, faster help to clients whose situations may be less common. In this paper, we measure the potential effects of LLM-based chatbot suggestions on caseworkers' ability to provide accurate guidance. We first created a 770-question multiple-choice benchmark dataset of difficult, but realistic questions that a caseworker might receive. Next, using these benchmark questions and corresponding expert-verified answers, we conducted a randomized experiment with caseworkers recruited from nonprofit outreach organizations in Los Angeles. Caseworkers in the control condition did not see chatbot suggestions and had a mean accuracy of 49%. Caseworkers in the treatment condition saw chatbot suggestions that we artificially varied to range in aggregate accuracy from low (53%) to high (100%). Caseworker performance significantly improves as chatbot quality improves: high-quality chatbots (96-100% accurate) improved caseworker accuracy by 27 percentage points. At the question-level, incorrect chatbot suggestions substantially reduce caseworker accuracy, with a two-thirds reduction on easy questions where the control group performed best (without chatbot suggestions). Finally, improvements in caseworker accuracy level off as chatbot accuracy increases, a phenomenon that we call the "AI underreliance plateau," which is a concern for real-world deployment and highlights the importance of evaluating human-in-the-loop tools with their users.

Summary

Main Finding

LLM-based chatbot suggestions can substantially improve nonprofit caseworkers' accuracy in giving benefits eligibility guidance, but gains depend strongly on chatbot quality. High-quality chatbots (≈96–100% accurate) raised caseworker accuracy by about 27 percentage points, while incorrect chatbot suggestions can substantially harm performance. As chatbot accuracy rises, additional caseworker gains diminish — an "AI underreliance plateau" that limits returns from further model improvements unless human-AI interaction is addressed.

Key Points

Benchmark: The authors built a 770-question multiple-choice benchmark of realistic, difficult questions a caseworker might receive about programs like SNAP.
Experiment design: Randomized controlled trial with nonprofit caseworkers in Los Angeles. Ground truth answers were expert-verified.
Baseline: Control (no chatbot suggestions) mean accuracy = 49%.
Treatment: Chatbot suggestions were shown to caseworkers; the suggestions were artificially varied to have aggregate accuracies between 53% (low) and 100% (high).
Positive effect: High-quality chatbots (96–100% accurate) increased caseworker accuracy by ~27 percentage points relative to control.
Negative effect: Incorrect chatbot suggestions reduced caseworker accuracy substantially; on "easy" questions (where control performance was best), incorrect suggestions caused about a two-thirds reduction in accuracy — evidence of anchoring/automation bias.
Diminishing returns: Improvements in caseworker accuracy level off as chatbot accuracy increases ("AI underreliance plateau"), indicating that even very accurate models may not produce proportional human performance gains.
Implicit tradeoff: Both the probability of model error and human susceptibility to wrong suggestions matter; average model accuracy alone is not sufficient to predict net benefit.
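
The tradeoff in the last point can be made concrete with a simple decomposition. The sketch below is not the authors' model: it assumes that a caseworker's accuracy depends only on whether the suggestion shown is correct, through two hypothetical parameters, `acc_if_right` and `acc_if_wrong`. Only the 49% control baseline comes from the study; the other two values are placeholders chosen for illustration.

```python
def expected_accuracy(p_chatbot: float, acc_if_right: float, acc_if_wrong: float) -> float:
    """Expected caseworker accuracy when a fraction p_chatbot of suggestions is correct."""
    return p_chatbot * acc_if_right + (1.0 - p_chatbot) * acc_if_wrong


def break_even_chatbot_accuracy(baseline: float, acc_if_right: float, acc_if_wrong: float) -> float:
    """Chatbot accuracy at which assisted accuracy equals the unassisted baseline."""
    if acc_if_right <= acc_if_wrong:
        raise ValueError("acc_if_right must exceed acc_if_wrong")
    return (baseline - acc_if_wrong) / (acc_if_right - acc_if_wrong)


BASELINE = 0.49      # control-group mean accuracy reported in the study
ACC_IF_RIGHT = 0.80  # hypothetical: accuracy when the suggestion shown is correct
ACC_IF_WRONG = 0.20  # hypothetical: accuracy when the suggestion shown is incorrect

# Below this chatbot accuracy, the assisted caseworker would do worse than control.
threshold = break_even_chatbot_accuracy(BASELINE, ACC_IF_RIGHT, ACC_IF_WRONG)
print(f"break-even chatbot accuracy: {threshold:.3f}")
```

Under these assumptions the same average chatbot accuracy can help or hurt depending on how far `acc_if_wrong` falls below the baseline. A linear decomposition like this cannot produce a plateau by itself; it only illustrates why both the error rate and the response to errors matter.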

Data & Methods

Dataset: 770 multiple-choice items designed to reflect difficult, realistic eligibility questions for social safety-net programs (e.g., SNAP).
Ground truth: Answers verified by subject-matter experts.
Participants: Caseworkers recruited from nonprofit outreach organizations in Los Angeles (field-relevant users).
Intervention: Randomized assignment to control (no suggestions) or treatment (saw chatbot suggestions). Treatment chatbot outputs were simulated and systematically varied to produce different aggregate accuracies (53% up to 100%) so causal effects of suggestion quality could be measured.
Outcomes: Caseworker answer accuracy overall and at the question level; analysis of how correct vs incorrect suggestions affected performance and how gains scaled with chatbot accuracy.
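
One way to simulate suggestions with a fixed aggregate accuracy is sketched below. This is an assumption about how such a manipulation could be implemented, not the authors' procedure: the function name, the four answer options, and the seeded random selection are all hypothetical. The 45-question assessment length and the 53% arm are taken from the study description.

```python
import random


def assign_suggestions(correct_answers, n_options, target_accuracy, seed=0):
    """Return one suggested option per question, correct on round(target * n) questions."""
    rng = random.Random(seed)
    n = len(correct_answers)
    n_correct = round(target_accuracy * n)
    # Randomly pick which questions receive the correct suggestion.
    correct_idx = set(rng.sample(range(n), n_correct))
    suggestions = []
    for i, answer in enumerate(correct_answers):
        if i in correct_idx:
            suggestions.append(answer)
        else:
            # Hard-code a wrong suggestion by choosing any other option.
            wrong = [opt for opt in range(n_options) if opt != answer]
            suggestions.append(rng.choice(wrong))
    return suggestions


def aggregate_accuracy(suggestions, correct_answers):
    """Fraction of suggestions that match the answer key."""
    hits = sum(s == a for s, a in zip(suggestions, correct_answers))
    return hits / len(correct_answers)


# Hypothetical answer key for a 45-question assessment with 4 options per question.
answer_key = [i % 4 for i in range(45)]
suggestions = assign_suggestions(answer_key, n_options=4, target_accuracy=0.53)
print(f"realized suggestion accuracy: {aggregate_accuracy(suggestions, answer_key):.3f}")
```

Because the number of correct suggestions must be a whole number, the realized accuracy on a 45-question draw can only approximate the target (here 24 of 45 questions).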

Implications for AI Economics

Value of accuracy is nonlinear: Returns to investing in model accuracy are large up to a point (substantial gains from low→high accuracy) but exhibit diminishing marginal benefits because humans underuse even highly accurate suggestions. Economic models of AI value should incorporate human adoption/underreliance dynamics, not just model performance.
Externalities of errors: Incorrect suggestions can impose large negative externalities (reduced worker accuracy, potential harm to vulnerable clients). Cost–benefit analyses must weigh the asymmetric harm of errors, particularly in high-stakes public-service settings.
Design and deployment matter economically: Interfaces, uncertainty communication, worker training, and trust calibration (e.g., show confidences, provide explanations, allow easy verification) could increase effective uptake and shift the underreliance plateau, raising realized returns to model improvements.
Incentives & regulation: Procurement and oversight of assistive AI in social services should set accuracy thresholds, require evaluation with real users, and monitor worst-case error modes. Contracts or subsidies to improve UI/education may be as important as spending to improve base model accuracy.
Labor productivity and distributional effects: High-quality assistive AI can raise caseworker productivity and expand access to accurate guidance, especially for uncommon client situations. But intermediate-accuracy systems risk making outcomes worse for clients of less-experienced workers or on straightforward cases if errors anchor decisions.
Measurement for policy: Evaluations for public-sector AI should use user-in-the-loop randomized trials (like this study) and question-level analysis, rather than model-only benchmarks, to estimate true social value and risks.
Research priorities: Study interventions to overcome underreliance (training, calibrated confidence scores, explanation quality), measure downstream client outcomes and welfare, and quantify the economic threshold where additional model accuracy yields negligible incremental social benefit.
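
The question-level analysis recommended above can be sketched as a grouping of responses by whether the suggestion shown was correct, incorrect, or absent. The record layout and the toy responses below are hypothetical and do not reproduce the study's data; a real analysis would also need standard errors that account for repeated answers from the same participant.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple, Optional


class Response(NamedTuple):
    participant_id: str
    suggestion_correct: Optional[bool]  # None means control (no suggestion shown)
    answered_correctly: bool


def accuracy_by_suggestion(responses: Iterable[Response]) -> dict:
    """Mean caseworker accuracy split by the status of the suggestion shown."""
    counts = defaultdict(lambda: [0, 0])  # label -> [n_correct, n_total]
    for r in responses:
        if r.suggestion_correct is None:
            label = "no_suggestion"
        elif r.suggestion_correct:
            label = "correct_suggestion"
        else:
            label = "incorrect_suggestion"
        counts[label][0] += int(r.answered_correctly)
        counts[label][1] += 1
    return {label: hits / total for label, (hits, total) in counts.items()}


# Toy records for illustration only.
toy = [
    Response("c1", None, True),
    Response("c1", None, False),
    Response("t1", True, True),
    Response("t1", True, True),
    Response("t1", False, False),
    Response("t2", False, True),
]
result = accuracy_by_suggestion(toy)
print(result)
```

Comparing the "incorrect_suggestion" group against "no_suggestion" on the same questions is what separates the harm from wrong suggestions from the average effect of the tool.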

Assessment

Paper Type: RCT
Evidence Strength: high. Causal identification is strong because of random assignment to see or not see suggestions and randomized variation in chatbot accuracy; outcome measurement is direct (answers on expert-verified benchmark). Limitations that temper external validity (modest sample size, single city/sector, simulated rather than live LLM outputs, and multiple-choice format) prevent labeling this definitive for all real-world deployments, but the internal causal evidence is strong.
Methods Rigor: high. The study constructs an expert-verified 770-question benchmark, confirms answers with multiple experts, uses randomized assignment and multiple treatment accuracy levels, and reports robustness checks (e.g., attention checks, difficulty reclassification); shortcomings include a modest sample (125), a non-trivial fraction of participants failing an attention check (authors include all participants), reliance on simulated/hard-coded chatbot suggestions rather than live models, and limited jurisdictional scope (California CalFresh), which the authors acknowledge.
Sample: 125 nonprofit caseworkers recruited from outreach organizations in Los Angeles (average 4.06 years experience); each participant completed a 45-question assessment randomly drawn from a 770-question, expert-verified CalFresh (California SNAP) multiple-choice benchmark; control n=31, treatment n=94 with 10 different aggregate chatbot-accuracy arms (53%–100%); data collected May–July 2025.
Themes: human_ai_collab, productivity, adoption, skills_training, governance
Identification: Randomized controlled experiment: participants (n=125 caseworkers) randomly assigned to control (no chatbot suggestions, n=31) or treatment (see chatbot suggestions, n=94) with treatment arms varying the aggregate chatbot accuracy (10 levels from 53% to 100% correct) via hard-coded correct/incorrect suggestions; outcomes are caseworker accuracy on 45-question draws from an expert-verified 770-question SNAP benchmark.
Generalizability: Sample is geographically concentrated (Los Angeles) and limited to nonprofit caseworkers, not state eligibility workers or national samples; questions are specific to California CalFresh rules and may not generalize to other programs, states, or countries; treatment used simulated/hard-coded chatbot suggestions rather than live LLMs, so interaction dynamics and timing in real deployments may differ; multiple-choice benchmark tasks differ from conversational, open-ended client interactions (differences in interface, follow-up, and context); modest sample size limits precision for subgroup analyses and rare question types; noncompliance and attention-check failures may affect ecological validity.

Claims (10)

Claim	Category	Direction	Confidence	Outcome	Details	Confidence score
We created a 770-question multiple-choice benchmark dataset of difficult, but realistic questions that a caseworker might receive.	Other	null_result	high	benchmark dataset size and content (770 multiple-choice questions)	n=770 770 multiple-choice questions (benchmark size)	1.0
The benchmark questions have corresponding expert-verified answers.	Other	null_result	high	availability of expert-verified reference answers for benchmark questions	n=770 expert-verified answers available for benchmark	1.0
We conducted a randomized experiment with caseworkers recruited from nonprofit outreach organizations in Los Angeles.	Other	null_result	high	execution of a randomized experiment with nonprofit caseworker participants (location: Los Angeles)	randomized experiment conducted (sample unspecified)	1.0
Caseworkers in the control condition (no chatbot suggestions) had a mean accuracy of 49%.	Output Quality	null_result	high	caseworker accuracy (mean percent correct in control condition = 49%)	mean accuracy = 49% (control)	1.0
Chatbot suggestions were artificially varied in aggregate accuracy across treatment conditions from low (53%) to high (100%).	Output Quality	null_result	high	manipulated chatbot suggestion accuracy (range 53%–100%)	chatbot suggestion accuracy manipulated across 53%–100%	1.0
Caseworker performance significantly improves as chatbot quality improves.	Output Quality	positive	high	caseworker accuracy as a function of chatbot suggestion quality	performance improves with chatbot quality (statistically significant)	1.0
High-quality chatbots (96–100% accurate) improved caseworker accuracy by 27 percentage points.	Output Quality	positive	high	change in caseworker accuracy (percentage-point increase) when assisted by 96–100% accurate chatbot	27 percentage-point improvement (96–100% accurate chatbot)	1.0
At the question level, incorrect chatbot suggestions substantially reduce caseworker accuracy, with a two-thirds reduction on easy questions where the control group performed best.	Output Quality	negative	high	caseworker accuracy on easy questions when presented with incorrect chatbot suggestions (two-thirds reduction)	≈66% reduction in accuracy on easy questions when chatbot suggestion incorrect	1.0
Improvements in caseworker accuracy level off as chatbot accuracy increases (an "AI underreliance plateau").	Output Quality	mixed	medium	marginal improvement in caseworker accuracy as chatbot accuracy increases (diminishing returns / plateau)	diminishing marginal gains ("underreliance plateau")	0.6
LLM-based chatbots may offer a means to provide better, faster help to nonprofit caseworkers assisting clients with complex program eligibility.	Organizational Efficiency	positive	speculative	potential for improved/faster assistance (hypothesized benefit; not directly measured in this excerpt)	potential for better/faster assistance (hypothesized, not directly measured)	0.1