Output Quality
Evidence strength: mixed. Several RCTs show sizable quality gains from high-accuracy AI and from user training, while many domains show variable or null effects, and much of the evidence is observational.
Bottom Line
High-accuracy AI assistance increases human accuracy on complex judgment tasks, and training users to work with AI raises performance beyond mere access Gosciak (2026), Chen (2026). The biggest caveats are that incorrect AI suggestions can sharply reduce quality, and in some technical domains measured gains are small or absent despite widespread adoption Gosciak (2026), Jost (2026).
What This Means in Practice
- Train for calibrated AI use, not just access. Structured training raised performance; untrained access did not, and bad suggestions reduced accuracy when users over-trusted them Chen (2026), Gosciak (2026). Budget for onboarding and misuse mitigation, not just licenses.
- Build verification into workflows when AI can be wrong. When the chatbot was wrong, human accuracy fell, and gains flatten even at very high AI accuracy Gosciak (2026). Require evidence checks, source links, or second-pass reviews on high-stakes outputs.
- Match prompt structure to task ambiguity. Use structured intent when goals are fuzzy; keep prompts simple when goals are clear. Benefits concentrated in ambiguous tasks and were small or negative in low-ambiguity tasks Gang (2026).
- In software engineering, expect mixed quality impacts and test locally. Secure coding did not improve with AI assistance; code-review models missed most human-flagged issues and degraded with more context; injected “skills” often did not help and sometimes harmed; workflow scaffolding choices changed results modestly Jost (2026), Kumar (2026), Han (2026), Peng (2026). Run A/Bs in your repo and adopt guardrails before scaling.
- Favor human-in-the-loop models for nuanced judgments. Fully automated customer service and underwriting can trade off service quality and trust; AI misses localized factors; agent code contributions are associated with more churn Horn (2026), Naik (2026), Popescu (2026). Keep expert review on nuanced calls and monitor rework rates.
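The verification and human-in-the-loop points above amount to a triage rule: route AI output to a human whenever the stakes are high, the evidence trail is missing, or confidence is low. A minimal sketch of such a gate, with illustrative field names and an illustrative confidence threshold (none of these come from the cited studies):

```python
from dataclasses import dataclass, field

@dataclass
class DraftOutput:
    text: str
    sources: list = field(default_factory=list)  # evidence links attached by the assistant
    model_confidence: float = 0.0                # self-reported or calibrated score
    high_stakes: bool = False                    # e.g. eligibility or underwriting calls

def route(output: DraftOutput, confidence_floor: float = 0.9) -> str:
    """Return 'auto' only when evidence and confidence checks pass; otherwise escalate."""
    if output.high_stakes:
        return "human_review"   # nuanced calls always get a second pass
    if not output.sources:
        return "human_review"   # no evidence links, so no downstream verification possible
    if output.model_confidence < confidence_floor:
        return "human_review"
    return "auto"
```

The point of making the rule explicit is that it can be audited and A/B tested, and rework rates can be monitored per routing branch.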
What the Research Finds
Human–AI collaboration quality hinges on AI accuracy and user training
- In a randomized controlled trial (RCT) with social-service caseworkers, top-tier chatbot accuracy raised human accuracy by 27 percentage points; when the bot was wrong, human accuracy fell sharply, especially on easy questions Gosciak (2026).
- Training to work with a large language model (LLM) was associated with higher exam scores; optional access without training did not improve scores and led to shorter answers Chen (2026).
- Human gains plateaued even with very accurate AI, suggesting that under-reliance or uncorrected human errors limit the upside Gosciak (2026).
Interfaces and structured prompting shape alignment, with domain-dependent payoffs
- Structured intent formats improved goal alignment in ambiguous business-analysis tasks but reversed in low-ambiguity travel tasks Gang (2026).
- Across three structured frameworks, goal alignment was similarly high, implying task decomposition—not the specific template—drove results Gang (2026).
- Scoring bias: unconstrained prompts inflated constraint-adherence scores, masking structured prompting’s practical value Gang (2026).
- Prompt design changes shifted agent accuracy, efficiency, coordination, and error rates in simulations Khan.
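A structured-intent prompt of the kind these studies compare can be sketched as a template that makes goal, constraints, and success criteria explicit. The field names below are illustrative; Gang (2026) does not publish this exact format:

```python
def structured_intent_prompt(goal: str, constraints: list[str], success_criteria: list[str]) -> str:
    """Render an explicit intent specification instead of a free-form request.
    The findings suggest the decomposition itself, not the particular template, drives gains."""
    lines = [f"Goal: {goal}", "Constraints:"]
    lines += [f"- {c}" for c in constraints]
    lines.append("Success criteria:")
    lines += [f"- {s}" for s in success_criteria]
    return "\n".join(lines)
```

Per the findings above, reserve this for ambiguous tasks; when the goal is already clear, a plain one-line request did as well or better.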
Software engineering quality: limited gains and surprising failure modes
- An RCT found using Gemini did not increase secure coding quality versus no AI; programming experience improved security and could not be substituted by the tool Jost (2026).
- On a pull-request (PR) benchmark, eight frontier models detected only 15–31% of human-flagged issues; performance worsened as more context was provided, and the top four models were statistically indistinguishable Kumar (2026).
- Injecting prebuilt “skills” into coding agents yielded limited benefits: most skills showed no improvement and several degraded performance due to context/version mismatches Han (2026).
- A smart-contract security benchmark reported no end-to-end exploit success on contamination-free incidents; workflow scaffolding choices shifted results by a few percentage points Peng (2026).
Labeling and calibration pipelines can raise downstream model reliability
- A field RCT in rare-event labeling found that eliciting probabilities and applying a simple statistical recalibration at worker and crowd levels improved classification performance and probability calibration Epping (2026).
- Higher-quality labels improved the out-of-sample reliability of a convolutional neural network (CNN) trained on them Epping (2026).
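Epping (2026)'s exact recalibration procedure is not specified here; a common simple form of "statistical recalibration" is an intercept-only correction in log-odds space, shifting stated probabilities until their average matches the observed base rate on calibration data. A toy sketch under that assumption:

```python
import math

def logit(p: float, eps: float = 1e-6) -> float:
    p = min(max(p, eps), 1 - eps)  # clamp away from 0/1 to keep log-odds finite
    return math.log(p / (1 - p))

def fit_shift(probs: list[float], outcomes: list[int]) -> float:
    """Estimate one additive bias in log-odds space: the gap between the
    outcome base rate and the mean stated log-odds on labeled calibration items."""
    base_rate = sum(outcomes) / len(outcomes)
    mean_stated = sum(logit(p) for p in probs) / len(probs)
    return logit(base_rate) - mean_stated

def recalibrate(p: float, shift: float) -> float:
    """Apply the fitted shift and map back to a probability."""
    z = logit(p) + shift
    return 1 / (1 + math.exp(-z))
```

The same fit can be done per worker and again at the crowd level; richer choices (Platt scaling with a slope, isotonic regression) follow the same fit-then-apply pattern.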
Domain-specific outcomes span empathy, search quality, and professional judgment
- In blinded evaluations, LLM-generated replies were frequently scored as more empathic than human-written responses Kumar (2026).
- Generative search improved manual page “good rate” by 1.65% at scale Chen (2026).
- In real estate underwriting, AI handled computations but missed nuanced market factors, supporting a hybrid workflow Naik (2026).
- In the wild, agent code contributions were associated with more churn (rework) over time than human-authored code Popescu (2026).
- Giving models web access degraded predictions for already-accurate models while modestly helping weaker ones; models did better on common topics than specialized health data Hobor (2026).
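Churn (rework) can be operationalized in several ways; Popescu (2026)'s exact metric is not reproduced here. One simple definition is the fraction of a commit's added lines that are later modified or deleted within an observation window, sketched below over abstract line identifiers rather than real git output:

```python
def churn_rate(added_lines: set, later_touched: set) -> float:
    """Fraction of a commit's added lines that were re-modified or removed
    within the observation window (higher means more rework)."""
    if not added_lines:
        return 0.0
    return len(added_lines & later_touched) / len(added_lines)

# Hypothetical comparison: (lines added, lines later reworked) per author type.
commits = {
    "agent": ({"a1", "a2", "a3", "a4"}, {"a1", "a3"}),  # 2 of 4 lines reworked
    "human": ({"h1", "h2", "h3", "h4"}, {"h2"}),        # 1 of 4 lines reworked
}
rates = {who: churn_rate(added, touched) for who, (added, touched) in commits.items()}
```

In practice the line sets would come from diff parsing (e.g. `git log --numstat` plus blame tracking); monitoring this rate by author type is one concrete way to act on the finding.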
What We Still Don't Know
- Why more context degrades LLM code review performance is unknown; studies document the drop but do not identify mechanisms Kumar (2026).
- How well structured prompting gains travel to real settings is unclear; many tests rely on LLM judges and show scoring asymmetries with unconstrained prompts Gang (2026).
- Long-run, field-based causal evidence on AI's impact on production software quality and maintenance burdens is sparse relative to short-run lab studies and benchmarks Popescu (2026), Peng (2026).
- Whether calibrated training can fully offset the large accuracy losses from incorrect AI suggestions in public-service decisions is not established beyond single-session RCTs Gosciak (2026), Chen (2026).