The Commonplace

Human–AI teams beat either humans or models alone, but miscalibrated trust still costs them: participants missed 3.9% of correct AI suggestions and followed misleading AI 1.7% of the time, especially when the AI confirmed a human's initial wrong answer.

AI, Take the Wheel: What Drives Delegation and Trust in Human-Computer Cooperative Question Answering?
Maharshi Gor, Yoo Yeon Sung, Yu Hou, Eve Fleisig, Irene Ying, Tianyi Zhou, Jordan Boyd-Graber · May 27, 2026
arXiv · correlational · medium evidence · 7/10 relevance
In a competitive QA game, human–AI teams outperform either humans or AI alone, but humans both under-rely on correct AI suggestions (missed 3.9% of opportunities) and over-rely on misleading AI (1.7%), with confirmation bias and weak confidence calibration driving errors.

AI systems are fallible, and humans can make mistakes in deciding whether to trust AI over their own judgment. Thus, improving human-AI collaboration requires understanding when, why, and how humans decide to rely on AI. We study two distinct reliance decisions: the delegation choice (deciding when to let AI act autonomously without knowing its output) and the adoption choice (evaluating AI suggestions and deciding how to use them). Both of these decoupled reliance patterns shape collaboration, but prior work rarely studies them together in realistic settings with the same users. We address this gap by studying collaborative human–AI teams competing in a question-answering game in which humans can choose when and how to work with AI agents to win. Our 24 matches pair 23 expert humans with 16 AI agents, capturing 387 delegation and 1440 adoption decisions. While human–AI collaboration performs better than either AI or humans alone, humans make suboptimal collaboration decisions, both under-relying on correct AI suggestions (3.9% of opportunities missed) and over-relying when AI misleads them (1.7%). Both parties contribute wrong answers: reported model confidence is near chance when humans and AI disagree, while confirmation bias drives higher under-reliance (64.5%) when an AI suggestion agrees with humans' initial incorrect answer. To close this gap, we recommend calibrated confidence, evidence-grounded explanations, and mechanisms that help users refine trust.

Summary

Main Finding

In a live competitive trivia tournament pairing 23 expert humans with 16 heterogeneous AI agents (24 matches; 20 tossups + 20 three-part bonuses per game), human–AI teams outperformed either humans or AI alone, but humans systematically miscalibrated their trust. Two distinct reliance modes matter: proactive delegation (deciding whether to leave AI teammates active or mute them on tossups) and deliberative adoption (accepting or rejecting AI suggestions on bonuses). Under-reliance, meaning failure to adopt a correct AI suggestion, was more common (3.9% of opportunities missed) than over-reliance, meaning following a misleading AI suggestion (1.7%). Two key drivers of errors were (a) poorly calibrated model confidence (reported confidence is near chance at identifying the right answer when humans and AI disagree) and (b) human confirmation bias (when an AI suggestion agrees with the team's initially incorrect answer, under-reliance rises to ~64.5%). Evidence-grounded explanations reduced some mistakes, and the authors recommend calibrated confidence as a further remedy; teams also improved across rounds and sometimes reached correct answers even when neither side was initially correct.

Key Points

  • Experimental scale and outcomes

    • Participants: 23 experienced trivia players; 16 distinct AI agents.
    • Matches: 24 games (each with 20 tossups and 20 three-part bonuses).
    • Behavioral traces: 387 delegation (tossup/muting) decisions and 1,440 adoption (bonus) decisions recorded.
    • Collaboration produced better accuracy than either humans or AIs alone; teams solved 5.5% of items where neither humans nor AI were initially correct.
  • Delegation (tossup / muting)

    • Humans buzzed before AI in 17.9% of tossups.
    • Human buzzes had lower error rates than AIs (20.0% incorrect vs 29.4%).
    • Teams could mute AI teammates between rounds; muting behavior was context-sensitive (muting rates varied 30–100% across round themes).
    • Muting generally helped: eight of nine teams gained net tossup points by muting vs leaving AIs active; teams captured ~79% of an oracle’s potential muting benefit.
    • Calibration gaps: only ~9% of muting decisions were optimal; many teams muted later than an oracle would recommend (late muting was common), indicating imperfect timing of delegation.
  • Adoption (bonus / switching)

    • Bonus workflow: humans give an initial consensus guess, then see two AI answers with confidence scores and textual explanations, then give a final answer.
    • Net error asymmetry: under-reliance (missing correct AI suggestions) was more frequent and costlier than over-reliance.
    • Confirmation bias effect: when an AI suggestion agreed with an initially incorrect human guess, teams kept the incorrect answer far more often (under-reliance of ~64.5% in these cases).
    • Confidence calibration fails in disagreement: relying on reported model confidence to pick the right answer when humans and AI differ performed near chance.
    • Explanation usefulness: evidence-grounded explanations (those citing specific question clues) increased abandonment of incorrect human answers by ~12 percentage points.
    • Teams improved across rounds: inaccuracies in bonus decisions fell from 28% to 18%.
  • AI heterogeneity and drafting

    • Agents were architecturally diverse (single-call prompts to multi-step verification pipelines; base models included GPT-4.1, Claude 3.5, etc.), with agent accuracies from ~30–80% on the question set.
    • Teams drafted two AI teammates per game (opaque nicknames), letting experience with each agent inform future drafting—this simulates realistic selection/adoption dynamics rather than a single fixed system.
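
The under- and over-reliance rates quoted in the adoption bullets can be made concrete with a small tally. The sketch below is illustrative only: the `BonusDecision` record layout, the label names, and the simplified definitions in `classify` are assumptions made for this example, not the paper's schema or its exact operational definitions.

```python
from collections import Counter
from dataclasses import dataclass

LABELS = ("under_reliance", "over_reliance", "solved_jointly", "other")


@dataclass(frozen=True)
class BonusDecision:
    """One adoption decision. Hypothetical record layout, not the paper's schema."""
    initial_correct: bool          # team's initial consensus answer was right
    ai_correct: tuple[bool, bool]  # correctness of the two AI suggestions
    final_correct: bool            # team's final answer was right


def classify(d: BonusDecision) -> str:
    """Label a decision using simplified working definitions (assumptions)."""
    # Team started right, saw at least one wrong AI answer, and ended wrong.
    if d.initial_correct and not d.final_correct and not all(d.ai_correct):
        return "over_reliance"
    # Team started wrong, a correct AI answer was available, and it ended wrong.
    if not d.initial_correct and any(d.ai_correct) and not d.final_correct:
        return "under_reliance"
    # Nobody was right at first, yet the final answer was correct.
    if not d.initial_correct and not any(d.ai_correct) and d.final_correct:
        return "solved_jointly"
    return "other"


def reliance_rates(decisions: list[BonusDecision]) -> dict[str, float]:
    """Share of all decisions that fall under each label."""
    total = len(decisions)
    if total == 0:
        return {label: 0.0 for label in LABELS}
    counts = Counter(classify(d) for d in decisions)
    return {label: counts[label] / total for label in LABELS}


# Toy data, invented for illustration; not taken from the study.
sample = [
    BonusDecision(False, (True, False), False),
    BonusDecision(True, (False, False), False),
    BonusDecision(False, (False, False), True),
    BonusDecision(True, (True, True), True),
]
rates = reliance_rates(sample)
```

Dividing by all decisions mirrors the "share of opportunities" framing; a conditional rate such as the ~64.5% figure would instead divide by only those decisions where an AI suggestion agreed with an incorrect initial answer, which requires answer text that this simplified record does not carry.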

Data & Methods

  • Task and domain

    • Competitive Quizbowl-style trivia (tossups: interruptible, time-pressured single-answer buzzes; bonuses: three-part team questions with deliberation).
    • Adversarial question design: human-in-the-loop authorship to produce questions that are non-trivial for both humans and current AIs and that exploit complementary strengths/weaknesses.
  • Experimental setup

    • Two tournaments (one in-person, one online), 24 games total across themed packets targeting known weaknesses.
    • Drafting: serpentine draft of AI teammates each round; teams observe AI behavior over rounds and adjust choices.
    • Interaction protocols: muting option for AIs (affects tossup buzzing only); bonus phase records human initial guess, two AI suggestions (with confidence + free-text explanation), and final team consensus.
  • Measurements and analysis

    • Logged behavioral traces: who buzzed and when, mute states, AI answers/confidences/explanations, initial/final human answers, and correctness.
    • Counterfactual & oracle analyses for muting: estimated what would have happened under alternative muting policies and computed an oracle ceiling (perfect foresight).
    • Statistical tests: chi-squared and McNemar’s tests used for key comparisons; effect sizes reported in paper.
    • Agent diversity: 16 agents built via an open competition, varying base models and pipelines; full specs in appendix.
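
As a rough illustration of the McNemar's test mentioned above, the sketch below compares paired correctness of initial and final answers on the same bonus parts. It uses the exact (binomial) form of the test and only the standard library; the function names and the pairing of initial vs. final answers are assumptions for this example, and the paper may have applied the test to different comparisons or used the chi-squared approximation.

```python
from math import comb


def discordant_counts(initial: list[bool], final: list[bool]) -> tuple[int, int]:
    """Count paired items whose correctness changed between initial and final."""
    if len(initial) != len(final):
        raise ValueError("initial and final must have the same length")
    gained = sum(1 for i, f in zip(initial, final) if not i and f)  # wrong -> right
    lost = sum(1 for i, f in zip(initial, final) if i and not f)    # right -> wrong
    return gained, lost


def mcnemar_exact(gained: int, lost: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = gained + lost
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a change
    k = min(gained, lost)
    # Under the null, each discordant pair is equally likely to go either way.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy paired outcomes, invented for illustration.
initial = [False, False, True, True]
final = [True, False, False, True]
gained, lost = discordant_counts(initial, final)
p_value = mcnemar_exact(gained, lost)
```

Only the discordant pairs enter the test, which is why it suits before/after comparisons on the same items: questions the team got right (or wrong) both times carry no information about whether seeing AI suggestions changed accuracy.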

Implications for AI Economics

  • Trust, adoption, and deployment decisions are dynamic economic choices

    • Delegation (letting AI act autonomously) and adoption (accepting AI output after review) are distinct economic actions with different risk/benefit profiles. Platforms, firms, and policymakers should treat them separately when modeling adoption costs and benefits, designing contracts, or measuring labor substitution/complementarity.
  • Calibration and information design affect effective value

    • Poorly calibrated confidence signals substantially reduce the informational value of AI output in mixed human–AI decision settings. For markets and product managers, investing in calibration (e.g., better probability estimates, uncertainty quantification) yields direct economic returns by enabling better delegation and higher-value adoption decisions.
  • Explanations have measurable economic benefit but must be evidence-grounded

    • Explanations that cite concrete evidence from the input increase correct switching. For firms selling AI tools to skilled users, emphasizing evidence-grounded explainability can raise effective productivity and reduce error costs; conversely, surface-level explanations (quotes, style cues) may generate misplaced trust and should not be treated as sufficient.
  • Heterogeneous agent markets and selection frictions

    • Allowing users to draft/select among heterogeneous agents mirrors real-market choice among AI vendors. Selection frictions, learning-by-observing, and reputation dynamics matter: weaker teams were given earlier picks to offset skill gaps (serpentine draft), showing that allocation mechanisms and market design (pricing, bundling, reputational signals) shape access to high-quality AI and thus welfare outcomes.
  • Labor complementarities and skill premia

    • The observed complementarity (questions that are easier for humans vs AI) implies tasks can be decomposed to exploit comparative advantages rather than full substitution. Economically, this supports models where AI raises productivity of skilled workers (task augmentation), but also highlights vulnerability: high-skill workers displayed stronger confirmation bias, suggesting that augmentative gains may be limited by cognitive biases unless addressed via UI/controls and training.
  • Mechanism and contract design to align incentives

    • Penalties for wrong tossup answers (-5 points) changed delegation incentives; similar cost structures (liability, insurance, performance-based pay) will shape real-world delegation. Firms should design contracts and platform incentives to internalize miscalibration risks—e.g., trial periods, performance-based pricing, insurance layers for high-stakes delegation.
  • Policy and standardization priorities

    • Regulators and standards bodies aiming to reduce systemic risks should: (1) require or incentivize model confidence calibration reporting, (2) standardize evidence-grounded explanation formats that demonstrably improve human correction, and (3) support mechanisms for dynamic user learning (auditable logs, transparent agent histories) because users update beliefs based on observed agent behavior.
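
The point about the -5 penalty shaping delegation incentives can be illustrated with a break-even calculation. This is a back-of-the-envelope sketch, not the paper's analysis: the +10 reward for a correct tossup is an assumption (the conventional Quizbowl value, not stated in this summary), and the model ignores the opportunity cost of a human answering instead.

```python
def expected_buzz_points(p_correct: float, reward: float = 10.0,
                         penalty: float = 5.0) -> float:
    """Expected points from one AI buzz, given its probability of being right.

    reward=10.0 is an assumed value; penalty=5.0 reflects the -5 for a wrong
    tossup answer mentioned in the text.
    """
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must be a probability in [0, 1]")
    return p_correct * reward - (1.0 - p_correct) * penalty


def break_even_accuracy(reward: float = 10.0, penalty: float = 5.0) -> float:
    """Accuracy at which an unmuted AI's expected buzz value is exactly zero."""
    return penalty / (reward + penalty)


threshold = break_even_accuracy()
weak_agent = expected_buzz_points(0.30)    # low end of the reported accuracy range
strong_agent = expected_buzz_points(0.80)  # high end of the reported accuracy range
```

Under these assumptions the break-even accuracy is one third, so an agent near the low end of the reported ~30–80% accuracy range would have negative expected value when left unmuted, while one near the high end would clearly pay off. Raising the penalty relative to the reward pushes the threshold up, which is the sense in which liability or pricing structures could shift delegation behavior.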

Practical recommendations for product managers and economists modeling adoption:

  • Prioritize calibrated confidence estimates and measure their impact on downstream human decisions.
  • Provide evidence-grounded explanations (pointer to supporting clues or sources) rather than only surface features or stylized rationales.
  • Design interfaces that separate delegation choices from post-hoc adoption (different signals and controls), and track per-agent performance to support selection markets.
  • Account for human cognitive biases (confirmation bias, overconfidence in experts) in training, UI nudges, and contract incentives.
  • Use heterogeneous-agent auctions/drafts or reputation systems to allocate scarce, high-quality AI capacity in a way that balances equity and efficiency.

Overall, the paper shows that improving AI economics (adoption, pricing, contract design, labor impacts) depends less on raw model accuracy and more on the information design (calibration and explanations), selection mechanisms, and behavioral responses of human users.

Assessment

Paper Type: correlational

Evidence Strength: medium. Direct behavioral data from human–AI interactions provide concrete evidence about reliance patterns, but the sample is small, non-randomized, and collected in a specific competitive game setting limiting causal interpretation and external validity.

Methods Rigor: medium. Careful logging of delegation and adoption decisions across many trials and use of expert participants strengthen internal measurement, but there is no clear randomization or pre-registered intervention, limited sample size, and possible selection and task framing biases.

Sample: 24 matches pairing 23 expert human participants with 16 AI agents in a question-answering competitive game, yielding 387 delegation decisions and 1,440 adoption decisions; participants are described as experts but demographic/selection details and AI model specifics are not provided in the summary.

Themes: human_ai_collab, adoption

Generalizability

  • Small sample of expert participants may not represent general workers or novices
  • Laboratory/competitive game environment may not reflect real-world workplace tasks or longitudinal interactions
  • AI agents used in study may differ from deployed production systems in capability and interface
  • Short-term interactions; does not capture learning or trust dynamics over extended use
  • Cultural, organizational, and domain-specific factors that shape trust are not represented

Claims (8)

Claim | Direction | Confidence | Outcome | Details
We ran 24 matches pairing 23 expert humans with 16 AI agents, capturing 387 delegation and 1440 adoption decisions. | null_result | high | Adoption Rate: delegation and adoption decisions | n=23; 387 delegation decisions; 1440 adoption decisions; 0.3
Human–AI collaboration performs better than either AI or humans alone. | positive | high | Team Performance: team performance (win rate/accuracy) of human–AI collaboration compared to AI-only and human-only | n=24; 0.3
Humans under-rely on correct AI suggestions, missing 3.9% of opportunities. | negative | high | Adoption Rate: rate of missed correct AI suggestions (under-reliance) | n=1440; 3.9% of opportunities missed; 0.3
Humans over-rely on AI when AI misleads them, occurring in 1.7% of opportunities. | negative | high | Adoption Rate: rate of over-reliance on incorrect AI suggestions | n=1440; 1.7% of opportunities; 0.3
Both humans and AI contribute wrong answers. | negative | high | Error Rate: contribution of incorrect answers by humans and by AI | 0.3
Reported model confidence is near chance when humans and AI disagree. | null_result | medium | Decision Quality: calibration/informativeness of model-reported confidence during disagreements | 0.09
Confirmation bias drives higher under-reliance (64.5%) when an AI suggestion agrees with humans' initial incorrect answer. | negative | medium | Decision Quality: under-reliance rate conditional on AI agreeing with human's initial incorrect answer | 64.5%; 0.18
To close this gap, we recommend calibrated confidence, evidence-grounded explanations, and mechanisms that help users refine trust. | positive | high | AI Safety and Ethics: improvements in human–AI trust and collaboration (proposed, not empirically tested here) | 0.05

Notes