The Commonplace

Keeping human and AI judgments independent and resolving disagreements with a second human consistently beats the common AI-advisor model across ten tasks, from medical diagnosis to misinformation detection. The gain stems from people’s poor ability to distinguish correct from incorrect AI advice, so a simple tie-breaking hybrid design improves accuracy and transparency.

Beyond AI advice -- independent aggregation boosts human-AI accuracy
Julian Berger, Pantelis P. Analytis, Ville Satopää, Ralf H. J. M. Kurvers · March 31, 2026
arXiv · RCT · medium evidence · 7/10 relevance
A Hybrid Confirmation Tree that elicits independent human and AI judgments and uses a second human to resolve disagreements yields higher decision accuracy than the standard AI-as-advisor approach across ten datasets (including cases with AI explanations).

Artificial intelligence (AI) is broadly deployed as an advisor to human decision-makers: AI recommends a decision and a human accepts or rejects the advice. This approach, however, has several limitations: People frequently ignore accurate advice and rely too much on inaccurate advice, and their decision-making skills may deteriorate over time. Here, we compare the AI-as-advisor approach to the hybrid confirmation tree (HCT), an alternative strategy that preserves the independence of human and AI judgments. The HCT elicits a human judgment and an AI judgment independently of each other. If they agree, that decision is accepted. If not, a second human breaks the tie. For the comparison, we used 10 datasets from various domains, including medical diagnostics and misinformation discernment, and a subset of four datasets in which AI also explained its decision. The HCT outperformed the AI-as-advisor approach in all datasets. The HCT also performed better in almost all cases in which AI offered an explanation of its judgment. Using signal detection theory to interpret these results, we find that the HCT outperforms the AI-as-advisor approach because people cannot discriminate well enough between correct and incorrect AI advice. Overall, the HCT is a robust, accurate, and transparent alternative to the AI-as-advisor approach, offering a simple mechanism to tap into the wisdom of hybrid crowds.
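
The decision rule described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `hybrid_confirmation_tree`, the callable tiebreaker, and the example labels are all assumptions made here.

```python
from typing import Callable, Hashable


def hybrid_confirmation_tree(
    human_1: Hashable,
    ai: Hashable,
    tiebreaker: Callable[[], Hashable],
) -> Hashable:
    """Return the HCT decision for one case.

    human_1 and ai are judgments elicited independently of each other.
    tiebreaker is called only on disagreement and returns the second
    human's independent judgment.
    """
    if human_1 == ai:
        return human_1      # agreement: accept the shared decision
    return tiebreaker()     # disagreement: the second human decides


# Hypothetical labels, for illustration only.
agree = hybrid_confirmation_tree("benign", "benign", lambda: "malignant")
split = hybrid_confirmation_tree("benign", "malignant", lambda: "malignant")
```

Passing the tiebreaker as a callable reflects that the second human is consulted only when the first two judgments differ.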

Summary

Main Finding

The hybrid confirmation tree (HCT)—where a human and an AI independently render judgments and a second human breaks ties when they disagree—outperforms the standard AI-as-advisor workflow (AI gives advice and the human accepts/rejects) across a wide range of real-world tasks. Pooled over 10 datasets, the HCT raised accuracy by ~4.45 percentage points (95% HDI 3.73–5.27) versus the AI-as-advisor approach, and it also outperformed explainable-AI (XAI) advisor conditions in most realistic settings.

Key Points

  • Datasets and scale: 10 datasets from domains including medical diagnostics (skin cancer, colonoscopy), misinformation/deepfake detection, sentiment and deception in reviews, headline truthfulness, and criminal rearrest prediction; >41,000 human decisions by 1,229 people over 3,220 cases. An XAI subset covered 16 explanation conditions with an additional ~50,390 decisions (1,423 humans, 516 cases).
  • Aggregate performance: HCT beat AI-as-advisor in every dataset (per-dataset improvements ranged ~0.2 to 6.6 percentage points). Pooled improvement ≈ 4.45 pp with effectively 100% probability of practical significance (ROPE ±1 pp).
  • Mechanism driving gains:
    • AI was correct 77% of the time across datasets.
    • When human and AI disagree:
      • In AI-as-advisor, humans adopt correct AI advice only ~34% of the time (they often stick with their initial judgment), and they reject incorrect AI advice ~80% of the time.
      • In HCT, the independent human tiebreaker agrees with a correct-AI choice ~71% of the time (so HCT captures correct AI decisions much more often), but rejects incorrect-AI choices only ~47% of the time (so HCT endorses incorrect AI decisions more often than humans in the advisor workflow do).
    • Because correct AI advice is much more common, the HCT’s higher uptake of correct AI decisions yields net accuracy gains despite endorsing a larger share of incorrect AI cases.
  • Expertise effects: HCT benefits all skill levels, with the largest gains for lower-skilled individuals (≈ +8 pp for low performers; +3.2 pp mid; +1.9 pp high). The second (tiebreaker) slot is particularly valuable when filled by mid/high performers.
  • Explainability (XAI): Across 16 XAI conditions, HCT outperformed the XAI-as-advisor in 11 comparisons, matched in 2, and lost in 3—losses occurred when baseline human accuracy was near chance, making tiebreakers ineffective.
  • Operational cost: HCT requires a second human whenever Human 1 and the AI disagree. Disagreement triggered tiebreaking in ~33% of cases on average (range ~22–49% across datasets), increasing human labor per decision.
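
The trade-off in the mechanism bullets can be made concrete with a short calculation. The sketch below uses the pooled rates quoted above and assumes a binary task, so that following a correct AI or rejecting an incorrect one always yields the right answer. The share of disagreement cases in which the AI is correct is not reported in this summary, so it is left as a free parameter; the constant and function names are illustrative.

```python
# Pooled rates quoted above, all conditional on human-AI disagreement.
ADVISOR_ADOPT_CORRECT = 0.34   # human adopts AI advice when the AI is correct
ADVISOR_REJECT_WRONG = 0.80    # human rejects AI advice when the AI is wrong
HCT_ADOPT_CORRECT = 0.71       # tiebreaker sides with a correct AI
HCT_REJECT_WRONG = 0.47        # tiebreaker sides against an incorrect AI


def accuracy_on_disagreement(p_ai_correct, adopt_correct, reject_wrong):
    """Accuracy on disagreement cases, assuming a binary task.

    p_ai_correct is the (unreported, assumed) share of disagreement
    cases in which the AI is the one that is right.
    """
    return p_ai_correct * adopt_correct + (1 - p_ai_correct) * reject_wrong


def break_even_share():
    """Share of disagreements with a correct AI above which HCT wins."""
    gain_when_right = HCT_ADOPT_CORRECT - ADVISOR_ADOPT_CORRECT   # 0.37
    loss_when_wrong = ADVISOR_REJECT_WRONG - HCT_REJECT_WRONG     # 0.33
    return loss_when_wrong / (gain_when_right + loss_when_wrong)
```

Under these assumptions the break-even share is about 0.47: HCT comes out ahead on disagreement cases whenever the AI is the correct party in more than roughly 47% of them. For a hypothetical share of 0.6, the formula gives about 52.4% accuracy for the advisor workflow and 61.4% for the HCT on those cases.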

Data & Methods

  • Workflow comparison:
    • AI-as-advisor: AI gives a recommendation; human sees it and makes the final choice.
    • HCT: Human 1 and AI independently decide; if they agree the decision stands; if they disagree, Human 2 (an independent tiebreaker) chooses the final answer.
  • Empirical procedure:
    • For each case, the authors generated all pairwise permutations of two unaided human decision-makers to simulate HCT outcomes, then compared these with observed human decisions made with AI advice.
    • Analyses covered different AI output formats (labels, confidence, probability) and multiple XAI treatments (heatmaps, examples, top explanations, adaptive explanations).
  • Statistics and modeling:
    • Bayesian estimation with a region of practical equivalence (ROPE) of ±1 percentage point to assess practical significance.
    • Separate models examined performance conditional on whether the AI was correct or incorrect.
    • A signal detection theory (SDT) style analytic model was developed to interpret the relative roles of (i) humans’ propensity to rely on AI and (ii) humans’ ability to discriminate between correct and incorrect AI advice. The SDT model reproduced empirical patterns and clarified that poor discrimination and insufficient reliance jointly explain low AI-advice uptake in the advisor workflow.
  • Key empirical metrics reported: per-dataset accuracies, pooled effect sizes (pp improvements), tiebreak rates, adoption/rejection rates of AI advice under disagreement, and subgroup (expertise) analyses.
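
The pairwise-permutation procedure above can be sketched as follows. This is an illustration under assumptions, not the authors' code: it supposes each case has a list of unaided human judgments, one AI judgment, and a ground-truth label, and the function name is hypothetical.

```python
from itertools import permutations


def simulate_hct_accuracy(human_labels, ai_label, truth):
    """Mean HCT accuracy for one case over all ordered pairs of humans.

    human_labels holds unaided judgments from different people on the
    same case. In each ordered pair, the first person acts as Human 1
    and the second as the tiebreaker.
    """
    if len(human_labels) < 2:
        raise ValueError("need at least two human judgments per case")
    outcomes = []
    for first, second in permutations(human_labels, 2):
        # Agreement with the AI stands; otherwise the second human decides.
        decision = first if first == ai_label else second
        outcomes.append(decision == truth)
    return sum(outcomes) / len(outcomes)
```

Because `permutations` pairs elements by position, two people who gave the same label are still treated as distinct decision-makers, and each person serves in both roles across the pairs.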

Implications for AI Economics

  • Workflow design matters for realized value of AI:
    • The economic value of predictive AI depends not only on model accuracy but also on the human–AI interaction protocol. HCT yields measurable, consistent accuracy gains—translating directly into economic value in high-stakes settings (e.g., fewer diagnostic errors, reduced false arrests, less misinformation spread).
  • Cost–benefit trade-offs:
    • HCT increases per-case human labor because disagreements require a tiebreaker (~22–49% of cases, depending on task). Organizations must weigh the marginal accuracy gain (mean ~4.5 pp pooled; larger for low-expertise workers) against labor costs of additional human involvement. Where mistakes are costly, the accuracy gains likely justify the extra human effort.
  • Allocation of scarce human capital:
    • Because only a minority of cases require a tiebreaker, it is economically efficient to reserve higher-skilled (and more costly) experts for the second-human role. This leverages scarce expertise to maximize marginal gains—useful for staffing, scheduling, and compensation design.
  • Regulation and liability:
    • The HCT aligns well with regulatory and ethical demands for human oversight because it preserves human independence and final approval. Regulators and procurement officers should consider mandating or favoring independent-judgment workflows (like HCT) in domains where human accountability and auditability are required.
  • Explainability investments:
    • The study shows that simple aggregation (HCT) often outperforms investing in explainability for improving human uptake of correct AI advice. From a procurement/investment perspective, firms should not assume XAI always substitutes for better interaction design; HCT is a low-technical-change policy that can sometimes yield larger benefits.
  • Market implications for AI product design:
    • Vendors might productize HCT-supporting interfaces or services (tools to randomize independent human assessments, pair humans as tiebreakers, or route disagreements to designated experts). Pricing and contracting could reflect reduced downstream error costs rather than just model accuracy.
  • Measurement and evaluation changes:
    • Cost-benefit evaluations of deployed AI should include (a) how well humans can discriminate AI correctness and (b) human willingness to adopt AI advice. Metrics and audits should track disagreement frequency, tiebreak outcomes, and the conditional adoption/rejection rates—these determine realized performance, not model accuracy alone.
  • Deskilling and human capital policy:
    • Because the HCT keeps the human decision upstream of AI influence (human makes independent judgment), it mitigates the deskilling risk associated with always-deferring-to-AI advisor workflows. This has implications for training investments, career progression, and long-run human capital maintenance.
  • When HCT may not be optimal:
    • HCT is less valuable or even harmful when human decision-makers are at chance levels (tiebreakers cannot reliably resolve disagreements). In such settings, organizations should invest in training, improve AI accuracy, or consider alternative workflows.
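
The audit metrics named under "Measurement and evaluation changes" can be computed from a decision log along these lines. The `AdvisedDecision` schema is hypothetical, and it assumes the deployment records the human's judgment before the AI advice is shown, which not every advisor workflow does.

```python
from dataclasses import dataclass


@dataclass
class AdvisedDecision:
    """One logged advisor-workflow decision (hypothetical schema)."""
    initial: int   # human judgment before seeing the AI
    ai: int        # AI recommendation
    final: int     # human judgment after seeing the AI
    truth: int     # ground-truth label


def audit(log):
    """Disagreement rate plus adoption/rejection rates under disagreement."""
    disagreements = [d for d in log if d.initial != d.ai]
    ai_right = [d for d in disagreements if d.ai == d.truth]
    ai_wrong = [d for d in disagreements if d.ai != d.truth]

    def rate(hits, total):
        # NaN marks a rate that cannot be estimated from an empty group.
        return hits / total if total else float("nan")

    return {
        "disagreement_rate": rate(len(disagreements), len(log)),
        "adopt_correct_ai": rate(
            sum(d.final == d.ai for d in ai_right), len(ai_right)
        ),
        "reject_incorrect_ai": rate(
            sum(d.final != d.ai for d in ai_wrong), len(ai_wrong)
        ),
    }
```

The last two rates correspond to the ~34% and ~80% figures reported for the advisor workflow in Key Points.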

Summary recommendation: For most practical, high-stakes applications where human decision-makers are reasonably skilled and human labor for occasional tiebreaking is available, switching from a sequential AI-as-advisor workflow to an independent-aggregation approach such as HCT is likely to increase realized accuracy and economic value. Organizations should evaluate the disagreement rate and tiebreaker staffing costs to decide whether and how to implement HCT.
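
A rough way to run the evaluation suggested above is to ask how costly an error must be before the extra tiebreaking labor pays for itself. The sketch below is a back-of-the-envelope model with assumptions stated in the docstring. It ignores AI costs, differences in time per judgment, and the higher wages of expert tiebreakers, and the function name is illustrative.

```python
def hct_break_even_error_cost(disagreement_rate, accuracy_gain, cost_per_judgment):
    """Error cost above which HCT's extra labor pays for itself.

    Assumes the advisor workflow uses one human judgment per case and
    the HCT uses one plus a second on the share of cases with a
    human-AI disagreement. accuracy_gain is a proportion (0.0445 for
    4.45 percentage points).
    """
    if accuracy_gain <= 0:
        raise ValueError("accuracy_gain must be positive")
    extra_labor_cost = disagreement_rate * cost_per_judgment
    return extra_labor_cost / accuracy_gain


# Pooled figures from this summary, with cost measured in units of one judgment.
pooled = hct_break_even_error_cost(0.33, 0.0445, 1.0)
```

With the pooled figures (about 33% disagreement, 4.45 pp gain), the HCT breaks even under this simple model when an avoided error is worth roughly 7.4 times the cost of one human judgment.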

Assessment

Paper Type: rct
Evidence Strength: medium. Consistent performance improvements for HCT across ten diverse datasets and use of signal-detection theory strengthen internal validity, but external validity is limited by the set of tasks, unspecified participant populations and incentives, limited range of AI models, and absence of field or longitudinal evidence on real-world adoption and dynamics.
Methods Rigor: medium. The study uses a clear, reproducible protocol, multiple datasets, and formal analysis (signal detection theory) to interpret mechanisms, which indicates good rigor; however, important details that affect rigor are not reported here or may be limited (sample representativeness, randomization procedures and balance checks, pre-registration, specifics of AI models and explanation designs), reducing confidence in broader robustness.
Sample: Ten labeled datasets spanning domains such as medical diagnostics and misinformation detection; human decision-makers provided judgments (protocol-dependent) and AI systems supplied predictions (and in four datasets, explanations); disagreements resolved by a second human in HCT. Exact participant recruitment, sample sizes, expertise levels, AI architectures, and incentives are not specified in the summary.
Themes: human_ai_collab, org_design
Identification: Randomized controlled comparison of two decision protocols: (1) AI-as-advisor, where AI recommends and a human accepts/rejects, and (2) Hybrid Confirmation Tree (HCT), where human and AI judgments are elicited independently and a second human breaks ties when they disagree; outcomes compared across ten tasks/datasets (with a four-dataset subset for explanation conditions). Causal inference relies on experimental assignment to protocol and within-task comparisons of decision accuracy.
Generalizability:
  • Limited domain coverage (medical diagnostics and misinformation primarily); other tasks may behave differently
  • Unclear representativeness of human raters (crowdworkers vs trained professionals), affecting transfer to expert settings
  • Specific AI models and explanation methods used may not reflect state-of-the-art or production systems
  • Controlled experimental setting may not capture real-world workflows, incentives, time pressure, or costs
  • Static datasets may fail to capture dynamic environments where AI and humans co-adapt over time

Claims (8)

Claim | Direction | Confidence | Outcome | Details
The hybrid confirmation tree (HCT) elicits a human judgment and an AI judgment independently; if they agree that decision is accepted, and if they disagree a second human breaks the tie. [Other] | null_result | high | procedure_description | 1.0
The study compared HCT to the AI-as-advisor approach using 10 datasets from various domains, including medical diagnostics and misinformation discernment. [Other] | null_result | high | dataset_scope | n=10; 1.0
A subset of four datasets included settings in which the AI provided explanations of its decision. [Other] | null_result | high | presence_of_AI_explanation | n=4; 1.0
The HCT outperformed the AI-as-advisor approach in all datasets. [Decision Quality] | positive | high | decision accuracy / task performance | 1.0
The HCT also performed better in almost all cases in which the AI offered an explanation of its judgment. [Decision Quality] | positive | high | decision accuracy when AI provides explanations | n=4; 0.6
Using signal detection theory, the paper finds that the HCT outperforms the AI-as-advisor approach because people cannot discriminate well enough between correct and incorrect AI advice. [Decision Quality] | positive | high | discriminability between correct and incorrect AI advice (signal detection metrics, e.g., d') | 0.6
The AI-as-advisor approach has limitations: people frequently ignore accurate advice, rely too much on inaccurate advice, and their decision-making skills may deteriorate over time. [Skill Obsolescence] | negative | medium | skill deterioration / susceptibility to incorrect advice | 0.36
Overall, the HCT is a robust, accurate, and transparent alternative to the AI-as-advisor approach, offering a simple mechanism to tap into the wisdom of hybrid crowds. [Decision Quality] | positive | high | overall decision-making performance / robustness / transparency | 0.6

Notes