Keeping human and AI judgments independent and resolving disagreements with a second human consistently beats the common AI-advisor model across ten tasks, from medical diagnosis to misinformation detection. The gain stems from people’s poor ability to distinguish correct from incorrect AI advice, so a simple tie-breaking hybrid design improves accuracy and transparency.
Artificial intelligence (AI) is broadly deployed as an advisor to human decision-makers: AI recommends a decision and a human accepts or rejects the advice. This approach, however, has several limitations: People frequently ignore accurate advice and rely too much on inaccurate advice, and their decision-making skills may deteriorate over time. Here, we compare the AI-as-advisor approach to the hybrid confirmation tree (HCT), an alternative strategy that preserves the independence of human and AI judgments. The HCT elicits a human judgment and an AI judgment independently of each other. If they agree, that decision is accepted. If not, a second human breaks the tie. For the comparison, we used 10 datasets from various domains, including medical diagnostics and misinformation discernment, and a subset of four datasets in which AI also explained its decision. The HCT outperformed the AI-as-advisor approach in all datasets. The HCT also performed better in almost all cases in which AI offered an explanation of its judgment. Using signal detection theory to interpret these results, we find that the HCT outperforms the AI-as-advisor approach because people cannot discriminate well enough between correct and incorrect AI advice. Overall, the HCT is a robust, accurate, and transparent alternative to the AI-as-advisor approach, offering a simple mechanism to tap into the wisdom of hybrid crowds.
Summary
Main Finding
The hybrid confirmation tree (HCT), in which a human and an AI independently render judgments and a second human breaks the tie when they disagree, outperforms the standard AI-as-advisor workflow (the AI gives advice and the human accepts or rejects it) across a wide range of real-world tasks. Pooled over 10 datasets, the HCT raised accuracy by about 4.45 percentage points (95% highest density interval [HDI]: 3.73–5.27) versus the AI-as-advisor approach, and it also outperformed explainable-AI (XAI) advisor conditions in most realistic settings.
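The decision rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names `hct_decision` and `hct_case_accuracy` and the example labels are hypothetical. It assumes each case has one AI label and several unaided human labels, and it averages over every ordered pair of distinct humans (first judge, tiebreaker), in the spirit of the pairwise-permutation procedure listed under Data & Methods.

```python
from itertools import permutations


def hct_decision(human1, ai, human2):
    """Hybrid confirmation tree: accept agreement, otherwise defer to the second human."""
    return human1 if human1 == ai else human2


def hct_case_accuracy(human_judgments, ai, truth):
    """Mean HCT accuracy for one case over all ordered pairs of distinct humans.

    Assumption: every human in `human_judgments` judged the case without
    seeing the AI output, so any of them can serve in either role.
    """
    pairs = list(permutations(human_judgments, 2))
    correct = sum(hct_decision(h1, ai, h2) == truth for h1, h2 in pairs)
    return correct / len(pairs)


# Hypothetical binary case: three unaided human labels, one AI label.
example_humans = [1, 0, 0]
example_accuracy = hct_case_accuracy(example_humans, ai=1, truth=1)
print(f"HCT accuracy for this case: {example_accuracy:.3f}")
```

The second human's judgment is only consulted on disagreement, which is why the tiebreaker slot adds labor in a fraction of cases rather than in all of them.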
Key Points
- Datasets and scale: 10 datasets from domains including medical diagnostics (skin cancer, colonoscopy), misinformation/deepfake detection, sentiment and deception in reviews, headline truthfulness, and criminal rearrest prediction; >41,000 human decisions by 1,229 people over 3,220 cases. An XAI subset covered 16 explanation conditions with an additional ~50,390 decisions (1,423 humans, 516 cases).
- Aggregate performance: HCT beat AI-as-advisor in every dataset (per-dataset improvements ranged ~0.2 to 6.6 percentage points). Pooled improvement ≈ 4.45 pp with effectively 100% probability of practical significance (ROPE ±1 pp).
- Mechanism driving gains:
  - AI was correct 77% of the time across datasets.
  - When human and AI disagree:
    - In AI-as-advisor, humans adopt correct AI advice only ~34% of the time (they often stick with their initial judgment), and they reject incorrect AI advice ~80% of the time.
    - In HCT, the independent human tiebreaker agrees with a correct-AI choice ~71% of the time (so HCT captures correct AI decisions much more often), but rejects incorrect-AI choices only ~47% of the time (so HCT endorses more incorrect AI decisions relative to advisor rejection).
  - Because correct AI advice is much more common, the HCT’s higher uptake of correct AI decisions yields net accuracy gains despite endorsing a larger share of incorrect AI cases.
- Expertise effects: HCT benefits all skill levels, largest gains for lower-skilled individuals (≈ +8 pp for low performers; +3.2 pp mid; +1.9 pp high). The second (tiebreaker) slot is particularly valuable when filled by mid/high performers.
- Explainability (XAI): Across 16 XAI conditions, HCT outperformed the XAI-as-advisor in 11 comparisons, matched in 2, and lost in 3—losses occurred when baseline human accuracy was near chance, making tiebreakers ineffective.
- Operational cost: HCT requires a second human whenever the first human and the AI disagree. Disagreement triggered tiebreaking in ~33% of cases on average (range ~22–49% across datasets), increasing human labor per decision.
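The trade-off in the mechanism bullets can be made concrete with a little arithmetic. The sketch below uses only the four conditional rates quoted above (34% and 80% for the advisor workflow, 71% and 47% for the HCT tiebreaker). It assumes a binary task, so that in a disagreement exactly one of human and AI is right, and it treats the share of disagreement cases in which the AI is right as a free parameter, because this summary reports only the AI's overall accuracy (77%), not its accuracy conditional on disagreement. The function names are hypothetical.

```python
# Conditional rates quoted in the key points (disagreement cases only).
ADVISOR = {"adopt_correct": 0.34, "reject_incorrect": 0.80}
HCT = {"adopt_correct": 0.71, "reject_incorrect": 0.47}


def disagreement_accuracy(p_ai_correct, adopt_correct, reject_incorrect):
    """Accuracy on disagreement cases, assuming a binary task.

    p_ai_correct is the (hypothetical) share of disagreement cases in
    which the AI is right; it is not reported in the summary.
    """
    return p_ai_correct * adopt_correct + (1 - p_ai_correct) * reject_incorrect


def break_even_share(advisor, hct):
    """Share of AI-correct disagreement cases at which both workflows tie."""
    gain_when_ai_right = hct["adopt_correct"] - advisor["adopt_correct"]
    loss_when_ai_wrong = advisor["reject_incorrect"] - hct["reject_incorrect"]
    return loss_when_ai_wrong / (gain_when_ai_right + loss_when_ai_wrong)


print(f"break-even share: {break_even_share(ADVISOR, HCT):.3f}")
for p in (0.4, 0.5, 0.6, 0.7):  # illustrative values only
    adv = disagreement_accuracy(p, **ADVISOR)
    hct = disagreement_accuracy(p, **HCT)
    print(f"p={p:.1f}  advisor={adv:.3f}  HCT={hct:.3f}")
```

Under these assumptions the HCT comes out ahead on disagreement cases whenever the AI is right in more than roughly 47% of them (0.33 / 0.70), which is consistent with the point that correct AI advice being more common drives the net gain.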
Data & Methods
- Workflow comparison:
  - AI-as-advisor: AI gives a recommendation; human sees it and makes the final choice.
  - HCT: Human 1 and AI independently decide; if they agree, the decision stands; if they disagree, Human 2 (an independent tiebreaker) chooses the final answer.
- Empirical procedure:
  - For each case, the authors generated all pairwise permutations of two unaided human decision-makers to simulate HCT outcomes, then compared these with the observed decisions of humans who received AI advice.
  - Analyses covered different AI output formats (labels, confidence, probability) and multiple XAI treatments (heatmaps, examples, top explanations, adaptive explanations).
- Statistics and modeling:
  - Bayesian estimation with a region of practical equivalence (ROPE) of ±1 percentage point to assess practical significance.
  - Separate models examined performance conditional on whether the AI was correct or incorrect.
  - A signal detection theory (SDT) style analytic model was developed to interpret the relative roles of (i) humans’ propensity to rely on AI and (ii) humans’ ability to discriminate between correct and incorrect AI advice. The SDT model reproduced empirical patterns and clarified that poor discrimination and insufficient reliance jointly explain low AI-advice uptake in the advisor workflow.
- Key empirical metrics reported: per-dataset accuracies, pooled effect sizes (pp improvements), tiebreak rates, adoption/rejection rates of AI advice under disagreement, and subgroup (expertise) analyses.
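To illustrate how an SDT reading separates discrimination from reliance, the sketch below applies the textbook equal-variance Gaussian indices (sensitivity d' and criterion c) to the adoption and rejection rates reported in the key points. This is an assumption-laden illustration and not necessarily the analytic model the authors developed: a "hit" is defined here as ending up with the AI's answer when the AI is correct, and a "false alarm" as ending up with it when the AI is incorrect (one minus the rejection rate). The function name `sdt_indices` is hypothetical.

```python
from statistics import NormalDist


def sdt_indices(hit_rate, false_alarm_rate):
    """Equal-variance Gaussian SDT indices (standard textbook definitions).

    d' measures how well correct and incorrect AI advice are told apart;
    the criterion c measures overall reluctance (c > 0) or readiness
    (c < 0) to end up with the AI's answer.
    """
    z = NormalDist().inv_cdf
    d_prime = z(hit_rate) - z(false_alarm_rate)
    criterion = -0.5 * (z(hit_rate) + z(false_alarm_rate))
    return d_prime, criterion


# Rates taken from the key points; the false-alarm rate is 1 - rejection rate.
advisor_d, advisor_c = sdt_indices(0.34, 1 - 0.80)
tiebreak_d, tiebreak_c = sdt_indices(0.71, 1 - 0.47)

print(f"advisor workflow: d'={advisor_d:.2f}, c={advisor_c:.2f}")
print(f"HCT tiebreaker:   d'={tiebreak_d:.2f}, c={tiebreak_c:.2f}")
```

With these inputs both sensitivity values come out below 0.5, while the criterion is positive for the advisor workflow and negative for the tiebreaker. Under this simplified reading, the two workflows differ mainly in how readily the AI's answer is taken up rather than in how well people tell good advice from bad, which fits the summary's point about poor discrimination combined with insufficient reliance.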
Implications for AI Economics
- Workflow design matters for realized value of AI:
  - The economic value of predictive AI depends not only on model accuracy but also on the human–AI interaction protocol. HCT yields measurable, consistent accuracy gains, which translate directly into economic value in high-stakes settings (e.g., fewer diagnostic errors, fewer erroneous rearrest predictions, less misinformation spread).
- Cost–benefit trade-offs:
  - HCT increases per-case human labor in disagreement cases (tiebreaking is needed in ~22–49% of cases, depending on the task). Organizations must weigh the marginal accuracy gain (mean ~4.5 pp pooled; larger for low-expertise workers) against labor costs of additional human involvement. Where mistakes are costly, the accuracy gains likely justify the extra human effort.
- Allocation of scarce human capital:
  - Because only a minority of cases require a tiebreaker, it is economically efficient to reserve higher-skilled (and more costly) experts for the second-human role. This leverages scarce expertise to maximize marginal gains, which is useful for staffing, scheduling, and compensation design.
- Regulation and liability:
  - The HCT aligns well with regulatory and ethical demands for human oversight because it preserves human independence and final approval. Regulators and procurement officers should consider mandating or favoring independent-judgment workflows (like HCT) in domains where human accountability and auditability are required.
- Explainability investments:
  - The study shows that simple aggregation (HCT) often outperforms investing in explainability for improving human uptake of correct AI advice. From a procurement/investment perspective, firms should not assume XAI always substitutes for better interaction design; HCT is a low-technical-change policy that can sometimes yield larger benefits.
- Market implications for AI product design:
  - Vendors might productize HCT-supporting interfaces or services (tools to randomize independent human assessments, pair humans as tiebreakers, or route disagreements to designated experts). Pricing and contracting could reflect reduced downstream error costs rather than just model accuracy.
- Measurement and evaluation changes:
  - Cost-benefit evaluations of deployed AI should include (a) how well humans can discriminate AI correctness and (b) human willingness to adopt AI advice. Metrics and audits should track disagreement frequency, tiebreak outcomes, and the conditional adoption/rejection rates, because these, not model accuracy alone, determine realized performance.
- Deskilling and human capital policy:
  - Because the HCT keeps the human decision upstream of AI influence (the human makes an independent judgment), it mitigates the deskilling risk associated with always-deferring-to-AI advisor workflows. This has implications for training investments, career progression, and long-run human capital maintenance.
- When HCT may not be optimal:
  - HCT is less valuable or even harmful when human decision-makers are at chance levels (tiebreakers cannot reliably resolve disagreements). In such settings, organizations should invest in training, improve AI accuracy, or consider alternative workflows.
Summary recommendation: For most practical, high-stakes applications where human decision-makers are reasonably skilled and human labor for occasional tiebreaking is available, switching from a sequential AI-as-advisor workflow to an independent-aggregation approach such as HCT is likely to increase realized accuracy and economic value. Organizations should evaluate the disagreement rate and tiebreaker staffing costs to decide whether and how to implement HCT.
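The evaluation suggested in the recommendation can be sketched as a back-of-the-envelope calculation. The tiebreak rate (~33%) and pooled gain (~4.45 pp) come from the summary above; every cost figure is a hypothetical placeholder, and the sketch assumes that the advisor workflow uses exactly one human judgment per case and that a tiebreaker's judgment costs the same as a first judge's. The function names are illustrative.

```python
def hct_net_benefit_per_case(tiebreak_rate, accuracy_gain_pp,
                             cost_per_judgment, cost_per_error):
    """Expected net benefit per case of switching from AI-as-advisor to HCT.

    Assumptions (hypothetical): one human judgment per case in the advisor
    workflow, one plus `tiebreak_rate` in the HCT, equal cost per judgment.
    """
    extra_labor = tiebreak_rate * cost_per_judgment
    avoided_errors = (accuracy_gain_pp / 100) * cost_per_error
    return avoided_errors - extra_labor


def break_even_error_cost(tiebreak_rate, accuracy_gain_pp, cost_per_judgment):
    """Error cost above which the extra tiebreaking labor pays for itself."""
    return tiebreak_rate * cost_per_judgment / (accuracy_gain_pp / 100)


# Rates from the summary; costs expressed in units of one human judgment.
threshold = break_even_error_cost(0.33, 4.45, cost_per_judgment=1.0)
print(f"An error must cost more than {threshold:.1f} judgments to break even")
```

Under these assumptions an error has to cost a bit more than seven times a single human judgment before the HCT pays off, which is easily met in high-stakes settings but may not be in low-stakes, high-volume ones.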
Assessment
Claims (8)
| Claim | Category | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|---|
| The hybrid confirmation tree (HCT) elicits a human judgment and an AI judgment independently; if they agree that decision is accepted, and if they disagree a second human breaks the tie. | Other | null_result | high | procedure_description | 1.0 |
| The study compared HCT to the AI-as-advisor approach using 10 datasets from various domains, including medical diagnostics and misinformation discernment. | Other | null_result | high | dataset_scope | n=10; 1.0 |
| A subset of four datasets included settings in which the AI provided explanations of its decision. | Other | null_result | high | presence_of_AI_explanation | n=4; 1.0 |
| The HCT outperformed the AI-as-advisor approach in all datasets. | Decision Quality | positive | high | decision accuracy / task performance | 1.0 |
| The HCT also performed better in almost all cases in which the AI offered an explanation of its judgment. | Decision Quality | positive | high | decision accuracy when AI provides explanations | n=4; 0.6 |
| Using signal detection theory, the paper finds that the HCT outperforms the AI-as-advisor approach because people cannot discriminate well enough between correct and incorrect AI advice. | Decision Quality | positive | high | discriminability between correct and incorrect AI advice (signal detection metrics, e.g., d') | 0.6 |
| The AI-as-advisor approach has limitations: people frequently ignore accurate advice, rely too much on inaccurate advice, and their decision-making skills may deteriorate over time. | Skill Obsolescence | negative | medium | skill deterioration / susceptibility to incorrect advice | 0.36 |
| Overall, the HCT is a robust, accurate, and transparent alternative to the AI-as-advisor approach, offering a simple mechanism to tap into the wisdom of hybrid crowds. | Decision Quality | positive | high | overall decision-making performance / robustness / transparency | 0.6 |