Consulting small panels of AIs improves decision accuracy compared with a single AI, but adding more systems yields no extra benefit; unanimous AI advice encourages harmful conformity while a lone dissent tempers it, and human-like interfaces boost perceived usefulness without increasing pressure to conform.

More Isn't Always Better: Balancing Decision Accuracy and Conformity Pressures in Multi-AI Advice

Yuta Tsuchiya, Yukino Baba · March 23, 2026

arXiv · RCT · medium evidence · 7/10 relevance

Small panels of AIs improve human decision accuracy relative to a single AI, unanimous AI consensus increases overreliance, a single dissent reduces conformity pressure, wide disagreement causes confusion, and human-like presentation raises perceived usefulness without increasing conformity.

Just as people improve decision-making by consulting diverse human advisors, they can now also consult with multiple AI systems. Prior work on group decision-making shows that advice aggregation creates pressure to conform, leading to overreliance. However, the conditions under which multi-AI consultation improves or undermines human decision-making remain unclear. We conducted experiments with three tasks in which participants received advice from panels of AIs. We varied panel size, within-panel consensus, and the human-likeness of presentation. Accuracy improved for small panels relative to a single AI; larger panels yielded no gains. The level of within-panel consensus affected participants' reliance on AI advice: High consensus fostered overreliance; a single dissent reduced pressure to conform; wide disagreement created confusion and undermined appropriate reliance. Human-like presentations increased perceived usefulness and agency in certain tasks, without raising conformity pressure. These findings yield design implications for presenting multi-AI advice that preserve accuracy while mitigating conformity.

Summary

Main Finding

Small panels of diverse AIs (e.g., 3 advisors) improve human decision accuracy relative to a single AI, but increasing panel size further (e.g., to 5) yields no incremental accuracy gains. The pattern of opinions within a panel matters critically: high consensus among AIs encourages overreliance (conformity) and can harm accuracy; a single dissent reduces conformity pressure and helps calibration; wide disagreement increases confusion and undermines appropriate use of advice. Making AI presentations more human-like raises perceived usefulness and agency for some tasks but does not systematically increase conformity or performance.

Key Points

Panel size
- 3-AI panels improved accuracy over a single AI.
- 5-AI panels produced no further accuracy improvement (diminishing returns; non‑linear relationship).
Within-panel consensus
- High consensus across AIs fosters overreliance (informational conformity) and can reduce human corrective behavior.
- One dissenting opinion in an otherwise-consensus panel reduces conformity pressure and promotes more appropriate reliance.
- Strong disagreement (wide spread) increases cognitive load/confusion, lowering effective reliance and performance.
Human-likeness / presentation style
- Human-like tone/persona increased subjective usefulness and sense of agency in some tasks.
- The effect of human-likeness on conformity varied across individuals, but on average it did not worsen conformity or objective performance.
Practical interpretation: “More AI advisors” is not unambiguously better. The number of advisors, the distribution of their opinions, and the presentation format jointly determine whether multi-AI advice improves or undermines decision quality.

Data & Methods

Design
- Two studies (between- and within-subject elements) with a total N = 348 participants (260 in Study 1; 88 in Study 2).
- Tasks: three binary prediction tasks — Income (UCI Adult), Recidivism (COMPAS), and Dating (speed-dating dataset). Tasks chosen to be moderately difficult (human-only accuracy ~60–70%).
- Procedure: Judge-Advisor System (JAS). Participants give an initial prediction and confidence, receive the AI advice with natural-language explanations and attention checks, then give a final prediction and confidence.
Manipulations
- Study 1: Panel size varied between participants (1, 3, or 5 AIs); within-panel consensus varied across trials (full consensus, single dissent, wide disagreement).
- Study 2: Fixed 3-AI panels; varied anthropomorphic/human-like presentation of AI explanations/personas.
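
The three consensus patterns can be made concrete with a small classifier over a panel's binary votes. This is only an illustrative sketch: the function name `classify_consensus` and the thresholds (no minority votes, exactly one, two or more) are assumptions, since the summary does not state how the paper operationalized each pattern.

```python
from collections import Counter


def classify_consensus(votes):
    """Label a panel of predictions by its vote split.

    Assumed thresholds (not taken from the paper):
    no minority votes -> "full_consensus",
    exactly one       -> "single_dissent",
    two or more       -> "wide_disagreement".
    """
    if not votes:
        raise ValueError("panel must contain at least one vote")
    counts = Counter(votes)
    majority_size = counts.most_common(1)[0][1]
    minority_size = len(votes) - majority_size
    if minority_size == 0:
        return "full_consensus"
    if minority_size == 1:
        return "single_dissent"
    return "wide_disagreement"


print(classify_consensus([1, 1, 1]))        # full_consensus
print(classify_consensus([1, 1, 0]))        # single_dissent
print(classify_consensus([1, 0, 1, 0, 1]))  # wide_disagreement
```
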
AI advisors & explanations
- Advisors simulated by sampling individual decision trees from a random-forest Rashomon set calibrated to 70% accuracy (selected trees each had 35/50 correct on test cases).
- Explanations generated by extracting top-3 SHAP features per tree and converting these into concise natural-language explanations via GPT-4o (LLM-in-the-loop explanation pipeline).
- In multi-AI conditions, models were sampled per participant to create natural variation in predicted labels and rationales.
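
The selection step described above (keep only advisors with exactly 35 of 50 test cases correct, then sample a panel per participant) can be sketched with the standard library alone. The candidate pool here is a hypothetical stand-in: each "model" is just a vector of predictions built by flipping some true labels, not a tree from an actual random forest, and `sample_advisors` is an illustrative name rather than anything from the paper.

```python
import random


def sample_advisors(candidates, labels, target_correct, panel_size, rng):
    """Sample panel_size candidates that get exactly target_correct labels right."""
    eligible = [
        preds for preds in candidates
        if sum(p == y for p, y in zip(preds, labels)) == target_correct
    ]
    if len(eligible) < panel_size:
        raise ValueError("not enough candidates at the target accuracy")
    return rng.sample(eligible, panel_size)


rng = random.Random(0)
labels = [rng.randint(0, 1) for _ in range(50)]  # synthetic ground truth

# Hypothetical candidate pool: copy the true labels and flip 10 to 20 of them,
# so accuracies range from 60% to 80%.
candidates = []
for i in range(500):
    n_wrong = 10 + (i % 11)
    wrong = set(rng.sample(range(50), n_wrong))
    candidates.append([1 - y if j in wrong else y for j, y in enumerate(labels)])

# One participant's 3-AI panel: every member is 35/50 = 70% accurate,
# but members can disagree on individual cases.
panel = sample_advisors(candidates, labels, target_correct=35, panel_size=3, rng=rng)
```

Because every selected advisor has the same overall accuracy but errs on different cases, the panel naturally produces a mix of unanimous and split trials.
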
Measures & quality control
- Primary outcomes: change in accuracy from initial to final decision (reliance/adjustment), confidence, subjective measures (perceived usefulness, agency), and conformity indicators.
- Attention checks and response-quality screening; participants restricted to adults in Asia fluent in Japanese; no external AI use allowed.
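
To illustrate how the initial/advice/final structure of the JAS procedure yields reliance measures, the sketch below computes an accuracy gain, a switch rate, and an overreliance rate from trial records. The `Trial` fields, the metric names, and the definition of overreliance as "switching to advice that turned out to be wrong" are assumptions for illustration; the paper's exact operationalizations are not given in this summary.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    initial: int  # participant's prediction before seeing advice
    advice: int   # label recommended by the panel (assumed: its majority)
    final: int    # participant's prediction after seeing advice
    truth: int    # ground-truth label


def summarize(trials):
    n = len(trials)
    initial_acc = sum(t.initial == t.truth for t in trials) / n
    final_acc = sum(t.final == t.truth for t in trials) / n
    # Only trials where the participant first disagreed with the advice
    # can reveal whether the advice changed their mind.
    disagreed = [t for t in trials if t.initial != t.advice]
    switched = [t for t in disagreed if t.final == t.advice]
    overreliant = [t for t in switched if t.advice != t.truth]
    return {
        "accuracy_gain": final_acc - initial_acc,
        "switch_rate": len(switched) / len(disagreed) if disagreed else 0.0,
        "overreliance_rate": len(overreliant) / len(disagreed) if disagreed else 0.0,
    }


trials = [
    Trial(initial=0, advice=1, final=1, truth=1),  # switched to correct advice
    Trial(initial=0, advice=1, final=1, truth=0),  # switched to wrong advice
    Trial(initial=1, advice=1, final=1, truth=1),  # already agreed
    Trial(initial=0, advice=1, final=0, truth=1),  # ignored correct advice
]
result = summarize(trials)
print(result)
```

In this toy data the helpful switch and the harmful switch cancel out, so accuracy is unchanged even though the participant followed the advice in two of the three disagreement trials.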

Implications for AI Economics

Product design and platform strategy
- Optimal product offerings: platforms should not assume more models always add value. Small curated panels (e.g., 3 diverse models) may maximize user accuracy and satisfaction while minimizing cognitive cost.
- Interface design matters economically: presenting one dissenting opinion or highlighting minority views can be a low-cost design lever to reduce harmful conformity and improve decision quality.
- Aggregation vs. plural presentation: naive display of many model outputs can create information overload or herding; platforms should offer summarized statistics (majority vote, calibrated uncertainty) alongside curated dissent.
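
One way a platform might pair a summary statistic with visible dissent is sketched below. The function `summarize_panel`, the advisor names, the labels, and the rationales are all hypothetical; the sketch only shows the shape of such a display payload, not a design tested in the paper.

```python
from collections import Counter


def summarize_panel(advice):
    """advice: list of (advisor_name, label, rationale) tuples.

    Returns the majority label, the share of advisors backing it,
    and the dissenting entries so the interface can surface them.
    Ties resolve to the label seen first (Counter ordering).
    """
    counts = Counter(label for _, label, _ in advice)
    majority_label, majority_votes = counts.most_common(1)[0]
    dissent = [entry for entry in advice if entry[1] != majority_label]
    return {
        "majority_label": majority_label,
        "agreement": majority_votes / len(advice),
        "dissent": dissent,
    }


advice = [
    ("AI-1", "above_50k", "long weekly hours"),
    ("AI-2", "above_50k", "advanced degree"),
    ("AI-3", "below_50k", "younger applicant"),
]
summary = summarize_panel(advice)
print(summary["majority_label"], round(summary["agreement"], 2))
for name, label, why in summary["dissent"]:
    print(f"Dissent from {name}: {label} ({why})")
```
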
Market structure and model differentiation
- Value of true diversity: economic welfare gains from a marketplace of models depend on model independence. If many providers are trained on similar data, apparent consensus can create systemic herding risk (correlated errors) and mislead users.
- Product heterogeneity matters: incentives for model providers to differentiate (methodologically or data-wise) can increase the social value of multi-AI advice, but unchecked proliferation without meaningful diversity yields little marginal benefit.
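
The role of independence can be illustrated with a standard binomial calculation: the accuracy of a simple majority vote over n advisors that are each correct with probability p and err independently. This is an idealized benchmark, not a result from the paper. It describes the panel's aggregate vote rather than the human's final decision, and the independence assumption is exactly what fails when models are trained on similar data.

```python
from math import comb


def majority_vote_accuracy(n, p):
    """Probability that a strict majority of n independent advisors is correct.

    Each advisor is assumed correct with probability p, independently.
    n must be odd so that ties cannot occur.
    """
    if n % 2 == 0:
        raise ValueError("use an odd panel size to avoid ties")
    k_min = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


for n in (1, 3, 5):
    print(n, round(majority_vote_accuracy(n, 0.7), 3))
# 1 0.7
# 3 0.784
# 5 0.837
```

Under independence the gain from 3 to 5 advisors (about 0.05) is smaller than the gain from 1 to 3 (about 0.08), and with perfectly correlated advisors the panel is no better than a single 70%-accurate model regardless of size.
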
Consumer behavior, adoption, and willingness to pay
- Human-likeness increases perceived usefulness and agency, which may raise adoption or willingness to pay even when objective accuracy is unaffected — implying firms can monetize presentation/UX separately from model performance.
- However, increased perceived usefulness may encourage more reliance; regulators and firms should ensure that perception aligns with calibrated performance to avoid overtrust.
Welfare, externalities, and regulation
- Herding risk: high apparent consensus among deployed models can induce conformity and amplify errors at scale (systemic risk). Regulators should consider disclosure of model provenance, correlation, and uncertainty.
- Mandatory or recommended UI features (e.g., visible minority opinions, uncertainty bands, or model provenance labels) could mitigate conformity externalities and improve aggregate decision quality.
- Procurement and evaluation: institutions buying multi-AI decision support should evaluate panels for diversity and robustness, not just raw model count or averaged accuracy.
Pricing and bundling
- Bundles of complementary (diverse) models are more valuable than bundles of many similar ones. Pricing strategies should reflect marginal value of additional models (diminishing returns) and the cost of potential conformity harms.
Labor-market & competition effects
- Multi-AI setups that appear more persuasive (human-like) may displace certain advisory roles faster. Conversely, systems that provide curated dissent could support human experts by improving calibration.
Research & measurement implications for economists
- When modeling information aggregation markets, incorporate behavioral conformity and cognitive cost parameters: the social value of an additional advisor is endogenous to users’ conformity responses and interface design.
- Policy evaluations (e.g., mandated labeling, audit regimes) should account for how presentation and within-panel consensus shape human users’ reliance, not only model accuracy metrics.

Caveats
- Laboratory/crowdsourced setting with simulated decision trees and LLM-generated explanations; external validity to high-stakes real-world settings needs further field work.
- Participant pool, language, and task scope (three binary tasks) limit generalizability; model-training correlations in real markets may be stronger than simulated here.
- The AIs were calibrated to ~70% accuracy; results could differ with much stronger or weaker models.

Overall takeaway for AI economics: count of AI advisors is an imperfect proxy for decision quality. Market and platform designers should focus on curated diversity, dissent visibility, calibrated explanations, and interface features that mitigate conformity to capture the genuine economic value of multi-AI advice.

Assessment

Paper Type: rct

Evidence Strength: medium. Strong internal validity from randomized manipulation of key treatment variables across multiple tasks gives credible causal estimates about how panel features affect decision-making; however, external validity is limited by the specific lab/online tasks, likely convenience sample, short-term interactions, and absence of real-world stakes or longitudinal follow-up.

Methods Rigor: medium. Design uses well-chosen, orthogonal manipulations and multiple tasks, which increases robustness; but the writeup does not report (in the provided summary) pre-registration, sample sizes, power calculations, or field/real-world validation, and there may be task- or platform-specific demand effects and limited heterogeneity analysis.

Sample: Online experimental participants completing three decision-making tasks who received AI panel advice under randomized conditions varying panel size, within-panel consensus, and presentation style; outcome measures include decision accuracy, reliance on AI advice, and perceived usefulness/agency (demographic details, recruitment platform, and sample size not specified in the summary).

Themes: human_ai_collab, org_design, productivity

Identification: Randomized controlled experiments. Participants were randomly assigned across factorial manipulations of panel size (single AI vs small vs large panels), within-panel consensus (high consensus, single dissent, wide disagreement), and presentation style (human-like vs generic); causal effects are identified by comparing outcomes (decision accuracy, reliance measures, perceived usefulness/agency) across these randomized conditions.

Generalizability
- Tasks are artificial/low-stakes relative to many real-world decision contexts (limited ecological validity).
- Likely convenience online sample (e.g., MTurk/Prolific or lab students), limiting demographic representativeness.
- Short-term exposure; longer-term learning or adaptation effects unobserved.
- Specific types of AI advice/presentation used may not generalize to commercial AI systems or domain experts.
- Panel sizes and consensus patterns tested may not cover organizational settings with different communication structures.

Claims (7)

Claim	Category	Direction	Confidence	Outcome	Details
Accuracy improved for small panels relative to a single AI.	Decision Quality	positive	high	accuracy	0.6
Larger panels yielded no gains in accuracy.	Decision Quality	null_result	high	accuracy	0.6
High within-panel consensus fostered overreliance on AI advice.	Decision Quality	negative	high	reliance on AI advice (overreliance)	0.6
A single dissent within a panel reduced pressure to conform.	Decision Quality	positive	high	pressure to conform / reliance on AI advice	0.6
Wide disagreement among AIs created confusion and undermined appropriate reliance on advice.	Decision Quality	negative	high	appropriate reliance on advice / decision-making	0.6
Human-like presentations increased perceived usefulness and agency in certain tasks.	Worker Satisfaction	positive	high	perceived usefulness and perceived agency	0.6
Human-like presentations did not raise conformity pressure.	Decision Quality	null_result	high	conformity pressure	0.6