Consulting small panels of AIs improves decision accuracy compared with a single AI, but adding more systems yields no extra benefit; unanimous AI advice encourages harmful conformity while a lone dissent tempers it, and human-like interfaces boost perceived usefulness without increasing pressure to conform.
Just as people improve decision-making by consulting diverse human advisors, they can now also consult with multiple AI systems. Prior work on group decision-making shows that advice aggregation creates pressure to conform, leading to overreliance. However, the conditions under which multi-AI consultation improves or undermines human decision-making remain unclear. We conducted experiments with three tasks in which participants received advice from panels of AIs. We varied panel size, within-panel consensus, and the human-likeness of presentation. Accuracy improved for small panels relative to a single AI; larger panels yielded no gains. The level of within-panel consensus affected participants' reliance on AI advice: High consensus fostered overreliance; a single dissent reduced pressure to conform; wide disagreement created confusion and undermined appropriate reliance. Human-like presentations increased perceived usefulness and agency in certain tasks, without raising conformity pressure. These findings yield design implications for presenting multi-AI advice that preserve accuracy while mitigating conformity.
Summary
Main Finding
Small panels of diverse AIs (e.g., 3 advisors) improve human decision accuracy relative to a single AI, but increasing panel size further (e.g., to 5) yields no incremental accuracy gains. The pattern of opinions within a panel matters critically: high consensus among AIs encourages overreliance (conformity) and can harm accuracy; a single dissent reduces conformity pressure and helps calibration; wide disagreement increases confusion and undermines appropriate use of advice. Making AI presentations more human-like raises perceived usefulness and agency for some tasks but does not systematically increase conformity or performance.
Key Points
- Panel size
  - 3-AI panels improved accuracy over a single AI.
  - 5-AI panels produced no further accuracy improvement (diminishing returns; a non-linear relationship).
- Within-panel consensus
  - High consensus across AIs fosters overreliance (informational conformity) and can reduce human corrective behavior.
  - One dissenting opinion in an otherwise unanimous panel reduces conformity pressure and promotes more appropriate reliance.
  - Strong disagreement (wide spread) increases cognitive load and confusion, lowering effective reliance and performance.
- Human-likeness / presentation style
  - A human-like tone or persona increased subjective usefulness and sense of agency in some tasks.
  - The effect of human-likeness on conformity varied across individuals, but on average it did not worsen conformity or objective performance.
- Practical interpretation: "More AI advisors" is not unambiguously better. The number of advisors, the distribution of their opinions, and the presentation format jointly determine whether multi-AI advice improves or undermines decision quality.
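As a point of comparison for the panel-size finding, the sketch below computes how accurate a simple majority vote would be if every advisor were right 70% of the time and advisors erred independently. The independence assumption is an idealization introduced here, not a claim from the study, and the function name `majority_accuracy` is hypothetical. The numbers describe the panel's vote itself, not the accuracy of a human who reads the advice.

```python
from math import comb


def majority_accuracy(n_advisors: int, p_correct: float) -> float:
    """Probability that a strict majority of the panel is correct.

    Assumption (not from the study): each advisor is correct independently
    with probability p_correct, and the panel size is odd so ties cannot occur.
    """
    if n_advisors % 2 == 0:
        raise ValueError("use an odd panel size so that ties cannot occur")
    needed = n_advisors // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_advisors, k) * p_correct**k * (1 - p_correct) ** (n_advisors - k)
        for k in range(needed, n_advisors + 1)
    )


if __name__ == "__main__":
    for n in (1, 3, 5):
        print(n, round(majority_accuracy(n, 0.70), 3))
    # prints:
    # 1 0.7
    # 3 0.784
    # 5 0.837
```

Under this idealized baseline the vote keeps improving from three to five advisors, so the plateau in human accuracy reported above is not something the baseline alone would predict. If advisors' errors were perfectly correlated instead, a panel of any size would be no more accurate than a single advisor.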
Data & Methods
- Design
  - Two studies (with between- and within-subject elements) and a total of N = 348 participants (260 in Study 1; 88 in Study 2).
  - Tasks: Income (UCI Adult), Recidivism (COMPAS), and Dating (speed-dating dataset), all binary prediction tasks chosen to be moderately difficult (human-only accuracy ~60–70%).
  - Procedure: Judge-Advisor System (JAS). Participants give an initial prediction and confidence, receive AI advice with natural-language explanations and attention checks, then give a final prediction and confidence.
- Manipulations
  - Study 1: Panel size varied between participants (1, 3, or 5 AIs); within-panel consensus varied across trials (full consensus, single dissent, wide disagreement).
  - Study 2: Fixed 3-AI panels; varied anthropomorphic/human-like presentation of AI explanations/personas.
- AI advisors & explanations
  - Advisors were simulated by sampling individual decision trees from a random-forest Rashomon set calibrated to 70% accuracy (each selected tree had 35/50 correct on test cases).
  - Explanations were generated by extracting the top-3 SHAP features per tree and converting them into concise natural-language explanations via GPT-4o (LLM-in-the-loop explanation pipeline).
  - In multi-AI conditions, models were sampled per participant to create natural variation in predicted labels and rationales.
- Measures & quality control
  - Primary outcomes: change in accuracy from initial to final decision (reliance/adjustment), confidence, subjective measures (perceived usefulness, agency), and conformity indicators.
  - Attention checks and response-quality screening; participants were restricted to adults in Asia fluent in Japanese; external AI use was not allowed.
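The outcome measures listed above can be illustrated with a minimal sketch of how JAS trial records might be scored. The `Trial` fields, the `summarize` function, and the specific definitions of "switch" and "harmful switch" are assumptions made for illustration; the summary does not state the exact operationalization used in the studies.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trial:
    truth: int         # ground-truth label (0 or 1)
    initial: int       # participant's prediction before seeing advice
    final: int         # participant's prediction after seeing advice
    advice: List[int]  # one predicted label per AI advisor (odd panel size)


def majority(advice: List[int]) -> int:
    """Majority label of a binary, odd-sized panel."""
    return int(sum(advice) * 2 > len(advice))


def summarize(trials: List[Trial]) -> dict:
    n = len(trials)
    initial_acc = sum(t.initial == t.truth for t in trials) / n
    final_acc = sum(t.final == t.truth for t in trials) / n

    # Trials where the participant initially disagreed with the panel majority.
    disagreed = [t for t in trials if t.initial != majority(t.advice)]
    # Assumed conformity indicator: the final answer moves to the majority.
    switched = [t for t in disagreed if t.final == majority(t.advice)]
    # Assumed overreliance indicator: switching to a majority that is wrong.
    harmful = [t for t in switched if majority(t.advice) != t.truth]

    return {
        "initial_accuracy": initial_acc,
        "final_accuracy": final_acc,
        "accuracy_change": final_acc - initial_acc,
        "switch_rate": len(switched) / len(disagreed) if disagreed else 0.0,
        "harmful_switch_rate": len(harmful) / len(disagreed) if disagreed else 0.0,
    }


# Made-up example trials, for illustration only.
example = [
    Trial(truth=1, initial=0, final=1, advice=[1, 1, 1]),  # helpful switch
    Trial(truth=0, initial=0, final=1, advice=[1, 1, 1]),  # harmful switch
    Trial(truth=1, initial=1, final=1, advice=[1, 1, 0]),  # already agreed
    Trial(truth=0, initial=0, final=0, advice=[1, 0, 1]),  # held own answer
]
print(summarize(example))
```

Separating harmful switches from all switches matters because a high switch rate alone does not distinguish appropriate reliance from overreliance.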
Implications for AI Economics
- Product design and platform strategy
  - Optimal product offerings: platforms should not assume more models always add value. Small curated panels (e.g., 3 diverse models) may maximize user accuracy and satisfaction while minimizing cognitive cost.
  - Interface design matters economically: presenting one dissenting opinion or highlighting minority views can be a low-cost design lever to reduce harmful conformity and improve decision quality.
  - Aggregation vs. plural presentation: naive display of many model outputs can create information overload or herding; platforms should offer summarized statistics (majority vote, calibrated uncertainty) alongside curated dissent.
- Market structure and model differentiation
  - Value of true diversity: economic welfare gains from a marketplace of models depend on model independence. If many providers are trained on similar data, apparent consensus can create systemic herding risk (correlated errors) and mislead users.
  - Product heterogeneity matters: incentives for model providers to differentiate (methodologically or data-wise) can increase the social value of multi-AI advice, but unchecked proliferation without meaningful diversity yields little marginal benefit.
- Consumer behavior, adoption, and willingness to pay
  - Human-likeness increases perceived usefulness and agency, which may raise adoption or willingness to pay even when objective accuracy is unaffected, implying that firms can monetize presentation/UX separately from model performance.
  - However, increased perceived usefulness may encourage more reliance; regulators and firms should ensure that perception aligns with calibrated performance to avoid overtrust.
- Welfare, externalities, and regulation
  - Herding risk: high apparent consensus among deployed models can induce conformity and amplify errors at scale (systemic risk). Regulators should consider disclosure of model provenance, correlation, and uncertainty.
  - Mandatory or recommended UI features (e.g., visible minority opinions, uncertainty bands, or model provenance labels) could mitigate conformity externalities and improve aggregate decision quality.
  - Procurement and evaluation: institutions buying multi-AI decision support should evaluate panels for diversity and robustness, not just raw model count or averaged accuracy.
- Pricing and bundling
  - Bundles of complementary (diverse) models are more valuable than bundles of many similar ones. Pricing strategies should reflect marginal value of additional models (diminishing returns) and the cost of potential conformity harms.
- Labor-market & competition effects
  - Multi-AI setups that appear more persuasive (human-like) may displace certain advisory roles faster. Conversely, systems that provide curated dissent could support human experts by improving calibration.
- Research & measurement implications for economists
  - When modeling information aggregation markets, incorporate behavioral conformity and cognitive cost parameters: the social value of an additional advisor is endogenous to users’ conformity responses and interface design.
  - Policy evaluations (e.g., mandated labelings, audit regimes) should account for how presentation and within-panel consensus shape human users’ reliance, not only model accuracy metrics.
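The interface suggestion above (summary statistics shown alongside visible dissent) can be sketched as a small aggregation step that a platform might run before rendering a panel. This is an illustrative sketch only: the `Advice` record, the `summarize_panel` function, and the example labels are hypothetical, and the raw vote share computed here is a simple proportion, not a calibrated uncertainty estimate.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class Advice:
    advisor: str      # display name of the model (hypothetical)
    label: str        # predicted class
    explanation: str  # short natural-language rationale


def summarize_panel(panel: List[Advice]) -> dict:
    """Return the majority label, its vote share, and any dissenting advice."""
    counts = Counter(a.label for a in panel)
    # With an odd-sized binary panel there is always a strict majority.
    majority_label, majority_votes = counts.most_common(1)[0]
    dissenters = [a for a in panel if a.label != majority_label]
    return {
        "majority_label": majority_label,
        "vote_share": majority_votes / len(panel),
        "unanimous": not dissenters,
        # Keep minority rationales so the UI can surface them explicitly.
        "dissent": [(a.advisor, a.label, a.explanation) for a in dissenters],
    }


# Made-up panel, for illustration only.
panel = [
    Advice("Advisor A", "high", "Long work hours and advanced education."),
    Advice("Advisor B", "high", "Occupation and age point to higher income."),
    Advice("Advisor C", "low", "Reported capital gains are zero."),
]
print(summarize_panel(panel))
```

Returning the dissenting rationales as a separate field lets the interface decide how prominently to show them, instead of burying them in a flat list of model outputs.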
Caveats
- Laboratory/crowdsourced setting with simulated decision trees and LLM-generated explanations; external validity to high-stakes real-world settings needs further field work.
- Participant pool, language, and task scope (three binary tasks) limit generalizability; model-training correlations in real markets may be stronger than simulated here.
- The AIs were calibrated to ~70% accuracy; results could differ with much stronger or weaker models.
Overall takeaway for AI economics: count of AI advisors is an imperfect proxy for decision quality. Market and platform designers should focus on curated diversity, dissent visibility, calibrated explanations, and interface features that mitigate conformity to capture the genuine economic value of multi-AI advice.
Assessment
Claims (7)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| Accuracy improved for small panels relative to a single AI. (Decision Quality) | positive | high | accuracy | 0.6 |
| Larger panels yielded no gains in accuracy relative to a single AI. (Decision Quality) | null_result | high | accuracy | 0.6 |
| High within-panel consensus fostered overreliance on AI advice. (Decision Quality) | negative | high | reliance on AI advice (overreliance) | 0.6 |
| A single dissent within a panel reduced pressure to conform. (Decision Quality) | positive | high | pressure to conform / reliance on AI advice | 0.6 |
| Wide disagreement among AIs created confusion and undermined appropriate reliance on advice. (Decision Quality) | negative | high | appropriate reliance on advice / decision-making | 0.6 |
| Human-like presentations increased perceived usefulness and agency in certain tasks. (Worker Satisfaction) | positive | high | perceived usefulness and perceived agency | 0.6 |
| Human-like presentations did not raise conformity pressure. (Decision Quality) | null_result | high | conformity pressure | 0.6 |