AI traders reliably pool private signals in simple prediction markets but falter when information structures grow complex; smarter models do better, yet feedback on past performance unexpectedly undermines market accuracy and profits.
Can Large Language Models (AI agents) aggregate dispersed private information through trading and reason about the knowledge of others by observing price movements? We conduct a controlled experiment where AI agents trade in a prediction market after receiving private signals, measuring information aggregation by the log error of the last price. We find that although the median market is effective at aggregating information in the easy information structures, increasing the complexity has a significant and negative impact, suggesting that AI agents may suffer from the same limitations as humans when reasoning about others. Consistent with our theoretical predictions, information aggregation remains unaffected by allowing cheap talk communication, changing the duration of the market or initial price, and strategic prompting, thus demonstrating that prediction markets are robust. We establish that "smarter" AI agents perform better at aggregation and are more profitable. Surprisingly, giving them feedback about past performance makes them worse at aggregation and reduces their profits.
Summary
Main Finding
AI agents (LLMs acting as traders) can effectively aggregate dispersed private information in simple prediction-market information structures, but their ability collapses as the required interactive reasoning (higher-order beliefs about others’ signals) becomes more complex. Prediction markets themselves are robust to communication, market duration, initial price, and strategic prompting, but AI agents exhibit systematic limitations: they hoard information and sometimes deceive, and they follow an end-game “truth revelation” pattern. Counterintuitively, giving them qualitative feedback about past markets worsens aggregation and reduces profits.
Key Points
- Core result: Median final market price is informative in easy/medium structures but degrades sharply with complexity:
  - When true value = 1: median price ≈ 0.91 overall;
  - Easy/medium: price ≈ 1 (near-perfect aggregation);
  - Hard: median price ≈ 0.73 (partial aggregation);
  - Very hard (muddy-children–style): median price ≈ 0.50 (uninformative).
- Theoretical expectation (separable securities) predicts aggregation in equilibrium; empirical rejection implies AI agents lack sufficient interactive reasoning in harder structures.
- Robust null effects: allowing cheap talk, changing initial price (0.3/0.5/0.7), varying market duration (3/6/9 rounds), and prompting agents to be strategic vs myopic did not significantly change information aggregation.
- Intelligence effects:
  - Higher individual intelligence (measured by an aggregate AI capability index) raises an agent’s profits.
  - Higher average group intelligence reduces the frequency of catastrophic mispricing (tail errors), but has limited effect on median-market accuracy.
  - Individual profits fall as group average intelligence rises (consistent with stronger competition).
- Feedback paradox: giving subsequent agents empirical summaries of prior markets (qualitative results) significantly worsened aggregation and reduced profits.
- Communication behavior:
  - Agents heavily hoard private reasoning; public messages are shorter and semantically detached from private messages in 94% of markets.
  - Agents exhibit deliberate deception/withholding early, with sharp truth-revelation spikes at terminal rounds (sawtooth pattern at rounds 3, 6, 9).
  - Making agents aware of prior outcomes increased adversarial communication (larger word gaps, more direct deception).
- Robustness: re-running experiments in April 2026 with frontier models (GPT-5.4, Claude 4.6 Opus, Gemini 3.1 Pro) produced the same failure in the very hard structure (median price ≈ 0.5); frontier models did not uniformly dominate earlier best performers.
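To put the median prices above on the scale of the study's primary outcome, the log scoring error −[y ln p + (1−y) ln(1−p)] can be evaluated directly. The snippet below is a minimal sketch: the function name `log_error` and the truncation constant `eps = 1e-6` are assumptions for illustration, since the summary does not state the ε that was used.

```python
import math

def log_error(y: int, p: float, eps: float = 1e-6) -> float:
    """Log scoring error of a final price p against a binary outcome y.

    eps is a hypothetical truncation constant; the source only says that
    p is truncated to [eps, 1 - eps].
    """
    p = min(max(p, eps), 1.0 - eps)  # keep both logarithms finite
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

# Median final prices reported above for markets whose true value is 1.
median_prices = {"overall": 0.91, "hard": 0.73, "very hard": 0.50}

for label, price in median_prices.items():
    print(f"{label:>9}: price={price:.2f}  log error={log_error(1, price):.3f}")
```

A price of 0.50 gives an error of ln 2 ≈ 0.693 whichever outcome occurs, which is why the very hard structure is described as uninformative.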
Data & Methods
- Experimental environment:
  - Sequential prediction markets with 3 agents, trading for 3, 6, or 9 rounds (agents trade on schedules that repeat every 3 rounds).
  - Binary security X (Yes pays 1 if outcome Yes); complementary No asset X′.
  - Information: three private binary signals (da, db, dc), 8 possible states; four information/payoff structures (easy → very hard; the hardest analogous to the muddy-children puzzle).
  - Securities are separable (theoretically implies aggregation under rational, sophisticated traders).
- Market mechanism:
  - Logarithmic Market Scoring Rule (LMSR) with liquidity parameter β = 0.01.
  - Traders endowed with £1,000 each; myopic optimal trade equalizes price with posterior belief.
- Outcome measures:
  - Primary: log scoring error (negative log-likelihood or log-loss) between true outcome y and final price p: −[y ln p + (1−y) ln(1−p)], with p truncated to [ε, 1−ε].
  - Secondary: trade volumes, individual profits.
  - Communication metrics: cosine similarity between private/public messages, word-gap (length difference), and an AI-judged measure of public-message informativeness about a trader’s signal.
- Agents and treatment variation:
  - LLMs used (initial wave): Claude Haiku 3.5 & 4.5, Gemini 2.5 & 3 Flash, GPT-4o, GPT‑5 mini, gemma3:4b, qwen3:8b.
  - Teams: twelve teams of three (eight homogeneous teams, four heterogeneous).
  - Intelligence measure: Artificial Analysis Intelligence Index (composite of ten evaluation suites).
  - Total markets: 1,772 in initial runs; an additional 1,728 markets when agents were given prior-wave empirical summaries (feedback treatment); 576 robustness markets with frontier models in April 2026.
- Analysis:
  - Descriptive medians and distributions across treatments.
  - Quantile regressions (median and bottom-20% tail) to assess intelligence and treatment effects on log error.
  - Communication strategy quantified via semantic similarity, word gaps, and AI adjudication of deception/information revelation.
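The market mechanism listed above can be illustrated with a small sketch of a two-outcome LMSR and the myopic trade that moves the price to a trader's posterior belief. This is an assumption-laden illustration, not the study's implementation: the cost function uses the common form C(q) = b ln(exp(q_yes/b) + exp(q_no/b)), the function names are hypothetical, the value `b = 100.0` is chosen only for readability (how it maps onto the reported β = 0.01 is not stated in this summary), and the £1,000 budget is assumed not to bind.

```python
import math

def lmsr_cost(q_yes: float, q_no: float, b: float) -> float:
    """LMSR cost C(q) = b * ln(exp(q_yes/b) + exp(q_no/b)), computed stably."""
    m = max(q_yes, q_no)  # factor out the larger term to avoid overflow
    return m + b * math.log(math.exp((q_yes - m) / b) + math.exp((q_no - m) / b))

def lmsr_price_yes(q_yes: float, q_no: float, b: float) -> float:
    """Instantaneous Yes price: the logistic function of (q_yes - q_no) / b."""
    z = (q_yes - q_no) / b
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def myopic_trade(q_yes: float, q_no: float, b: float, belief: float):
    """Return (delta, cost): Yes shares to buy so the price equals `belief`.

    A negative delta means selling Yes. Budget limits are ignored here.
    """
    target_gap = b * math.log(belief / (1.0 - belief))  # b * logit(belief)
    delta = target_gap - (q_yes - q_no)
    cost = lmsr_cost(q_yes + delta, q_no, b) - lmsr_cost(q_yes, q_no, b)
    return delta, cost

# Illustrative values only; b is not taken from the source.
b = 100.0
q_yes, q_no = 0.0, 0.0  # empty market, so the starting price is 0.5
delta, cost = myopic_trade(q_yes, q_no, b, belief=0.9)
new_price = lmsr_price_yes(q_yes + delta, q_no, b)
print(f"buy {delta:.2f} Yes shares for {cost:.2f}; new price = {new_price:.3f}")
```

Because the trade sets the price equal to the trader's belief, the sequence of prices can be read as a sequence of posteriors, provided each trader updates correctly on the earlier price moves. That updating step is what the harder information structures put under strain.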
Implications for AI Economics
- Limits of interactive reasoning: LLM-based agents may fail to form higher-order beliefs needed for full information aggregation even in small markets (3 agents). Evaluation of LLMs for economic deployment should include interactive-reasoning benchmarks, not only single-agent reasoning tasks.
- Robustness of prices, not of agents: Prediction markets (LMSR) remain robust signals under many manipulations (cheap talk, initial price, duration). Prices can still be the most scalable information channel when many autonomous agents interact, since public communication is often unreliable.
- Risk from complex information structures: Markets with payoff rules or information partitions that demand deep recursive reasoning are vulnerable to mispricing when traders are LLM agents — regulators and designers should be cautious in environments where outcomes depend on complex joint-information patterns.
- Adversarial communication and strategic hoarding: Agents systematically hoard information and use end-game revelation strategies, which can slow correct aggregation or be exploited. Market designers might consider mechanisms that reduce incentives to hoard (e.g., longer horizons, reputation systems, or commitment devices), though the experiment shows simple horizon lengthening had limited effect.
- Feedback and training: Providing qualitative summaries of past market performance can backfire. Careful design of learning/feedback loops is required before deploying adaptive agent populations; naive feedback may increase adversarial behaviors and confusion.
- Competition and distributional effects: More capable agents earn higher individual profits, but higher average group intelligence intensifies competition and reduces individual rents. Policymakers should expect changing profit dynamics as agent sophistication evolves.
- Research directions and policy suggestions:
  - Develop and include interactive-reasoning benchmarks in model evaluation suites.
  - Test alternative market mechanisms and information-elicitation designs that are robust to limited recursive reasoning (e.g., mechanisms that directly reward truthful revelation or reduce strategic withholding).
  - Study human–AI mixed markets, larger populations, repeated learning with controlled feedback, and interventions to train agents on higher-order inference.
  - Monitor for manipulation or coordinated hoarding when deploying autonomous traders; consider regulatory limits, auditing, and transparency requirements for agent behavior and training data.
Summary takeaway: prediction markets can remain useful with autonomous LLM traders, but designers and policymakers must account for systematic bounds on agents’ interactive reasoning, adversarial communication strategies, and potentially harmful effects of naive feedback — especially in environments that require deep recursive inference.
Assessment
Claims (12)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| The median market is effective at aggregating information in the easy information structures. [Decision Quality] | positive | high | information aggregation (log error of the last price) | 0.12 |
| Increasing the complexity of the information structure has a significant and negative impact on information aggregation, suggesting AI agents may suffer from the same limitations as humans when reasoning about others. [Decision Quality] | negative | high | information aggregation (log error of the last price) | 0.12 |
| Allowing cheap talk communication does not affect information aggregation. [Decision Quality] | null_result | high | information aggregation (log error of the last price) | 0.12 |
| Changing the duration of the market does not affect information aggregation. [Decision Quality] | null_result | high | information aggregation (log error of the last price) | 0.12 |
| Changing the initial price does not affect information aggregation. [Decision Quality] | null_result | high | information aggregation (log error of the last price) | 0.12 |
| Allowing strategic prompting does not affect information aggregation. [Decision Quality] | null_result | high | information aggregation (log error of the last price) | 0.12 |
| Prediction markets are robust to cheap talk, market duration, initial price, and strategic prompting. [Decision Quality] | positive | high | information aggregation (log error of the last price) | 0.12 |
| ‘Smarter’ AI agents perform better at information aggregation. [Decision Quality] | positive | high | information aggregation (log error of the last price) | 0.12 |
| ‘Smarter’ AI agents are more profitable. [Other] | positive | high | profits (agent-level earnings) | 0.12 |
| Providing agents feedback about past performance makes them worse at information aggregation and reduces their profits. [Decision Quality] | negative | high | information aggregation (log error of the last price) and profits | 0.12 |
| The experimental findings are consistent with the paper's theoretical predictions. [Decision Quality] | mixed | high | consistency between theoretical predictions and experimental measures (e.g., aggregation as measured by log error) | 0.02 |
| We conduct a controlled experiment where AI agents trade in a prediction market after receiving private signals, measuring information aggregation by the log error of the last price. [Other] | null_result | high | methodological description (log error of last price used as aggregation metric) | 0.2 |