Information Aggregation with AI Agents

Can Large Language Models (AI agents) aggregate dispersed private information through trading and reason about the knowledge of others by observing price movements? We conduct a controlled experiment where AI agents trade in a prediction market after receiving private signals, measuring information aggregation by the log error of the last price. We find that although the median market is effective at aggregating information in the easy information structures, increasing the complexity has a significant and negative impact, suggesting that AI agents may suffer from the same limitations as humans when reasoning about others. Consistent with our theoretical predictions, information aggregation remains unaffected by allowing cheap talk communication, changing the duration of the market or initial price, and strategic prompting, thus demonstrating that prediction markets are robust. We establish that "smarter" AI agents perform better at aggregation and are more profitable. Surprisingly, giving them feedback about past performance makes them worse at aggregation and reduces their profits.

Summary

Main Finding

AI agents (LLMs acting as traders) can effectively aggregate dispersed private information in simple prediction-market information structures, but their ability collapses as the required interactive reasoning (higher-order beliefs about others’ signals) becomes more complex. Prediction markets themselves are robust to communication, market duration, initial price, and strategic prompting, but AI agents exhibit systematic limitations: they hoard information, sometimes deceive, and follow an end‑game “truth revelation” pattern. Counterintuitively, providing them with qualitative feedback about past markets worsens both aggregation and profits.

Key Points

Core result: Median final market price is informative in easy/medium structures but degrades sharply with complexity:
- When true value = 1: median price ≈ 0.91 overall;
- Easy/medium: price ≈ 1 (near-perfect aggregation);
- Hard: median price ≈ 0.73 (partial aggregation);
- Very hard (muddy-children–style): median price ≈ 0.50 (uninformative).
Theoretical expectation (separable securities) predicts aggregation in equilibrium; empirical rejection implies AI agents lack sufficient interactive reasoning in harder structures.
Robust null effects: allowing cheap talk, changing initial price (0.3/0.5/0.7), varying market duration (3/6/9 rounds), and prompting agents to be strategic vs myopic did not significantly change information aggregation.
Intelligence effects:
- Higher individual intelligence (measured by an aggregate AI capability index) raises an agent’s profits.
- Higher average group intelligence reduces the frequency of catastrophic mispricing (tail errors), but has limited effect on median-market accuracy.
- Individual profits fall as group average intelligence rises (consistent with stronger competition).
Feedback paradox: giving subsequent agents empirical summaries of prior markets (qualitative results) significantly worsened aggregation and reduced profits.
Communication behavior:
- Agents heavily hoard private reasoning; public messages are shorter and semantically detached from private messages in 94% of markets.
- Agents exhibit deliberate deception/withholding early, with sharp truth-revelation spikes at terminal rounds (sawtooth pattern at rounds 3, 6, 9).
- Making agents aware of prior outcomes increased adversarial communication (larger word gaps, more direct deception).
Robustness: re-running experiments in April 2026 with frontier models (GPT-5.4, Claude 4.6 Opus, Gemini 3.1 Pro) produced the same failure in the very hard structure (median price ≈ 0.5); frontier models did not uniformly dominate earlier best performers.

Data & Methods

Experimental environment:
- Sequential prediction markets with 3 agents, trading for 3, 6, or 9 rounds (agents trade on schedules that repeat every 3 rounds).
- Binary security X (Yes pays 1 if outcome Yes); complementary No asset X′.
- Information: three private binary signals (da, db, dc), 8 possible states; four information/payoff structures (easy → very hard; the hardest analogous to the muddy-children puzzle).
- Securities are separable (theoretically implies aggregation under rational, sophisticated traders).
Market mechanism:
- Logarithmic Market Scoring Rule (LMSR) with liquidity parameter β = 0.01.
- Traders endowed with £1,000 each; myopic optimal trade equalizes price with posterior belief.
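The market mechanism above can be illustrated with a minimal sketch of a two-outcome LMSR and the myopic trade that moves the price to a trader's posterior. This is not the authors' implementation. It assumes the textbook cost function C(q) = b · ln(exp(q_yes/b) + exp(q_no/b)); the function names and the value b = 100 are hypothetical, and this summary does not state whether the source's β corresponds to b or to its inverse.

```python
import math

def lmsr_cost(q_yes: float, q_no: float, b: float) -> float:
    """Textbook LMSR cost C(q) = b * ln(exp(q_yes/b) + exp(q_no/b))."""
    m = max(q_yes, q_no)  # factor out the max for numerical stability
    return m + b * math.log(math.exp((q_yes - m) / b) + math.exp((q_no - m) / b))

def lmsr_price(q_yes: float, q_no: float, b: float) -> float:
    """Instantaneous price of the Yes asset (a logistic function of the share gap)."""
    z = (q_yes - q_no) / b
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def myopic_trade(q_yes: float, q_no: float, b: float, belief: float):
    """Yes shares to buy (negative means sell) so that the price equals `belief`.

    Returns (delta_yes, cost_paid). Requires 0 < belief < 1.
    """
    target_gap = b * math.log(belief / (1.0 - belief))  # q_yes - q_no at price == belief
    delta = target_gap - (q_yes - q_no)
    cost = lmsr_cost(q_yes + delta, q_no, b) - lmsr_cost(q_yes, q_no, b)
    return delta, cost

# Illustrative values only: b = 100 is an assumption, not the paper's setting.
b = 100.0
delta, cost = myopic_trade(0.0, 0.0, b, belief=0.7)
new_price = lmsr_price(delta, 0.0, b)  # equals the trader's belief, 0.7
```

Starting from equal share counts the price is 0.5; a trader whose posterior is 0.7 buys Yes shares until the price reaches 0.7, which is the sense in which the myopic optimal trade equalizes price with belief.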
Outcome measures:
- Primary: log scoring error (negative log-likelihood or log-loss) between true outcome y and final price p: −[y ln p + (1−y) ln(1−p)], with p truncated to [ε, 1−ε].
- Secondary: trade volumes, individual profits.
- Communication metrics: cosine similarity between private/public messages, word-gap (length difference), and an AI-judged measure of public-message informativeness about a trader’s signal.
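As a worked illustration of the primary outcome measure, the sketch below computes the log error −[y ln p + (1−y) ln(1−p)] with the price truncated to [ε, 1−ε]. The function name and the default ε = 1e-6 are assumptions; this summary does not report the truncation value used.

```python
import math

def log_error(y: int, p: float, eps: float = 1e-6) -> float:
    """Log scoring error of final price p against outcome y in {0, 1}.

    eps is an assumed truncation bound; it keeps the error finite at p = 0 or 1.
    """
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

# Median final prices reported above for markets whose true value is 1.
errors = {p: log_error(1, p) for p in (0.91, 0.73, 0.50)}
```

With a true value of 1, the reported median prices of 0.91, 0.73, and 0.50 correspond to log errors of roughly 0.09, 0.31, and 0.69 (ln 2), so an uninformative price of 0.5 scores about 0.69 regardless of the outcome.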
Agents and treatment variation:
- LLMs used (initial wave): Claude Haiku 3.5 & 4.5, Gemini 2.5 & 3 Flash, GPT-4o, GPT‑5 mini, gemma3:4b, qwen3:8b.
- Teams: twelve teams of three (eight homogeneous teams, four heterogeneous).
- Intelligence measure: Artificial Analysis Intelligence Index (composite of ten evaluation suites).
- Total markets: 1,772 in initial runs; an additional 1,728 markets when agents were given prior-wave empirical summaries (feedback treatment); 576 robustness markets with frontier models in April 2026.
Analysis:
- Descriptive medians and distributions across treatments.
- Quantile regressions (median and bottom-20% tail) to assess intelligence and treatment effects on log error.
- Communication strategy quantified via semantic similarity, word gaps, and AI adjudication of deception/information revelation.
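The two simplest communication metrics can be sketched as follows. This is an illustrative stand-in only: it computes cosine similarity over bag-of-words counts, whereas the study's semantic similarity is presumably embedding-based, and the tokenizer, function names, and example messages are all hypothetical.

```python
import math
import re
from collections import Counter

def tokens(text: str) -> list:
    """Lowercase word tokens (a crude tokenizer chosen for illustration)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors of two messages."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0

def word_gap(private: str, public: str) -> int:
    """Length difference in words: private reasoning minus public message."""
    return len(tokens(private)) - len(tokens(public))

# Hypothetical messages, invented for this example.
private_msg = "my signal is one so buy"
public_msg = "hold"
similarity = cosine_similarity(private_msg, public_msg)
gap = word_gap(private_msg, public_msg)
```

A low similarity combined with a large positive word gap is the pattern described above as hoarding: the public message is both shorter than and semantically detached from the private reasoning.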

Implications for AI Economics

Limits of interactive reasoning: LLM-based agents may fail to form higher-order beliefs needed for full information aggregation even in small markets (3 agents). Evaluation of LLMs for economic deployment should include interactive-reasoning benchmarks, not only single-agent reasoning tasks.
Robustness of prices, not of agents: Prediction markets (LMSR) remain robust signals under many manipulations (cheap talk, initial price, duration). Prices can still be the most scalable information channel when many autonomous agents interact, since public communication is often unreliable.
Risk from complex information structures: Markets with payoff rules or information partitions that demand deep recursive reasoning are vulnerable to mispricing when traders are LLM agents — regulators and designers should be cautious in environments where outcomes depend on complex joint-information patterns.
Adversarial communication and strategic hoarding: Agents systematically hoard information and use end-game revelation strategies, which can slow correct aggregation or be exploited. Market designers might consider mechanisms that reduce incentives to hoard (e.g., longer horizons, reputation systems, or commitment devices), though the experiment shows simple horizon lengthening had limited effect.
Feedback and training: Providing qualitative summaries of past market performance can backfire. Careful design of learning/feedback loops is required before deploying adaptive agent populations; naive feedback may increase adversarial behaviors and confusion.
Competition and distributional effects: More capable agents earn higher individual profits, but higher average group intelligence intensifies competition and reduces individual rents. Policymakers should expect changing profit dynamics as agent sophistication evolves.
Research directions and policy suggestions:
- Develop and include interactive-reasoning benchmarks in model evaluation suites.
- Test alternative market mechanisms and information-elicitation designs that are robust to limited recursive reasoning (e.g., mechanisms that directly reward truthful revelation or reduce strategic withholding).
- Study human–AI mixed markets, larger populations, repeated learning with controlled feedback, and interventions to train agents on higher-order inference.
- Monitor for manipulation or coordinated hoarding when deploying autonomous traders; consider regulatory limits, auditing, and transparency requirements for agent behavior and training data.

Summary takeaway: prediction markets can remain useful with autonomous LLM traders, but designers and policymakers must account for systematic bounds on agents’ interactive reasoning, adversarial communication strategies, and potentially harmful effects of naive feedback — especially in environments that require deep recursive inference.

Assessment

Paper Type: other
Evidence Strength: medium. The design provides strong internal validity for how the chosen AI agents behave in the specified market environments (randomized treatments, pre-registered metrics, and multiple robustness checks), but results rest on simulated agents, specific LLM versions/prompts, simplified market primitives and information structures, limiting external validity for real-world markets or heterogeneous AI deployments.
Methods Rigor: medium. The paper uses a clear experimental protocol, a theoretically motivated metric (log error of last price), and several robustness checks (cheap talk, duration, initial price, prompting), and connects findings to theoretical predictions; however, rigor is limited by dependence on a narrow set of agent implementations, potential sensitivity to prompting/temperature/hyperparameters, and simplified incentives and market settings that may omit key real-world frictions.
Sample: Multiple simulated prediction-market sessions populated by AI agents implemented with large language models; each agent receives privately drawn signals from designed information structures (classified as 'easy' or 'complex'); treatments include presence/absence of cheap talk, market duration, initial price, strategic prompting, and feedback on past performance; primary outcomes are log error of the final market price (information aggregation) and agent profits.
Themes: human_ai_collab, governance
Identification: Controlled laboratory-style experiments using simulated AI agents (LLM-based) trading in stylized prediction markets; causal effects identified by randomizing information-structure complexity and experimental treatments (allowing cheap talk, varying market duration and initial price, strategic prompting, giving feedback), and comparing outcomes (log error of final price, agent profits) across treatment arms and agent ability levels.
Generalizability:
- Simulated AI agents (specific LLMs and prompts) may not represent other model families, sizes, or future versions
- Simplified prediction-market design and incentive structure differ from real financial markets or large-scale forecasting platforms
- Limited range of information structures; real-world information may be higher-dimensional and dynamic
- No human–AI mixed markets tested, so results may not transfer to human traders interacting with AI
- Behavior may be sensitive to prompt engineering, model temperature, and training-data overlap with task content

Claims (12)

Claim	Category	Direction	Confidence	Outcome	Details
The median market is effective at aggregating information in the easy information structures.	Decision Quality	positive	high	information aggregation (log error of the last price)	0.12
Increasing the complexity of the information structure has a significant and negative impact on information aggregation, suggesting AI agents may suffer from the same limitations as humans when reasoning about others.	Decision Quality	negative	high	information aggregation (log error of the last price)	0.12
Allowing cheap talk communication does not affect information aggregation.	Decision Quality	null_result	high	information aggregation (log error of the last price)	0.12
Changing the duration of the market does not affect information aggregation.	Decision Quality	null_result	high	information aggregation (log error of the last price)	0.12
Changing the initial price does not affect information aggregation.	Decision Quality	null_result	high	information aggregation (log error of the last price)	0.12
Allowing strategic prompting does not affect information aggregation.	Decision Quality	null_result	high	information aggregation (log error of the last price)	0.12
Prediction markets are robust to cheap talk, market duration, initial price, and strategic prompting.	Decision Quality	positive	high	information aggregation (log error of the last price)	0.12
‘Smarter’ AI agents perform better at information aggregation.	Decision Quality	positive	high	information aggregation (log error of the last price)	0.12
‘Smarter’ AI agents are more profitable.	Other	positive	high	profits (agent-level earnings)	0.12
Providing agents feedback about past performance makes them worse at information aggregation and reduces their profits.	Decision Quality	negative	high	information aggregation (log error of the last price) and profits	0.12
The experimental findings are consistent with the paper's theoretical predictions.	Decision Quality	mixed	high	consistency between theoretical predictions and experimental measures (e.g., aggregation as measured by log error)	0.02
We conduct a controlled experiment where AI agents trade in a prediction market after receiving private signals, measuring information aggregation by the log error of the last price.	Other	null_result	high	methodological description (log error of last price used as aggregation metric)	0.2

AI traders reliably pool private signals in simple prediction markets but falter when information structures grow complex; smarter models do better, yet feedback on past performance unexpectedly undermines market accuracy and profits.