Language-model traders replicate human-like mispricing and trading patterns in simulated auctions, and simple prompt tweaks reliably amplify or dampen market bubbles. The results suggest that policymakers and designers can causally steer AI-driven market behavior, though the findings are model- and setting-specific.
We study how AI agents form expectations and trade in experimental asset markets. Using a simulated open-call auction populated by autonomous Large Language Model (LLM) agents, we document three main findings. First, AI agents exhibit classic behavioral patterns: a pronounced disposition effect and recency-weighted extrapolative beliefs. Second, these individual-level patterns aggregate into equilibrium dynamics that replicate classic experimental findings (Smith et al., 1988), including the predictive power of excess demand for future prices and the positive relationship between disagreement and trading volume. Third, by analyzing the agents' reasoning text through a twenty-mechanism scoring framework, we show that targeted prompt interventions causally amplify or suppress specific behavioral mechanisms, significantly altering the magnitude of market bubbles.
Summary
Main Finding
AI traders built on large language models (LLMs), when placed in a simulated Smith-style open-call asset market, reproduce classic patterns from human behavioral finance, most notably a strong disposition effect and recency-weighted extrapolative expectations. These micro-level heuristics aggregate into familiar macro-level dynamics: excess demand predicts price paths, and disagreement drives trading volume. The authors also read agents’ Chain-of-Thought text to identify cognitive mechanisms and show that targeted prompt interventions causally amplify or suppress specific biases, materially changing bubble magnitudes. This points to prompt-level “cognitive guardrails” as a potential policy tool.
Key Points
- Behavioral regularities
- Disposition effect: LLM agents are more likely to sell when holding unrealized gains (statistically significant), mirroring human investor behavior.
- Extrapolative expectations: Agents overweight recent returns when forecasting future prices (recency bias / momentum extrapolation).
- Tight belief-action coupling: Unlike in many human-subject studies, agents’ stated forecasts map closely into executed trades (fewer human frictions → stronger mapping).
- Equilibrium / market-level replication
- Excess demand (measured via bid-offer gap) predicts future price changes, reproducing classic Smith et al. (1988) experimental results.
- Cross-sectional disagreement in beliefs correlates positively with trading volume (heterogeneous beliefs → more trade).
- Aggregate dynamics (bubbles & crashes) emerge endogenously from the micro heuristics.
- Textual reasoning and mechanism identification
- Agents are required to maintain PLANS.txt and INSIGHTS.txt (persistent CoT memory). Extracted text is consistent with measured beliefs and trades.
- A twenty-mechanism scoring framework applied to reasoning text identifies cognitive drivers active in bubble vs non-bubble episodes (e.g., momentum chasing, speculation, anchoring).
- Causal interventions via prompts
- Targeted prompts designed to suppress specific behavioral mechanisms reduce bubble magnitudes.
- Contrasting prompts that amplify those mechanisms increase bubble formation.
- Interventions demonstrate causal control over aggregate outcomes by modifying agent cognition at the prompt level.
- Contributions
- Opens the “black box” by linking internal LLM reasoning to actions and market outcomes.
- Establishes a micro→macro transmission channel for AI-driven markets.
- Proposes an actionable policy lever (prompt design / cognitive guardrails) for systemic stability.
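
To make the disposition-effect finding concrete, the following is a minimal Python sketch of the two indicator variables the summary describes: a Sell dummy (more shares sold than bought in a period) and a Gain dummy (market price above the agent's weighted-average purchase price). The function names, the average-cost accounting rule for sales, and the cost basis assigned to endowed shares are all illustrative assumptions; the summary does not state how the authors handle them.

```python
def update_average_cost(shares, avg_cost, side, qty, price):
    """Return (shares, avg_cost) after one trade.

    Assumption: average-cost accounting. Buys blend into the running
    weighted-average purchase price; sells reduce holdings but leave
    the average unchanged.
    """
    if qty == 0:
        return shares, avg_cost
    if side == "buy":
        total_cost = shares * avg_cost + qty * price
        shares += qty
        return shares, total_cost / shares
    return shares - qty, avg_cost


def gain_dummy(price, avg_cost):
    """1 if the position shows an unrealized gain, else 0."""
    return int(price > avg_cost)


def sell_dummy(shares_sold, shares_bought):
    """1 if the agent was a net seller in the period, else 0."""
    return int(shares_sold > shares_bought)


# Hypothetical example: 4 endowed shares carried at an assumed basis of 14,
# then a purchase of 2 shares at 20.
shares, avg = update_average_cost(4, 14.0, "buy", 2, 20.0)
print(shares, avg, gain_dummy(18, avg), sell_dummy(3, 1))
```

A disposition effect would then show up as a positive coefficient on the Gain dummy when the Sell dummy is regressed on it, with the control variable named in the Data & Methods section.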
Data & Methods
- Experimental environment
- Multi-period open-call auction (adapted from Smith et al., 1988): 3 practice periods + T = 20 main periods. No short-selling or margin.
- Two assets: cash (risk-free) and one risky asset with per-period dividend drawn iid from {0.4, 1.0} (each with prob. 0.5); E[D]=0.7. Risk-free rate r=5% → constant fundamental value FV = E[D]/r = 14 (terminal buyout implemented to keep FV constant).
- Endowment: each agent starts with 100 cash and 4 shares.
- Agent architecture and population
- Agents are autonomous instances of frontier LLMs; 14 different models are used. Experiments include Single-Model Markets (20 agents of the same LLM) and Mixed-Model Markets (24 agents split evenly across two LLMs).
- Agents instructed to maximize earnings; receive full market-state JSON and maintain persistent PLANS.txt and INSIGHTS.txt across rounds (Chain-of-Thought / memory trace).
- Decision pipeline per round
- Agents first generate explicit textual reasoning (PLANS/INSIGHTS), then forecasts and limit orders.
- Forecasts elicited for horizons h ∈ {0, 2, 5, 10} periods ahead. Forecasts constrained to non-negative integers and bounded; agents receive 5 cash units per forecast within ±2.5 units of realized price (incentivization).
- Trading: agents simultaneously submit limit orders (price, quantity). The market clears by call auction at the price that maximizes executable volume V(p) = min(QB(p), QA(p)), where QB(p) and QA(p) are the quantities bid and offered at price p; the clearing price is rounded down. If bids and asks do not overlap, the price is set to the midpoint between the highest bid and the lowest ask.
- Data recorded
- Action & market data: full order books, executed trades, prices P_t, trading volumes, bid-ask spreads, end-of-round portfolios.
- Expectation data: full term structure of forecasts; forecast error E^h_{i,t} = P_{t+h} − f_{i,t}(t+h).
- Textual reasoning: PLANS.txt, INSIGHTS.txt, and JSON thoughts logged and scored.
- Empirical analyses
- Regression tests for disposition effect: Sell dummy (sell>buy in a period) regressed on Gain dummy (price > agent’s weighted-average purchase price), controlling for Average Expectation; Gain dummy positive and highly significant.
- Tests for extrapolation: forecast regressions showing heavier weights on recent returns.
- Macro tests: predictive regressions where bid-offer gap/excess demand forecasts future price changes; cross-sectional dispersion measures tested against trading volume.
- Text scoring: a twenty-mechanism rubric applied to agent reasoning to identify prevalence/intensity of mechanisms across episodes.
- Causal prompt interventions: pre-specified prompts designed to suppress or amplify targeted mechanisms; comparison of bubble magnitudes and market statistics across treatment arms.
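
The market primitives described above (fundamental value, forecast incentive, and call-auction clearing) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation. All function names are hypothetical. The tie-break among volume-maximizing prices (midpoint of the maximizing range, then rounded down), the inclusive ±2.5 forecast band, and leaving the no-overlap midpoint unrounded are assumptions the summary does not settle.

```python
import math
from typing import List, Optional, Tuple

Order = Tuple[float, int]  # (limit price, quantity)


def fundamental_value(dividends, probs, r):
    """FV = E[D] / r, the perpetuity value of the expected dividend."""
    expected_dividend = sum(d * p for d, p in zip(dividends, probs))
    return expected_dividend / r


def forecast_bonus(forecast, realized, tol=2.5, reward=5):
    """Pay `reward` cash units if the forecast is within +/- tol of the price.

    Assumption: the band is inclusive at the boundary.
    """
    return reward if abs(forecast - realized) <= tol else 0


def clear_call_auction(bids: List[Order], asks: List[Order]) -> Tuple[Optional[float], int]:
    """Return (clearing price, executed volume) for one call auction."""
    if not bids or not asks:
        return None, 0  # assumption: no price is formed with an empty side

    def volume(p):
        qb = sum(q for price, q in bids if price >= p)  # demand at price p
        qa = sum(q for price, q in asks if price <= p)  # supply at price p
        return min(qb, qa)

    candidates = sorted({price for price, _ in bids + asks})
    best = max(volume(p) for p in candidates)

    if best == 0:
        # No overlap: midpoint between highest bid and lowest ask.
        highest_bid = max(price for price, _ in bids)
        lowest_ask = min(price for price, _ in asks)
        return (highest_bid + lowest_ask) / 2, 0

    # Assumed tie-break: midpoint of the volume-maximizing prices, rounded down.
    maximizers = [p for p in candidates if volume(p) == best]
    return math.floor((maximizers[0] + maximizers[-1]) / 2), best


# Hypothetical order book: volume is maximized (5 shares) at prices 14 and 15.
bids = [(16, 3), (15, 2), (13, 4)]
asks = [(12, 2), (14, 3), (17, 5)]
print(clear_call_auction(bids, asks))            # (14, 5)
print(fundamental_value([0.4, 1.0], [0.5, 0.5], 0.05))  # approximately 14
```

Evaluating only at submitted limit prices is sufficient here because executable volume can change only at those prices.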
Implications for AI Economics
- AI agents inherit human-like behavioral biases: Because LLMs are trained on human-generated financial text, they internalize cognitive patterns (disposition, extrapolation) that can reproduce human-style market inefficiencies. Model training data therefore matters for aggregate market behavior.
- Observable CoT unlocks causal inference: Requiring persistent textual reasoning allows researchers and regulators to trace and score internal motives, significantly improving transparency relative to black-box algorithmic agents.
- Programmability as a systemic lever: Prompt-level interventions can causally alter collective outcomes (reduce or aggravate bubbles). This suggests a feasible regulator/developer toolkit: enforce or certify “cognitive guardrails” (prompt templates, constraints, or pre-deployment audits) to mitigate destabilizing behaviors.
- Market-design and model heterogeneity matter: Mixed-model interactions can generate complex dynamics; regulator attention should extend beyond single-model behavior to cross-model interactions (e.g., rational vs extrapolative LLM types).
- Deployment caution for real markets
- Even the absence of human frictions (transaction costs, attention limits) does not guarantee rational markets: LLM-driven traders can generate bubbles via learned heuristics.
- Policy should consider not only model internals but also deployment rules (allowed strategies, required reasoning disclosure, forecast incentives).
- Research directions
- Test robustness in richer market settings (continuous-time trading, short-selling, margin, order book microstructure, intertemporal risk preferences).
- Explore how training data composition and fine-tuning strategies alter behavioral patterns.
- Design and evaluate operational regulatory frameworks: prompt-certification, mandatory CoT disclosures, or real-time monitoring using the authors’ scoring framework.
- Limitations (noted / implied)
- Simulated environment is simplified relative to real financial markets (no high-frequency microstructure, no institutional constraints, stylized dividend process).
- Generalizability across LLM architectures and training regimes needs further study.
- Ethical and practical questions about who sets guardrails and how adversarial actors might circumvent prompts remain.
Assessment
Claims (6)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| AI agents exhibit a pronounced disposition effect. [Decision Quality] | positive | high | disposition effect (tendency to sell winners and hold losers) | 0.6 |
| AI agents form recency-weighted extrapolative beliefs (i.e., overweight recent price history when forecasting future prices). [Decision Quality] | positive | high | recency-weighted extrapolative beliefs in price forecasts | 0.6 |
| These individual-level patterns aggregate into equilibrium dynamics that replicate classic experimental findings (Smith et al., 1988), including the predictive power of excess demand for future prices. [Market Structure] | positive | high | predictive power of excess demand for future prices | 0.6 |
| There is a positive relationship between disagreement among agents and trading volume in the simulated markets. [Market Structure] | positive | high | relationship between disagreement (belief dispersion) and trading volume | 0.6 |
| By analyzing agents' reasoning text through a twenty-mechanism scoring framework, targeted prompt interventions causally amplify or suppress specific behavioral mechanisms. [Other] | mixed | high | mechanism scores derived from agents' reasoning text (20-mechanism framework) | 0.6 |
| Targeted prompt interventions significantly alter the magnitude of market bubbles (they can amplify or suppress bubble size). [Market Structure] | mixed | high | magnitude of market bubbles | 0.6 |