Language-model traders replicate human-like mispricing and trading patterns in simulated auctions, and simple prompt tweaks reliably amplify or dampen market bubbles. The results suggest that policymakers and designers can causally steer AI-driven market behavior, though the findings are model- and setting-specific.

Dissecting AI Trading: Behavioral Finance and Market Bubbles

Shumiao Ouyang, Pengfei Sui · April 20, 2026

arXiv · RCT · Medium evidence · Relevance 7/10

Autonomous LLM traders reproduce human-like biases (disposition effect, recency-weighted extrapolation) and generate classic experimental market dynamics; targeted prompt interventions causally modulate specific behavioral mechanisms, amplifying or suppressing market bubbles.

We study how AI agents form expectations and trade in experimental asset markets. Using a simulated open-call auction populated by autonomous Large Language Model (LLM) agents, we document three main findings. First, AI agents exhibit classic behavioral patterns: a pronounced disposition effect and recency-weighted extrapolative beliefs. Second, these individual-level patterns aggregate into equilibrium dynamics that replicate classic experimental findings (Smith et al., 1988), including the predictive power of excess demand for future prices and the positive relationship between disagreement and trading volume. Third, by analyzing the agents' reasoning text through a twenty-mechanism scoring framework, we show that targeted prompt interventions causally amplify or suppress specific behavioral mechanisms, significantly altering the magnitude of market bubbles.

Summary

Main Finding

AI traders built on large language models (LLMs), when placed in a simulated Smith-style open-call asset market, reproduce classic human behavioral finance patterns—most notably a strong disposition effect and recency-weighted extrapolative expectations—and these micro-level heuristics aggregate into familiar macro-level dynamics (predictable excess-demand-driven price paths and disagreement-driven volume). Moreover, the authors can read agents’ Chain-of-Thought text to identify cognitive mechanisms and show that targeted prompt interventions causally amplify or suppress specific biases, materially changing bubble magnitudes — pointing to prompt-level “cognitive guardrails” as a potential policy tool.

Key Points

Behavioral regularities
- Disposition effect: LLM agents are more likely to sell when holding unrealized gains (statistically significant), mirroring human investor behavior.
- Extrapolative expectations: Agents overweight recent returns when forecasting future prices (recency bias / momentum extrapolation).
- Tight belief-action coupling: Unlike many human subject findings, agents’ stated forecasts closely map into executed trades (limited human frictions → stronger mapping).
Equilibrium / market-level replication
- Excess demand (measured via bid-offer gap) predicts future price changes, reproducing classic Smith et al. (1988) experimental results.
- Cross-sectional disagreement in beliefs correlates positively with trading volume (heterogeneous beliefs → more trade).
- Aggregate dynamics (bubbles & crashes) emerge endogenously from the micro heuristics.
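The excess-demand relationship in the bullets above can be illustrated with a one-regressor least-squares fit of next-period price changes on the current bid-offer gap. This is a minimal sketch, not the authors' specification: the gap is assumed here to be total bid quantity minus total offer quantity, the function names are hypothetical, and the numbers are made up so that the slope is easy to check by hand.

```python
def ols_slope_intercept(x, y):
    """Ordinary least squares for y = alpha + beta * x (single regressor)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta


def excess_demand_regression(prices, bid_qty, offer_qty):
    """Regress P[t+1] - P[t] on the period-t bid-offer gap.

    ASSUMPTION: gap = total bid quantity - total offer quantity.
    """
    gap = [b - o for b, o in zip(bid_qty, offer_qty)]
    x = gap[:-1]  # the last period has no next-period price change
    y = [prices[t + 1] - prices[t] for t in range(len(prices) - 1)]
    return ols_slope_intercept(x, y)


# Illustrative (made-up) data, not taken from the paper.
prices = [14, 16, 17, 16, 13]
bid_qty = [10, 8, 4, 2, 5]
offer_qty = [6, 6, 6, 8, 5]
alpha, beta = excess_demand_regression(prices, bid_qty, offer_qty)
```

A positive `beta` corresponds to the pattern described above: periods with more buying interest than selling interest are followed by price increases.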
Textual reasoning and mechanism identification
- Agents are required to maintain PLANS.txt and INSIGHTS.txt (persistent CoT memory). Extracted text is consistent with measured beliefs and trades.
- A twenty-mechanism scoring framework applied to reasoning text identifies cognitive drivers active in bubble vs non-bubble episodes (e.g., momentum chasing, speculation, anchoring).
Causal interventions via prompts
- Targeted prompts designed to suppress specific behavioral mechanisms reduce bubble magnitudes.
- Contrasting prompts that amplify those mechanisms increase bubble formation.
- Interventions demonstrate causal control over aggregate outcomes by modifying agent cognition at the prompt level.
Contributions
- Opens the “black box” by linking internal LLM reasoning to actions and market outcomes.
- Establishes a micro→macro transmission channel for AI-driven markets.
- Proposes an actionable policy lever (prompt design / cognitive guardrails) for systemic stability.

Data & Methods

Experimental environment
- Multi-period open-call auction (adapted from Smith et al., 1988): 3 practice periods + T = 20 main periods. No short-selling or margin.
- Two assets: cash (risk-free) and one risky asset with per-period dividend drawn iid from {0.4, 1.0} (each with prob. 0.5); E[D]=0.7. Risk-free rate r=5% → constant fundamental value FV = E[D]/r = 14 (terminal buyout implemented to keep FV constant).
- Endowment: each agent starts with 100 cash and 4 shares.
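The fundamental-value arithmetic above can be checked in a few lines. The sketch below computes E[D] and FV = E[D]/r exactly with fractions, then rolls values backward from a terminal buyout to show why a buyout equal to FV keeps the value constant over a finite horizon. The timing convention (dividend paid at the end of each period and discounted one period at r) and the buyout amount of 14 are assumptions made for illustration; the summary says only that a terminal buyout keeps FV constant.

```python
from fractions import Fraction


def expected_dividend(outcomes):
    """outcomes: iterable of (dividend, probability) pairs."""
    return sum(value * prob for value, prob in outcomes)


def perpetuity_value(exp_div, rate):
    """FV = E[D] / r, the formula quoted in the text."""
    return exp_div / rate


def backward_values(exp_div, rate, buyout, periods):
    """Roll value back from the terminal buyout.

    ASSUMPTION: FV_t = (E[D] + FV_{t+1}) / (1 + r), with FV after the
    last period equal to the buyout.
    """
    values = [buyout]
    for _ in range(periods):
        values.append((exp_div + values[-1]) / (1 + rate))
    return values[::-1]  # earliest period first


half = Fraction(1, 2)
exp_div = expected_dividend([(Fraction(2, 5), half), (Fraction(1), half)])
rate = Fraction(1, 20)  # r = 5%
fv = perpetuity_value(exp_div, rate)
path = backward_values(exp_div, rate, buyout=fv, periods=20)
```

Here `exp_div` is 7/10 and `fv` is 14, and every entry of `path` equals 14 because (0.7 + 14) / 1.05 = 14.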
Agent architecture and population
- Agents are autonomous instances of frontier LLMs (14 different models used; experiments include Single-Model Markets: 20 agents of same LLM; Mixed-Model Markets: 24 agents split evenly across two LLMs).
- Agents instructed to maximize earnings; receive full market-state JSON and maintain persistent PLANS.txt and INSIGHTS.txt across rounds (Chain-of-Thought / memory trace).
Decision pipeline per round
- Agents first generate explicit textual reasoning (PLANS/INSIGHTS), then forecasts and limit orders.
- Forecasts elicited for horizons h ∈ {0, 2, 5, 10} periods ahead. Forecasts constrained to non-negative integers and bounded; agents receive 5 cash units per forecast within ±2.5 units of realized price (incentivization).
- Trading: agents simultaneously submit limit orders (price, quantity). The market clears by call auction at the price that maximizes executable volume V(p) = min(Q_B(p), Q_A(p)). The clearing price is rounded down; if bids and asks do not overlap, the price is set to the midpoint between the highest bid and the lowest ask.
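The clearing rule in the Trading bullet can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: candidate prices are taken from the submitted limit prices, ties in executable volume go to the lowest candidate price, the no-overlap midpoint is left unrounded, and the function name `clear_call_auction` is hypothetical. The summary does not specify any of these details.

```python
import math


def clear_call_auction(bids, asks):
    """Clear one call auction.

    bids, asks: lists of (limit_price, quantity) tuples.
    Returns (price, executed_volume); price is None if a side is empty.
    """
    if not bids or not asks:
        return None, 0

    best_bid = max(price for price, _ in bids)
    best_ask = min(price for price, _ in asks)
    if best_bid < best_ask:
        # No overlap: midpoint of highest bid and lowest ask, no trades.
        return (best_bid + best_ask) / 2, 0

    # ASSUMPTION: search only over submitted limit prices.
    candidates = sorted({p for p, _ in bids} | {p for p, _ in asks})
    best_price, best_volume = None, -1
    for p in candidates:
        demand = sum(q for limit, q in bids if limit >= p)  # Q_B(p)
        supply = sum(q for limit, q in asks if limit <= p)  # Q_A(p)
        volume = min(demand, supply)                        # V(p)
        if volume > best_volume:  # ASSUMPTION: ties go to the lowest price
            best_price, best_volume = p, volume

    # The clearing price is rounded down, as described in the text.
    return math.floor(best_price), best_volume


bids = [(16, 3), (15, 2), (13, 4)]
asks = [(12, 2), (14, 3), (17, 5)]
price, volume = clear_call_auction(bids, asks)
no_overlap = clear_call_auction([(10, 1)], [(13, 1)])
```

With these made-up orders, prices 14 and 15 both allow 5 units to trade, and the tie-break picks 14. The second call has no crossing orders and returns the midpoint 11.5 with zero volume.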
Data recorded
- Action & market data: full order books, executed trades, prices P_t, trading volumes, bid-ask spreads, end-of-round portfolios.
- Expectation data: full term structure of forecasts; forecast error E^h_{i,t} = P_{t+h} − f_{i,t}(t+h).
- Textual reasoning: PLANS.txt, INSIGHTS.txt, and JSON thoughts logged and scored.
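The forecast error defined above and the forecast incentive described in the decision pipeline (5 cash units for a forecast within ±2.5 units of the realized price) can be computed together. The sketch below is illustrative: the function names are hypothetical, the band is treated as inclusive at exactly 2.5 (an assumption), and the prices and forecasts are made up.

```python
def forecast_error(realized_price, forecast):
    """Realized price minus forecast: P_{t+h} - f_{i,t}(t+h)."""
    return realized_price - forecast


def forecast_bonus(realized_price, forecast, band=2.5, reward=5):
    """Pay `reward` if the forecast is within `band` of the realized price.

    ASSUMPTION: the boundary |error| == band counts as a hit.
    """
    return reward if abs(realized_price - forecast) <= band else 0


def score_forecasts(prices, forecasts, t):
    """Score forecasts made in period t.

    prices: list of realized prices indexed by period.
    forecasts: dict mapping horizon h to the forecast of the price at t + h.
    Horizons whose target period has not been realized are skipped.
    """
    results = {}
    for h, f in forecasts.items():
        if t + h < len(prices):
            realized = prices[t + h]
            results[h] = (forecast_error(realized, f),
                          forecast_bonus(realized, f))
    return results


# Illustrative (made-up) price path and one agent's period-0 forecasts.
prices = [14, 15, 17, 18, 16, 13]
scores = score_forecasts(prices, {0: 15, 2: 20, 5: 12, 10: 14}, t=0)
```

In this example the h = 0 and h = 5 forecasts fall inside the band and earn the bonus, the h = 2 forecast misses by 3, and the h = 10 forecast is not yet scorable.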
Empirical analyses
- Regression tests for disposition effect: Sell dummy (sell>buy in a period) regressed on Gain dummy (price > agent’s weighted-average purchase price), controlling for Average Expectation; Gain dummy positive and highly significant.
- Tests for extrapolation: forecast regressions showing heavier weights on recent returns.
- Macro tests: predictive regressions where bid-offer gap/excess demand forecasts future price changes; cross-sectional dispersion measures tested against trading volume.
- Text scoring: a twenty-mechanism rubric applied to agent reasoning to identify prevalence/intensity of mechanisms across episodes.
- Causal prompt interventions: pre-specified prompts designed to suppress or amplify targeted mechanisms; comparison of bubble magnitudes and market statistics across treatment arms.
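The variables in the disposition-effect test can be constructed from a per-agent trade log as sketched below. This builds the Gain dummy (price above the agent's weighted-average purchase price) and the Sell dummy (sold more than bought in the period), then tabulates sell rates by gain status. It is a simple tabulation, not the authors' regression, and it omits the Average Expectation control. The cost basis assigned to endowed shares, the rule that sales leave the average cost unchanged, and the choice to evaluate the Gain dummy before the period's own purchases are all assumptions.

```python
def disposition_dummies(periods, initial_shares=4, initial_cost=14.0):
    """Build (gain, sell) dummies for one agent.

    periods: list of (price, shares_bought, shares_sold) per period.
    ASSUMPTION: endowed shares carry cost basis `initial_cost`.
    """
    shares, avg_cost = initial_shares, initial_cost
    rows = []
    for price, bought, sold in periods:
        # Gain dummy: price above the weighted-average purchase price.
        gain = int(shares > 0 and price > avg_cost)
        # Sell dummy: sold more than bought in the period.
        sell = int(sold > bought)
        rows.append((gain, sell))
        if bought > 0:
            # Purchases update the weighted-average cost.
            avg_cost = (avg_cost * shares + price * bought) / (shares + bought)
        # ASSUMPTION: sales reduce the position but not the average cost.
        shares = shares + bought - sold
    return rows


def sell_rate_by_gain(rows):
    """Share of periods with Sell = 1, split by the Gain dummy."""
    rates = {}
    for flag in (0, 1):
        sells = [s for g, s in rows if g == flag]
        rates[flag] = sum(sells) / len(sells) if sells else None
    return rates


# Illustrative (made-up) trade log: (price, bought, sold).
rows = disposition_dummies([(16, 0, 1), (12, 2, 0), (15, 0, 2)])
rates = sell_rate_by_gain(rows)
```

A higher sell rate when Gain = 1 than when Gain = 0 is the raw pattern that a positive Gain coefficient in the regression would formalize.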

Implications for AI Economics

AI agents inherit human-like behavioral biases: Because LLMs are trained on human-generated financial text, they internalize cognitive patterns (disposition, extrapolation) that can reproduce human-style market inefficiencies. Model training data therefore matters for aggregate market behavior.
Observable CoT unlocks causal inference: Requiring persistent textual reasoning allows researchers and regulators to trace and score internal motives, significantly improving transparency relative to black-box algorithmic agents.
Programmability as a systemic lever: Prompt-level interventions can causally alter collective outcomes (reduce or aggravate bubbles). This suggests a feasible regulator/developer toolkit: enforce or certify “cognitive guardrails” (prompt templates, constraints, or pre-deployment audits) to mitigate destabilizing behaviors.
Market-design and model heterogeneity matter: Mixed-model interactions can generate complex dynamics; regulator attention should extend beyond single-model behavior to cross-model interactions (e.g., rational vs extrapolative LLM types).
Deployment caution for real markets
- Even the absence of human frictions (transaction costs, attention limits) does not guarantee rational markets: LLM-driven traders can generate bubbles via learned heuristics.
- Policy should consider not only model internals but also deployment rules (allowed strategies, required reasoning disclosure, forecast incentives).
Research directions
- Test robustness in richer market settings (continuous-time trading, short-selling, margin, order book microstructure, intertemporal risk preferences).
- Explore how training data composition and fine-tuning strategies alter behavioral patterns.
- Design and evaluate operational regulatory frameworks: prompt-certification, mandatory CoT disclosures, or real-time monitoring using the authors’ scoring framework.
Limitations (noted / implied)
- Simulated environment is simplified relative to real financial markets (no high-frequency microstructure, no institutional constraints, stylized dividend process).
- Generalizability across LLM architectures and training regimes needs further study.
- Ethical and practical questions about who sets guardrails and how adversarial actors might circumvent prompts remain.


Assessment

Paper Type: rct
Evidence Strength: medium. Strong internal validity for the claim that prompt interventions change LLM-agent behavior within the simulated environment (randomized treatments, direct observation of generated reasoning and trades), but external validity is limited: results depend on specific LLM(s), prompt designs, simulation parameters and may not map directly to human traders or real-world market structure.
Methods Rigor: medium. The design combines controlled randomization, replication of classic experimental market patterns, and a structured 20-mechanism coding of reasoning text, which shows thoughtful triangulation; however, potential weaknesses include model- and prompt-specific sensitivity, possible subjectivity in mechanism scoring, unclear robustness across LLM versions/seeds/market parametrizations, and limited information about replication and pre-registration.
Sample: Simulated asset-market experiments run as open-call auctions populated by autonomous agents powered by one or more large language models; agents produce both trading actions and natural-language reasoning which were scored across a 20-mechanism framework; analyses compare multiple simulated market runs and randomized prompt-treatment arms (exact number of agents, runs, and model versions not specified in the summary).
Themes: governance innovation
Identification: Controlled simulated open-call auctions with autonomous LLM agents and randomized prompt interventions that assign targeted prompts (treatments) to agents; causal effects inferred from between-treatment comparisons in a controlled simulation environment and from pre/post manipulation of prompt content together with a mechanistic scoring framework applied to agents' generated reasoning.
Generalizability:
- Findings are tied to the specific LLM architectures, training data and prompt designs used and may not generalize to other models or future versions.
- Simulated markets abstract away real-world market complexity (institutions, latency, strategic heterogeneity, liquidity providers, regulatory constraints).
- Human traders may behave differently; results from purely AI-agent markets do not directly translate to human-AI or mixed markets.
- Scale effects (number of agents, market depth) and longer-horizon adaptation are not necessarily captured.
- Mechanism scoring may depend on annotation choices and may not capture latent reasoning not expressed in text.

Claims (6)

Claim	Category	Direction	Confidence	Outcome	Details
AI agents exhibit a pronounced disposition effect.	Decision Quality	positive	high	disposition effect (tendency to sell winners and hold losers)	0.6
AI agents form recency-weighted extrapolative beliefs (i.e., overweight recent price history when forecasting future prices).	Decision Quality	positive	high	recency-weighted extrapolative beliefs in price forecasts	0.6
These individual-level patterns aggregate into equilibrium dynamics that replicate classic experimental findings (Smith et al., 1988), including the predictive power of excess demand for future prices.	Market Structure	positive	high	predictive power of excess demand for future prices	0.6
There is a positive relationship between disagreement among agents and trading volume in the simulated markets.	Market Structure	positive	high	relationship between disagreement (belief dispersion) and trading volume	0.6
By analyzing agents' reasoning text through a twenty-mechanism scoring framework, targeted prompt interventions causally amplify or suppress specific behavioral mechanisms.	Other	mixed	high	mechanism scores derived from agents' reasoning text (20-mechanism framework)	0.6
Targeted prompt interventions significantly alter the magnitude of market bubbles (they can amplify or suppress bubble size).	Market Structure	mixed	high	magnitude of market bubbles	0.6