Nonstandard Errors in AI Agents

We study whether state-of-the-art AI coding agents, given the same data and research question, produce the same empirical results. Deploying 150 autonomous Claude Code agents to independently test six hypotheses about market quality trends in NYSE TAQ data for SPY (2015–2024), we find that AI agents exhibit sizable nonstandard errors (NSEs), that is, uncertainty from agent-to-agent variation in analytical choices, analogous to those documented among human researchers. AI agents diverge substantially on measure choice (e.g., autocorrelation vs. variance ratio, dollar vs. share volume). Different model families (Sonnet 4.6 vs. Opus 4.6) exhibit stable “empirical styles,” reflecting systematic differences in methodological preferences. In a three-stage feedback protocol, AI peer review (written critiques) has minimal effect on dispersion, whereas exposure to top-rated exemplar papers reduces the interquartile range of estimates by 80–99% within converging measure families. Convergence occurs both through within-family estimation tightening and through agents switching measure families entirely, but convergence reflects imitation rather than understanding. These findings have implications for the growing use of AI in automated policy evaluation and empirical research.

Summary

Main Finding

State-of-the-art autonomous AI coding agents produce sizable “nonstandard errors” (NSEs) — agent-to-agent variation in empirical results analogous to human researcher heterogeneity — and this variation is structured: it concentrates in discrete measure-choice forks (which operationalization to compute) rather than in estimation-paradigm forks. Peer-review style feedback from other AI agents does not reduce dispersion, while exposure to a small set of high-rated exemplar papers produces dramatic correlated convergence (imitation and measure-family switching).

Key Points

Experimental scale and setup
- 150 fully autonomous Claude Code agents (100 Sonnet 4.6, 50 Opus 4.6).
- Agents operate end-to-end (data exploration, code, estimation, report writing) on the same NYSE TAQ dataset for SPY (2015–2024; ≈66 GB, >7 billion rows).
- Six hypotheses tested (H1 market efficiency; H2 quoted spread; H3 realized spread; H4 daily trading volume; H5 intraday volatility; H6 price impact).
- Three-stage protocol: S1 independent analysis, S2 AI peer review (two AI evaluators per agent + written feedback), S3 exposure to five top-rated exemplar Stage-2 papers.
- Agents isolated (no inter-agent file access); stochasticity arises from language model sampling (temperature = 1.0).
Magnitude and shape of AI NSE
- NSE is substantial and heterogeneous across hypotheses. Stage-1 IQRs (effect size in %/yr):
  - H1 (market efficiency): IQR = 2.43 %/yr (range ≈ −0.74 to +1.7).
  - H2 (quoted spread): IQR = 0.43 %/yr (agents nearly unanimous: decline ≈ −6.2%/yr).
  - H3 (realized spread): IQR = 5.28 %/yr.
  - H4 (volume): IQR = 10.69 %/yr (bimodal: dollar volume ≈ +6.1%/yr vs. share volume ≈ −4.6%/yr).
  - H5 (intraday volatility): IQR = 0.54 %/yr.
  - H6 (price impact): IQR = 10.34 %/yr (split between trade-level impact and Amihud measures).
- When stratified by measure family (i.e., within a chosen operationalization), dispersion is small; cross-family gaps drive most NSE (discrete forks).
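The contrast between pooled and within-family dispersion can be illustrated with a minimal sketch. The function names and the toy estimates below are hypothetical, chosen only to mimic the bimodal volume pattern described above; they are not the study's data or code.

```python
from collections import defaultdict
from statistics import quantiles


def iqr(values):
    """Interquartile range (Q3 - Q1) with inclusive linear interpolation."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


def iqr_by_family(estimates):
    """estimates: iterable of (measure_family, effect_pct_per_year) pairs."""
    groups = defaultdict(list)
    for family, effect in estimates:
        groups[family].append(effect)
    return {family: iqr(vals) for family, vals in groups.items()}


# Hypothetical Stage-1 estimates (%/yr) for a volume hypothesis.
toy = [
    ("dollar_volume", 6.0), ("dollar_volume", 6.1),
    ("dollar_volume", 6.2), ("dollar_volume", 6.3),
    ("share_volume", -4.7), ("share_volume", -4.6),
    ("share_volume", -4.5), ("share_volume", -4.4),
]

pooled_iqr = iqr([effect for _, effect in toy])  # dominated by the cross-family gap
within_iqr = iqr_by_family(toy)                  # small inside each family
```

In this toy case the pooled IQR is two orders of magnitude larger than either within-family IQR, which is the signature of a discrete fork rather than estimation noise.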
Systematic model-family differences (“empirical styles”)
- Sonnet 4.6 agents: strong preference for autocorrelation measures (H1: 87% autocorr), level OLS, daily frequency.
- Opus 4.6 agents: strong preference for variance-ratio measures (H1: 100% VR), log OLS, higher use of monthly frequency.
- These preferences are stable and systematic rather than random noise.
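One way to surface such preferences is to tabulate choice shares by model family. The sketch below is an assumption about how that tabulation might look, not the authors' code. The record layout and the "other" label for the remaining Sonnet agents are hypothetical; only the 87% and 100% shares come from the figures above.

```python
from collections import Counter


def choice_shares(records):
    """records: list of (model_family, choice) -> {model: {choice: share}}."""
    pair_counts = Counter(records)
    model_totals = Counter(model for model, _ in records)
    shares = {}
    for (model, choice), count in pair_counts.items():
        shares.setdefault(model, {})[choice] = count / model_totals[model]
    return shares


# H1 measure choices: 100 Sonnet agents (87% autocorrelation) and 50 Opus
# agents (100% variance ratio). "other" is a placeholder label, since the
# summary does not say what the remaining 13 Sonnet agents chose.
records = (
    [("sonnet-4.6", "autocorrelation")] * 87
    + [("sonnet-4.6", "other")] * 13
    + [("opus-4.6", "variance_ratio")] * 50
)

shares = choice_shares(records)
```

Comparing `shares["sonnet-4.6"]` with `shares["opus-4.6"]` makes the split visible at a glance; the same function applies to functional form or sampling frequency.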
Feedback dynamics
- S1 → S2 (peer review): undirected movement; IQRs essentially unchanged on average (written AI peer feedback did not reduce dispersion).
- S2 → S3 (exemplar exposure): dramatic convergence in many measure families — IQR reduction of ~80–99% where agents converged. Convergence occurs by (a) within-family tightening and (b) agents switching measure families (e.g., 62 of 87 autocorrelation agents switched to variance ratio after exemplar exposure).
- Exemplar exposure can also increase dispersion if exemplars introduce additional methodological options that agents adopt inconsistently.
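The two convergence channels can be measured separately, as in the hedged sketch below. The helper names, agent ids, and the IQR values passed to `iqr_reduction_pct` are illustrative assumptions; only the 62-of-87 switching count is taken from the text above.

```python
from collections import Counter


def iqr_reduction_pct(iqr_before, iqr_after):
    """Percentage by which the IQR shrank between two stages."""
    return 100.0 * (iqr_before - iqr_after) / iqr_before


def family_transitions(before, after):
    """Count (family_before, family_after) pairs for agents in both stages."""
    return Counter((before[a], after[a]) for a in before if a in after)


# Hypothetical agent ids; 62 of the 87 autocorrelation agents switch.
stage2 = {f"agent_{i:03d}": "autocorrelation" for i in range(87)}
stage3 = {
    agent: ("variance_ratio" if i < 62 else "autocorrelation")
    for i, agent in enumerate(stage2)
}

transitions = family_transitions(stage2, stage3)
switched = transitions[("autocorrelation", "variance_ratio")]
switch_rate = switched / len(stage2)

# Made-up IQRs for one measure family, not values from the study.
reduction = iqr_reduction_pct(2.0, 0.1)
```

Reporting the transition counts next to the within-family IQR change keeps tightening and switching from being conflated in a single dispersion number.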
Practical notes
- All effect sizes converted to a common unit (percentage change per year relative to measure mean) via an automated conversion pipeline that reads agent code and processed data.
- Total API cost ≈ $1,558 across 450 agent-stage runs. Median wall-clock times: S1 ≈ 53 min, S2 ≈ 27 min, S3 ≈ 25 min.

Data & Methods

Agents and platform
- Instances of Claude Code (Anthropic) run fully autonomously in isolated Singularity containers with filesystem isolation and fixed compute budgets (up to 40 GB RAM, 6 CPUs; token/billing caps).
- Two model variants: claude-sonnet-4-6 (n=100) and claude-opus-4-6 (n=50); default sampling (temperature = 1.0); agents used a --dangerously-skip-permissions flag to permit full autonomy in the experimental environment.
Data
- NYSE TAQ millisecond trade & quote Parquet files for SPY covering 2015–2024 (≈2,516 trading days).
Hypotheses and deliverables
- Six hypotheses with varying levels of operational specificity. H2/H3 provide explicit measures; H1, H4, H6 are abstract and admit multiple valid operationalizations.
- Each agent produced: structured CSV of estimates (effect, s.e., t-stat), code, figures, and a 2,000–4,000 word report.
Protocol
- Stage 1: independent analysis.
- Stage 2: peer review — each agent received two anonymized written evaluations (one Sonnet, one Opus) and revised.
- Stage 3: agents received the five highest-rated anonymized Stage-2 reports and performed final revisions.
Normalization and analysis
- An automated conversion agent parsed each agent’s estimation code and processed data to normalize reported effects into %/yr relative to the series mean (to make level vs. log vs. differently scaled reports comparable).
- Multiverse-style decomposition: analyses stratified by discrete decision forks (measure family, frequency, functional form) to attribute dispersion to sources.
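The normalization step can be sketched as follows. This is an assumption about what converting to "%/yr relative to the series mean" might involve, not the conversion agent's actual logic: a level-OLS slope is divided by the series mean, and a log-OLS slope is mapped to a percentage growth rate via exp(beta) - 1. All function names and the toy series are hypothetical.

```python
import math
from statistics import fmean


def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx


def level_trend_pct_per_year(years, values):
    """Level-OLS slope expressed as a percentage of the series mean."""
    return 100.0 * ols_slope(years, values) / fmean(values)


def log_trend_pct_per_year(years, values):
    """Log-OLS slope converted to a percentage growth rate per year."""
    beta = ols_slope(years, [math.log(v) for v in values])
    return 100.0 * math.expm1(beta)


# Toy series growing exactly 5% per year over ten years.
years = list(range(10))
values = [100.0 * 1.05 ** t for t in years]

level_pct = level_trend_pct_per_year(years, values)
log_pct = log_trend_pct_per_year(years, values)
```

On this series the log specification recovers the 5% growth rate, while the level specification gives a somewhat smaller figure. Functional form can therefore move the reported effect even after conversion to a common unit, which is one reason to stratify by it.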
Comparisons & context
- Design mirrors human many-analyst studies (e.g., Menkveld et al., 2024). Because the agents share foundation-model training, dispersion that persists among them points to task underspecification rather than researcher idiosyncrasy as the source of NSEs.

Implications for AI Economics

Automated research is not inherently reproducible across independent AI runs
- Even with shared architecture and data, autonomous AI researchers can produce divergent conclusions when research questions are underspecified. Reporting a single AI-run estimate risks hiding substantive model/measure uncertainty.
Measure-choice uncertainty must be treated as first-order epistemic risk
- Most AI NSE was driven by discrete operationalization choices (measure families). Robust AI-driven inference requires (a) explicitly pre-specified measures or (b) transparent multiverse/specification-curve reporting that reports outcomes across plausible operationalizations.
Model-family heterogeneity matters
- Different foundation-model variants embed systematic methodological preferences (“empirical styles”). Policy evaluations or automated pipelines that rely on a single model family risk bias toward that family’s empirical style. Using diverse model families (ensembles) should be standard practice for robustness.
Peer review alone (as implemented here) is insufficient to reduce AI-caused variation
- AI peer feedback produced undirected changes and did not reduce dispersion. By contrast, exemplar exposure produced strong imitation, which reduces dispersion but may produce correlated bias (sycophancy/imitation). Governance must balance inducement of convergence (useful for coordination) vs. the risk of groupthink.
Practical recommendations for AI-based empirical pipelines
- Require agents to (1) output exact operational definitions and code for all measures, (2) run and report a prespecified multiverse of reasonable operationalizations, and (3) quantify NSE (IQR/IDR across agents or within multiverse) alongside standard errors.
- Use multiple model families and seed settings to identify “empirical-style” sensitivity; treat convergence to exemplars as both a signal and a risk.
- Curate exemplars carefully: exemplar-driven convergence can be useful when exemplars are validated, but can also propagate unexamined methodological choices.
- Maintain human oversight for interpretation of model-driven convergences, particularly for policy-relevant inference.
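Recommendations (2) and (3) could be implemented along the lines of the sketch below: enumerate a prespecified grid of forks, estimate each specification, and report the IQR across specifications next to the usual standard error. Everything here is hypothetical, including the fork names and the `toy_estimate` stand-in, which returns fixed numbers in place of a real estimation routine.

```python
from itertools import product
from statistics import median, quantiles


def run_multiverse(estimate, forks):
    """Call estimate(**spec) for every combination of fork options."""
    names = list(forks)
    results = []
    for combo in product(*(forks[name] for name in names)):
        spec = dict(zip(names, combo))
        effect, se = estimate(**spec)
        results.append({**spec, "effect": effect, "se": se})
    return results


def summarize(results):
    """Report dispersion across specifications alongside the median s.e."""
    effects = [r["effect"] for r in results]
    q1, _, q3 = quantiles(effects, n=4, method="inclusive")
    return {
        "n_specs": len(results),
        "median_effect": median(effects),
        "median_se": median(r["se"] for r in results),
        "nse_iqr": q3 - q1,
    }


def toy_estimate(measure, functional_form, frequency):
    """Stand-in estimator returning made-up (effect, s.e.) values in %/yr."""
    base = {"dollar_volume": 6.0, "share_volume": -4.5}[measure]
    bump = 0.1 if functional_form == "log" else 0.0
    return base + bump, 0.5


forks = {
    "measure": ["dollar_volume", "share_volume"],
    "functional_form": ["level", "log"],
    "frequency": ["daily", "monthly"],
}

summary = summarize(run_multiverse(toy_estimate, forks))
```

In this toy grid the IQR across specifications is far larger than the median standard error, which is the situation in which a single reported estimate would most understate uncertainty.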
Broader conceptual implications
- The persistence of NSE in AI agents suggests that much of the dispersion in human multi-analyst studies reflects task underspecification (the research question admits many valid paths), not only researcher idiosyncrasy. Thus, reducing epistemic uncertainty in economics requires clearer task specification and standardized reporting practices, whether the analyst is human or machine.
Limitations and caution
- Results are for Claude Code Sonnet/Opus variants on one dataset (SPY TAQ) and for a particular set of hypotheses and instructions; other models, temperatures, datasets, or instruction designs could yield different NSE patterns.
- Exemplar selection, peer-review design, and agent autonomy settings affect dynamics; governance choices will shape practical outcomes in deployed systems.


Assessment

Paper Type: other
Evidence Strength: medium. Strong internal evidence that agent-to-agent variation (NSEs) and stage-wise changes in dispersion exist in this experimental setting (large number of agents, recorded choices, clear before/after comparison). However external validity is limited by use of a single dataset (SPY TAQ), two model families from one provider, a synthetic experimental workflow, and potential sensitivity to prompt/exemplar selection and agent configuration.
Methods Rigor: medium. The study uses a systematic, pre-specified protocol, a large N of autonomous agents, careful tracking of methodological choices and dispersion metrics, and stage-wise interventions, which support credible internal inference. Rigor is tempered by opaque details (e.g., degree of randomization, prompt engineering, exemplar selection criteria), reliance on one task/domain, and absence of human-analyst or out-of-sample validation to assess real-world behavior.
Sample: NY Stock Exchange TAQ transaction and quote data for SPY spanning 2015–2024; 150 autonomous Claude Code agents split across two model families (Sonnet 4.6 and Opus 4.6) independently executed analyses testing six pre-specified market-quality hypotheses under a three-stage feedback protocol (independent, peer review, exposure to exemplar papers).
Themes: human_ai_collab, governance
Identification: Designed multi-agent experiment: 150 autonomous AI coding agents (two model families) independently run analyses on the same raw data and pre-specified hypotheses, with within-agent and between-agent comparisons across three sequential stages (independent analysis, AI peer review, exposure to top-rated exemplar papers) to identify effects of feedback/exposure and model-family on methodological choices and estimate dispersion.
Generalizability:
- Single financial instrument (SPY) and market-microstructure measures; may not generalize to other economic domains or datasets.
- Only two model families (Sonnet 4.6, Opus 4.6) from one provider; other models or versions may behave differently.
- Artificial laboratory workflow (fully autonomous agents, curated exemplars) may not reflect human–AI collaborative workflows or deployed policy-analysis pipelines.
- Results may depend on prompt templates, exemplar selection, and agent hyperparameters that are not fully reported.
- No human-analyst benchmark or out-of-sample validation to show how NSEs compare to human heterogeneity in practice.

Claims (10)

Claim	Category	Direction	Confidence	Outcome	Details
AI-to-AI variation (nonstandard errors, NSEs) across autonomous coding agents produces substantial uncertainty in empirical results analogous to human researcher heterogeneity.	Research Productivity	positive	high	agent-to-agent variation in methodological choices and effect estimates (dispersion; e.g., interquartile range of estimates)	n=150 substantial agent-to-agent variation (dispersion measured via IQR) across 150 autonomous coding agents 0.12
Agents split on measure choice (e.g., autocorrelation vs. variance-ratio tests; dollar-volume vs. share-volume measures), producing different substantive estimates from the same raw data and hypotheses.	Research Productivity	positive	high	measure selection (categorical) and resulting substantive effect estimates (continuous)	n=150 agents split on measure choice producing different substantive estimates from same data/hypotheses 0.12
Different model families (Sonnet 4.6 vs. Opus 4.6) exhibit stable, systematic differences in methodological preferences and choice patterns: distinct empirical 'styles'.	Research Productivity	positive	high	frequency/distribution of methodological choices by model family (categorical choices; between-family dispersion)	n=150 distinct methodological preference patterns by model family (Sonnet 4.6 vs Opus 4.6) 0.12
AI peer review (agents exchanging written critiques) produced minimal reduction in dispersion of estimates.	Research Productivity	null_result	high	change in dispersion (IQR) of estimates between independent-analysis stage and peer-review stage	n=150 AI peer review produced minimal reduction in dispersion (IQR) of estimates between independent-analysis and peer-review stages 0.12
Exposure to top-rated exemplar papers produced large reductions in interquartile range (IQR) of estimates; within converging measure families, IQR fell by roughly 80–99%.	Output Quality	negative	high	percentage reduction in interquartile range (IQR) of effect estimates within measure families after exemplar exposure	n=150 IQR fell by roughly 80-99% within converging measure families 0.12
Convergence after exemplar exposure occurred by both tightening of estimates within a measure family and by agents switching measure families.	Output Quality	mixed	high	within-family dispersion (IQR) and measure-family switching frequency (binary/categorical)	n=150 0.12
The post-exemplar convergence largely reflected imitation of exemplar choices rather than demonstrated understanding or principled correction by agents.	Decision Quality	negative	medium	qualitative indicators of reasoning/comprehension in agents' outputs (textual justification, code changes) versus mere replication of exemplar choices	n=150 0.07
Reliance on single-agent outputs or non-diverse agent ensembles can understate substantive uncertainty and bias conclusions in automated policy evaluation or AI-assisted empirical research.	Output Quality	negative	medium	degree to which single-agent point estimates fail to capture between-agent dispersion (IQR/variance) and potential directionality of bias due to model-family-specific choices	n=150 0.07
Agents' methodological choices and resulting effect estimates were systematically recorded and used to quantify dispersion and measure switching across stages.	Research Productivity	null_result	high	recorded methodological choices (categorical), effect estimates (continuous), dispersion metrics (IQR), and switching indicators	n=150 0.12
The experiment used NYSE TAQ transaction and quote data for SPY covering 2015–2024 and tested six pre-specified hypotheses about market-quality trends.	Research Productivity	null_result	high	dataset and experimental design variables (data coverage, number of hypotheses tested)	Dataset: NYSE TAQ SPY 2015-2024; six pre-specified hypotheses; 150 agents in protocol 0.12

AI coding agents produce widely different empirical choices and results from identical data; exemplar-driven consensus sharply reduces numerical dispersion but arises from imitation rather than demonstrated understanding, so single-model automated analyses risk understating substantive uncertainty.