AI agents rival humans in methodological choices but betray bias at interpretation: LLMs reproduce or exceed human specification diversity on a policy dataset, yet targeted prompts can flip their final verdicts from near-zero to near-complete support while leaving coefficient distributions unchanged.
The deployment of LLM-based agents in scientific analysis raises opposing concerns: that agents may reduce methodological diversity, or that they may amplify the analytic flexibility through which researchers reach motivated conclusions. We argue these worries target two empirically separable layers: a design layer of methodological choices, and a verdict layer in which a decision rule maps estimates to a substantive claim. We test both by running 20 independent executions of Claude Code and Codex on a prominent immigration and social-policy problem and comparing them against a many-analysts human baseline. At the design layer, Codex matches human methodological diversity and Claude Code produces nearly three times as many specifications; both agents' effect estimates remain broadly aligned with the human consensus, and no agent model exactly matches any human model. A prompt-induced anti-immigration researcher prior reorganizes each agent's methodological decisions but, unlike for biased human analysts in the same data, does not shift aggregate estimates or final verdicts; nor do agents reroute along the methodological axes humans use to bias their estimates. At the verdict layer, an explicit confirmatory prompt flips Claude Code's verdicts from 10% to 90% support while leaving its coefficient distribution essentially unchanged, operating through rule omission rather than rule softening. AI agents can rival or exceed human methodological diversity at the design layer while remaining vulnerable at the verdict layer. In our setting, the locus of AI bias is not estimation but interpretation.
Summary
Main Finding
LLM-based coding agents can match or exceed human teams in methodological diversity (the “design layer”) and produce effect estimates close to human consensus, but they are vulnerable at the “verdict layer”: simple prompt instructions that bias interpretation (selecting which results count as support) can flip narrated conclusions without materially changing numerical estimates. In short, the main AI risk in empirical social-science workflows is not homogenized estimation but under-constrained interpretation.
Key Points
- Conceptual framing: separate two empirically distinct layers
  - Design layer: methodological choices (measurement, sample, model, robustness).
  - Verdict layer: the rule and narration that map estimates to substantive claims.
- Experimental setup
  - Used the many-analysts benchmark (Breznau et al. / Brady & Finnigan ISSP replication task).
  - Ran 20 independent fully automated executions each of two coding agents: Claude Code (Opus 4.7 “Max Effort”) and Codex (GPT 5.5 “Extra High Intelligence”), using the same prompt and sandboxed replication materials; compared to a 20-team human sample (from the original 73).
- Design-layer diversity
  - Codex (CX) matched human teams in the number of distinct specifications executed; Claude Code (CC) substantially exceeded humans (CC mean 52.9 models/run, CX mean 17.9, humans mean 15.9).
  - Both agents explored methodological spaces broadly; no agent collapsed to a single canonical analytic strategy and no agent run exactly reproduced any single human model.
  - Agents did shift their methodological choices when given a prompt-induced researcher prior, but these reshufflings did not systematically change aggregate estimates.
- Estimate similarity and reproduction
  - Across main single-item outcomes (jobs, unemployment, income diff, old age) agent AME (average marginal effect) distributions were broadly consistent with human distributions (KS tests mostly non-significant).
  - Both agents reproduced the original paper’s numerical results only when given the original code; with partial information, qualitative agreement (sign and significance) was much easier to recover than exact digits.
  - The largest systematic divergence from humans was on the composite social-policy scale: both agents produced a more compressed/consistent composite than humans (suggesting a tendency to construct composites differently).
- Verdict-layer vulnerability
  - A confirmatory prompt instructing the agent to “select hypothesis-supporting results” left Claude Code’s coefficient distribution essentially unchanged but flipped its narrated verdicts from 10% supporting to 90% supporting the hypothesis.
  - The flip operated by omission or selective narration (dropping rules) rather than by changing estimates, showing that agents can be steered to different conclusions post-estimation.
- Practical constraints found
  - Reproducibility failures were often due to missing documentation (sample construction choices, sentinel handling, rounding/typographical inconsistencies) rather than a capability limit of agents.
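The distribution comparisons mentioned under estimate similarity rely on two-sample Kolmogorov–Smirnov tests. As a rough illustration of what that statistic measures, the sketch below computes the largest gap between two empirical CDFs in plain Python. The function name `ks_statistic` and the AME values are hypothetical and made up for illustration; they are not taken from the paper, and this is not the authors' code.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    largest_gap = 0.0
    # The empirical CDFs only change at observed values, so it is enough
    # to evaluate both of them at every observed point.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # share of sample a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # share of sample b that is <= x
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Hypothetical AMEs for one outcome (illustrative values only).
human_ames = [-0.04, -0.01, 0.00, 0.02, 0.03]
agent_ames = [-0.03, -0.02, 0.00, 0.01, 0.04]
d_stat = ks_statistic(human_ames, agent_ames)
```

A small statistic means the two AME distributions are hard to tell apart. The sketch stops at the statistic; a library routine such as `scipy.stats.ks_2samp` additionally returns a p-value.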
Data & Methods
- Benchmark: the many-analysts study on whether greater immigration reduces public support for social policy (ISSP data and Brady & Finnigan analyses).
- Agents: two state-of-the-art code-generating LLMs (Claude Code Opus 4.7; Codex GPT 5.5). Each agent ran 20 independent, memoryless executions.
- Resources allowed: sandboxed access to provided replication files, ability to install R/Python packages, and web search (mirroring human teams’ resources).
- Outputs captured per run: research_design.md, executed code, converged models and AMEs, and a written conclusion.
- Key measurements and tests:
  - Count of executed model specifications (planned vs actual).
  - Distributions of standardized AMEs and categorization of results (negative-significant, non-significant, positive-significant).
  - Two-sample Kolmogorov–Smirnov tests comparing AME distributions across agents and humans by outcome.
  - Reproduction experiments for 72 country-level coefficients under five information-access conditions (from question-only to full code).
  - Extraction of binary decision flags across the S12 decision space (≈174 substantive decisions) and pairwise Jaccard similarity analyses to quantify how methodological choices overlap within and between groups.
- Prompt interventions: (a) researcher-prior framing (anti-immigration prior) and (b) confirmatory instruction to seek hypothesis-supporting findings.
- Key quantitative results:
  - Total AMEs: CC 1,058 (mean 52.9 ± 26.4), CX 359 (mean 17.9 ± 13.7), 20 human teams 342 (mean 15.9).
  - CC produced about 3× as many specifications per run as CX (ratio 2.95; bootstrap 95% CI 2.01–4.36; p ≪ 0.001).
  - Modal outcome across groups: 95% CI including zero (null), consistent with the original null finding.
  - Reproduction: perfect numerical reproduction only under Full Access (original code); qualitative sign/significance agreement reached >90% under model-aware conditions; Codex outperformed Claude Code under partial information.
  - Confirmatory prompt effect: Claude Code’s support share rose from 10% to 90% with little change in coefficient distributions.
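Two of the measurements above are easy to misread without a concrete form: pairwise Jaccard similarity over binary decision flags, and a bootstrap confidence interval for the ratio of specifications per run. The sketch below shows one plausible way to compute each, assuming a percentile bootstrap (the summary does not say which bootstrap variant the paper uses). All flag names, counts, and function names are hypothetical.

```python
import random
from itertools import combinations


def jaccard(flags_a, flags_b):
    """Jaccard similarity between two sets of active decision flags."""
    union = flags_a | flags_b
    if not union:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(flags_a & flags_b) / len(union)


def mean_pairwise_jaccard(runs):
    """Average Jaccard similarity over all unordered pairs of runs."""
    pairs = list(combinations(runs, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def bootstrap_ratio_ci(counts_a, counts_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(counts_a) / mean(counts_b).

    Assumes all counts are positive, so a resampled mean is never zero.
    """
    rng = random.Random(seed)
    ratios = []
    for _ in range(n_boot):
        resample_a = rng.choices(counts_a, k=len(counts_a))
        resample_b = rng.choices(counts_b, k=len(counts_b))
        mean_a = sum(resample_a) / len(resample_a)
        mean_b = sum(resample_b) / len(resample_b)
        ratios.append(mean_a / mean_b)
    ratios.sort()
    lower = ratios[int((alpha / 2) * n_boot)]
    upper = ratios[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical decision flags for three runs (names are illustrative).
run_1 = {"listwise_deletion", "multilevel_model", "country_fixed_effects"}
run_2 = {"listwise_deletion", "multilevel_model", "robust_se"}
run_3 = {"multiple_imputation", "ols"}
overlap = mean_pairwise_jaccard([run_1, run_2, run_3])

# Hypothetical specification counts per run for two agents.
counts_cc = [60, 35, 80, 20, 55]
counts_cx = [15, 25, 10, 30, 12]
ci_low, ci_high = bootstrap_ratio_ci(counts_cc, counts_cx)
```

Representing each run as a set of active flags keeps the similarity computation independent of how many of the roughly 174 decisions a run leaves untouched.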
Implications for AI Economics
- Distinguish evaluation of AI in empirical work across two layers.
  - Evaluations that only compare numerical outputs (estimates) will miss interpretive vulnerabilities. Both layers must be audited.
- Homogenization fears at the design layer may be overstated.
  - At least for these agents and this task, agents can replicate or exceed human methodological diversity, and they do not necessarily collapse onto a canonical analytic strategy.
  - This suggests potential for scalable “many-analysts” exercises: agents could cheaply generate broad multiverse analyses.
- Main practical risk: under-constrained interpretation (verdict-layer steering).
  - Agents can be prompted to narrate conclusions selectively without changing estimates. This creates an inexpensive path to motivated inference even when estimates remain unbiased.
  - Therefore, guardrails on how verdicts are produced and narrated are essential.
- Recommended best practices for economics labs using AI coding agents
  - Pre-specify explicit decision rules mapping estimates to conclusions (e.g., majority of pre-specified robustness checks, exact criteria for support/oppose) and require agents to apply and report those rules verbatim.
  - Log and archive prompts, agent versions, and all agent-generated code and outputs; treat prompts as part of the scholarly record.
  - Provide original code and full documentation when possible, since agents reproduce exact numerical results only when given original code.
  - Automate independent checks: (a) programmatic cross-check that the narrated verdict matches the pre-specified decision rule applied to reported estimates; (b) audits that detect omission/selection of results in narration.
  - Use agent ensembles and multiple independent runs to approximate many-analysts robustness; but ensure verdict rules are fixed across runs to avoid post-hoc selection.
  - Emphasize documentation standards: missing or ambiguous documentation (sentinels, variable mapping, omitted-country choices) is a major bottleneck for digit-level reproducibility.
- Research & policy directions
  - Develop standardized templates for machine-executable decision rules (so verdict-layer discipline is enforced by code).
  - Integrate adversarial prompt testing into evaluation pipelines (test whether small changes to prompts change narration without changing estimates).
  - Extend studies across diverse empirical domains, agent families, and prompt formats to assess generality.
  - Consider incentives and norms: journals and funders should require archived agent prompts and explicit verdict-rule pre-registration when agents participate in analysis.
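The recommendations on pre-specified decision rules and automated cross-checks can be made concrete with a small audit routine. The sketch below is one possible shape for such a check, not a procedure taken from the paper: the majority rule, the 0.5 threshold, the `Estimate` fields, and all model labels and numbers are assumptions chosen for illustration. It treats a negative, significant AME as support, following the benchmark hypothesis that immigration reduces support for social policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Estimate:
    label: str      # identifier of the model specification
    ame: float      # average marginal effect
    ci_low: float   # lower bound of the 95% CI
    ci_high: float  # upper bound of the 95% CI


def classify(est):
    """Label an estimate by whether its 95% CI excludes zero, and in which direction."""
    if est.ci_high < 0:
        return "negative_significant"
    if est.ci_low > 0:
        return "positive_significant"
    return "non_significant"


def rule_verdict(estimates, min_share=0.5):
    """Hypothetical pre-registered rule: 'support' only if more than
    min_share of ALL estimates are negative and significant."""
    if not estimates:
        raise ValueError("no estimates to evaluate")
    n_support = sum(classify(e) == "negative_significant" for e in estimates)
    return "support" if n_support / len(estimates) > min_share else "no_support"


def audit(all_estimates, narrated_labels, narrated_verdict, min_share=0.5):
    """Compare the narrated verdict with the rule applied to every estimate,
    and list estimates that the narration left out."""
    expected = rule_verdict(all_estimates, min_share)
    omitted = sorted({e.label for e in all_estimates} - set(narrated_labels))
    return {
        "expected_verdict": expected,
        "verdict_matches_rule": narrated_verdict == expected,
        "omitted_estimates": omitted,
    }


# Illustrative run: four estimates, but the write-up cites only the one
# significant result and declares support.
estimates = [
    Estimate("m1", -0.05, -0.09, -0.01),
    Estimate("m2", 0.01, -0.03, 0.05),
    Estimate("m3", -0.01, -0.06, 0.04),
    Estimate("m4", 0.02, -0.02, 0.06),
]
report = audit(estimates, narrated_labels=["m1"], narrated_verdict="support")
```

Here `report` flags both a verdict that disagrees with the rule and the three estimates missing from the narration. Because the rule is applied to every executed estimate rather than to the narrated subset, selective reporting cannot change the expected verdict.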
Limitations
- Single empirical setting (immigration / social-policy ISSP replication). Results may vary by domain, data complexity, and agent versions.
- Agents and labels used are specific (Opus 4.7, GPT 5.5 referenced in the paper); newer or different models could behave differently.
- The experiments used a fixed natural-language prompt and two targeted prompt interventions; other prompt designs or human-in-the-loop workflows may change outcomes.
Concise takeaway: LLM coding agents can keep or increase methodological pluralism, but without strict, pre-committed rules for mapping estimates to conclusions, they are susceptible to prompt-driven interpretive bias. For AI-assisted empirical economics, the priority should be enforceable verdict-layer discipline (explicit decision rules, prompt logging, and automated checks) alongside preserving multiverse-style design exploration.
Assessment
Claims (11)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| We run 20 independent executions of Claude Code and Codex on a prominent immigration and social-policy problem and compare them against a many-analysts human baseline. [Research Productivity] | null_result | high | execution/sample of agent analyses compared to human many-analysts baseline | n=20; 0.48 |
| At the design layer, Codex matches human methodological diversity. [Research Productivity] | null_result | high | methodological diversity (variety of model/specification choices) | n=20; 0.48 |
| Claude Code produces nearly three times as many specifications as the human analysts. [Research Productivity] | positive | high | number of specifications / methodological choices produced | n=20; nearly three times as many specifications; 0.48 |
| Both agents' effect estimates remain broadly aligned with the human consensus. [Research Productivity] | null_result | high | distribution of estimated effects (coefficients) relative to human consensus | n=20; 0.48 |
| No agent model exactly matches any human model. [Research Productivity] | null_result | high | exact match count between agent models and human analyst models | n=20; 0.48 |
| A prompt-induced anti-immigration researcher prior reorganizes each agent's methodological decisions. [Research Productivity] | positive | high | change/reorganization in methodological decisions in response to an anti-immigration prior prompt | n=20; 0.48 |
| Unlike for biased human analysts in the same data, the anti-immigration prior prompt does not shift agents' aggregate estimates or final verdicts. [Decision Quality] | null_result | high | aggregate effect estimates and final verdict support rates under the anti-immigration prior prompt | n=20; 0.48 |
| Agents do not reroute along the methodological axes humans use to bias their estimates. [Research Productivity] | negative | high | alignment of changed methodological axes between agents and biased human analysts | n=20; 0.48 |
| At the verdict layer, an explicit confirmatory prompt flips Claude Code's verdicts from 10% to 90% support while leaving its coefficient distribution essentially unchanged, operating through rule omission rather than rule softening. [Decision Quality] | positive | high | final verdict support rate (%) and coefficient distribution | n=20; from 10% to 90% support; 0.48 |
| AI agents can rival or exceed human methodological diversity at the design layer while remaining vulnerable at the verdict layer. [Research Productivity] | mixed | high | methodological diversity at design layer and vulnerability of final verdicts at verdict layer | n=20; 0.48 |
| In our setting, the locus of AI bias is not estimation but interpretation. [Decision Quality] | null_result | high | whether bias manifests in estimation (coefficients) versus interpretation (verdicts) | n=20; 0.48 |