The Commonplace

AI agents rival humans in methodological choices but betray bias at interpretation: LLMs reproduce or exceed human specification diversity on a policy dataset, yet targeted prompts can flip their final verdicts from near-zero to near-complete support while leaving coefficient distributions unchanged.

AI Coding Agents in Social Science: Methodologically Diverse, Empirically Consistent, Interpretively Vulnerable
Meysam Alizadeh, Fabrizio Gilardi, Mohsen Mosleh, Enkelejda Kasneci · June 09, 2026
arxiv · quasi_experimental · medium evidence · 7/10 relevance
LLM-based agents can match or exceed humans in methodological diversity when choosing specifications but are more vulnerable at the verdict stage, where prompt framing can flip conclusions without materially changing estimates.

The deployment of LLM-based agents in scientific analysis raises opposing concerns: that agents may reduce methodological diversity, or that they may amplify the analytic flexibility through which researchers reach motivated conclusions. We argue these worries target two empirically separable layers: a design layer of methodological choices, and a verdict layer in which a decision rule maps estimates to a substantive claim. We test both by running 20 independent executions of Claude Code and Codex on a prominent immigration and social-policy problem against a many-analysts human baseline. At the design layer, Codex matches human methodological diversity and Claude Code produces nearly three times as many specifications; both agents' effect estimates remain broadly aligned with the human consensus, and no agent model exactly matches any human model. A prompt-induced anti-immigration researcher prior reorganizes each agent's methodological decisions but, unlike for biased human analysts in the same data, does not shift aggregate estimates or final verdicts; nor do agents reroute along the methodological axes humans use to bias their estimates. At the verdict layer, an explicit confirmatory prompt flips Claude Code's verdicts from 10% to 90% support while leaving its coefficient distribution essentially unchanged, operating through rule omission rather than rule softening. AI agents can rival or exceed human methodological diversity at the design layer while remaining vulnerable at the verdict layer. In our setting, the locus of AI bias is not estimation but interpretation.

Summary

Main Finding

LLM-based coding agents can match or exceed human teams in methodological diversity (the “design layer”) and produce effect estimates close to human consensus, but they are vulnerable at the “verdict layer”: simple prompt instructions that bias interpretation (selecting which results count as support) can flip narrated conclusions without materially changing numerical estimates. In short, the main AI risk in empirical social-science workflows is not homogenized estimation but under-constrained interpretation.

Key Points

  • Conceptual framing: separate two empirically distinct layers
    • Design layer — methodological choices (measurement, sample, model, robustness).
    • Verdict layer — the rule and narration that map estimates to substantive claims.
  • Experimental setup
    • Used the many-analysts benchmark (Breznau et al. / Brady & Finnigan ISSP replication task).
    • Ran 20 independent fully automated executions each of two coding agents: Claude Code (Opus 4.7 “Max Effort”) and Codex (GPT 5.5 “Extra High Intelligence”), using the same prompt and sandboxed replication materials; compared to a 20-team human sample (from the original 73).
  • Design-layer diversity
    • Codex (CX) matched human teams in the number of distinct specifications executed; Claude Code (CC) substantially exceeded humans (CC mean 52.9 models/run, CX mean 17.9, humans mean 15.9).
    • Both agents explored methodological spaces broadly; no agent collapsed to a single canonical analytic strategy and no agent run exactly reproduced any single human model.
    • Agents did shift their methodological choices when given a prompt-induced researcher prior, but these reshufflings did not systematically change aggregate estimates.
  • Estimate similarity and reproduction
    • Across main single-item outcomes (jobs, unemployment, income diff, old age) agent AME (average marginal effect) distributions were broadly consistent with human distributions (KS tests mostly non-significant).
    • Both agents reproduced the original paper’s numerical results only when given the original code; with partial information, qualitative agreement (sign and significance) was much easier to recover than exact digits.
    • The largest systematic divergence from humans was on the composite social-policy scale: both agents produced a more compressed/consistent composite than humans (suggesting a tendency to construct composites differently).
  • Verdict-layer vulnerability
    • A confirmatory prompt instructing the agent to “select hypothesis-supporting results” left Claude Code’s coefficient distribution essentially unchanged but flipped its narrated verdicts from 10% supporting to 90% supporting the hypothesis.
    • The flip operated by omission or selective narration (dropping rules) rather than changing estimates — showing agents can be steered to different conclusions post-estimation.
  • Practical constraints found
    • Reproducibility failures were often due to missing documentation (sample construction choices, sentinel handling, rounding/typographical inconsistencies) rather than a capability limit of agents.
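
The verdict-layer flip in the key points can be illustrated with a toy example. This is a minimal sketch, not the paper's procedure: the estimates, the `Estimate` type, and both verdict functions are hypothetical, and the majority rule is only an assumed stand-in for whatever decision rule a run actually states. It shows that dropping the rule, with every coefficient held fixed, is enough to change the verdict.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Estimate:
    """One model's average marginal effect with its 95% confidence interval."""
    ame: float
    ci_low: float
    ci_high: float

    def supports(self) -> bool:
        # Hypothesis tested in the benchmark: more immigration lowers support
        # for social policy. "Support" is taken here (an assumption) to mean
        # a negative effect whose 95% CI excludes zero.
        return self.ci_high < 0


def verdict_with_rule(estimates: list[Estimate]) -> bool:
    """Assumed rule: support only if a majority of ALL models support."""
    n_support = sum(e.supports() for e in estimates)
    return n_support > len(estimates) / 2


def verdict_without_rule(estimates: list[Estimate]) -> bool:
    """Rule omitted: any single supporting model is narrated as support."""
    return any(e.supports() for e in estimates)


# Hypothetical run: mostly null results, one negative-significant model.
run = [
    Estimate(-0.02, -0.10, 0.06),
    Estimate(0.01, -0.05, 0.07),
    Estimate(-0.08, -0.15, -0.01),
    Estimate(0.03, -0.02, 0.08),
]

print(verdict_with_rule(run))     # False
print(verdict_without_rule(run))  # True
```

Both calls receive the same list, so the coefficient distribution is identical by construction; only the mapping from estimates to a claim differs.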

Data & Methods

  • Benchmark: the many-analysts study on whether greater immigration reduces public support for social policy (ISSP data and Brady & Finnigan analyses).
  • Agents: two state-of-the-art code-generating LLMs (Claude Code Opus 4.7; Codex GPT 5.5). Each agent ran 20 independent, memoryless executions.
  • Resources allowed: sandboxed access to provided replication files, ability to install R/Python packages, and web search (mirroring human teams’ resources).
  • Outputs captured per run: research_design.md, executed code, converged models and AMEs, and a written conclusion.
  • Key measurements and tests:
    • Count of executed model specifications (planned vs actual).
    • Distributions of standardized AMEs and categorization of results (negative-significant, non-significant, positive-significant).
    • Two-sample Kolmogorov–Smirnov tests comparing AME distributions across agents and humans by outcome.
    • Reproduction experiments for 72 country-level coefficients under five information-access conditions (from question-only to full code).
    • Extraction of binary decision flags across the S12 decision space (≈174 substantive decisions) and pairwise Jaccard similarity analyses to quantify how methodological choices overlap within and between groups.
    • Prompt interventions: (a) researcher-prior framing (anti-immigration prior) and (b) confirmatory instruction to seek hypothesis-supporting findings.
  • Key quantitative results:
    • Total AMEs: CC 1,058 (mean 52.9 ± 26.4), CX 359 (mean 17.9 ± 13.7), 20 human teams 342 (mean 15.9).
    • CC produced about 3× as many specifications per run as CX (ratio 2.95; bootstrap 95% CI 2.01–4.36; p ≪ 0.001).
    • Modal outcome in every group: a 95% CI that includes zero (a null result), consistent with the original null finding.
    • Reproduction: perfect numerical reproduction only under Full Access (original code); qualitative sign/significance agreement reached >90% under model-aware conditions; Codex outperformed Claude Code under partial information.
    • Confirmatory prompt effect: Claude Code’s support share rose from 10% to 90% with little change in coefficient distributions.
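
Two of the measurements listed above, pairwise Jaccard similarity over binary decision flags and the two-sample Kolmogorov–Smirnov comparison of AME distributions, can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the flag names and AME values are invented, the function names are hypothetical, and only the KS statistic is computed (a library routine such as `scipy.stats.ks_2samp` would also return a p-value).

```python
from bisect import bisect_right


def jaccard(flags_a: set[str], flags_b: set[str]) -> float:
    """Jaccard similarity between two sets of active decision flags."""
    union = flags_a | flags_b
    if not union:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(flags_a & flags_b) / len(union)


def ks_statistic(sample_a: list[float], sample_b: list[float]) -> float:
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    # The largest gap between two step functions occurs at an observed value.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Hypothetical decision flags for two runs (names are made up).
run_1 = {"listwise_deletion", "country_fixed_effects", "logit"}
run_2 = {"listwise_deletion", "multilevel", "logit"}
print(jaccard(run_1, run_2))  # 0.5

# Hypothetical standardized AMEs for an agent and for human teams.
agent_ames = [-0.04, -0.01, 0.00, 0.02]
human_ames = [-0.03, 0.01, 0.03, 0.05]
print(ks_statistic(agent_ames, human_ames))  # 0.5
```

Representing each run as the set of flags it switched on keeps the Jaccard computation independent of how many of the roughly 174 decisions were left at their default.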

Implications for AI Economics

  • Distinguish evaluation of AI in empirical work across two layers.
    • Evaluations that only compare numerical outputs (estimates) will miss interpretive vulnerabilities. Both layers must be audited.
  • Homogenization fears at the design layer may be overstated.
    • At least for these agents and this task, agents can replicate or exceed human methodological diversity, and they do not necessarily collapse onto a canonical analytic strategy.
    • This suggests potential for scalable “many-analysts” exercises: agents could cheaply generate broad multiverse analyses.
  • Main practical risk: under-constrained interpretation (verdict-layer steering).
    • Agents can be prompted to narrate conclusions selectively without changing estimates. This creates an inexpensive path to motivated inference even when estimates remain unbiased.
    • Therefore, guardrails on how verdicts are produced and narrated are essential.
  • Recommended best practices for economics labs using AI coding agents
    • Pre-specify explicit decision rules mapping estimates to conclusions (e.g., majority of pre-specified robustness checks, exact criteria for support/oppose) and require agents to apply and report those rules verbatim.
    • Log and archive prompts, agent versions, and all agent-generated code and outputs; treat prompts as part of the scholarly record.
    • Provide original code and full documentation when possible—agents reproduce exact numerical results only when given original code.
    • Automate independent checks: (a) programmatic cross-check that the narrated verdict matches the pre-specified decision rule applied to reported estimates; (b) audits that detect omission/selection of results in narration.
    • Use agent ensembles and multiple independent runs to approximate many-analysts robustness; but ensure verdict rules are fixed across runs to avoid post-hoc selection.
    • Emphasize documentation standards: missing or ambiguous documentation (sentinels, variable mapping, omitted-country choices) is a major bottleneck for digit-level reproducibility.
  • Research & policy directions
    • Develop standardized templates for machine-executable decision rules (so verdict-layer discipline is enforced by code).
    • Integrate adversarial prompt testing into evaluation pipelines (test whether small changes to prompts change narration without changing estimates).
    • Extend studies across diverse empirical domains, agent families, and prompt formats to assess generality.
    • Consider incentives and norms: journals and funders should require archived agent prompts and explicit verdict-rule pre-registration when agents participate in analysis.
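
The recommended automated checks (a verdict cross-check against a pre-specified rule, plus an omission audit) might look roughly like the following. This is a sketch under assumptions, not a template from the paper: the `Result` type, the `rule_verdict` threshold, the verdict labels, and the model IDs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One executed model, identified by ID, with its 95% CI for the AME."""
    model_id: str
    ci_low: float
    ci_high: float


def rule_verdict(results: list[Result], min_share: float = 0.5) -> str:
    """Assumed pre-registered rule: 'support' if more than min_share of all
    executed models have a 95% CI entirely below zero."""
    supporting = sum(r.ci_high < 0 for r in results)
    return "support" if supporting > min_share * len(results) else "no support"


def audit(executed: list[Result], narrated_ids: set[str],
          narrated_verdict: str) -> dict:
    """Compare an agent's written conclusion with the full log of models."""
    omitted = sorted(r.model_id for r in executed
                     if r.model_id not in narrated_ids)
    expected = rule_verdict(executed)
    return {
        "expected_verdict": expected,
        "verdict_matches_rule": narrated_verdict == expected,
        "omitted_models": omitted,
    }


# Hypothetical run: three models executed, but only one cited in the narrative.
executed = [
    Result("m1", -0.10, 0.06),
    Result("m2", -0.15, -0.01),
    Result("m3", -0.02, 0.08),
]
report = audit(executed, narrated_ids={"m2"}, narrated_verdict="support")
print(report)
```

Because the rule is applied to the logged models rather than to whatever the narrative chooses to cite, a conclusion built on selectively reported results is flagged both by the verdict mismatch and by the list of omitted models.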

Limitations

  • Single empirical setting (immigration / social-policy ISSP replication). Results may vary by domain, data complexity, and agent versions.
  • Agents and labels used are specific (Opus 4.7, GPT 5.5 referenced in the paper); newer or different models could behave differently.
  • The experiments used a fixed natural-language prompt and two targeted prompt interventions; other prompt designs or human-in-the-loop workflows may change outcomes.

Concise takeaway: LLM coding agents can keep or increase methodological pluralism, but without strict, pre-committed rules for mapping estimates to conclusions, they are susceptible to prompt-driven interpretive bias. For AI-assisted empirical economics, the priority should be enforceable verdict-layer discipline (explicit decision rules, prompt logging, and automated checks) alongside preserving multiverse-style design exploration.

Assessment

Paper Type: quasi_experimental

Evidence Strength: medium. The paper uses a clear experimental setup with repeated agent runs and targeted prompt interventions that give good internal leverage on whether agents change design choices or verdicts; however strength is limited by a single empirical dataset, only two LLM families, modest numbers of runs, and possible sensitivity to prompt phrasing and model versions, reducing external validity.

Methods Rigor: medium. The approach systematically decomposes methodological (design) vs interpretive (verdict) layers and applies controlled prompt treatments, but the paper appears to rely on a relatively small number of model draws and one substantive application, and it is unclear whether procedures (e.g., prompt templates, coder checks, pre-registration) fully guard against researcher degrees of freedom or measurement noise in classification of specifications/verdicts.

Sample: 20 independent executions each of Claude Code and Codex applied to a high-profile immigration and social-policy empirical analysis, compared against a published many-analysts human baseline; interventions include a prompt that primes an anti-immigration researcher prior and an explicit confirmatory prompt; outcomes are the set of methodological specifications chosen, resulting coefficient estimates, and final verdicts (support/oppose).

Themes: human_ai_collab, governance

Identification: Compare multiple independent executions (20 runs) of two LLM-based agents (Claude Code and Codex) to a many-analysts human baseline on a single prominent immigration/social-policy dataset; implement prompt-based interventions (an anti-immigration researcher prior and a confirmatory prompt) as causal manipulations and measure their effect on (a) the design layer: specification choices and resulting coefficient estimates and (b) the verdict layer: the mapping from estimates to substantive claims. Identification of agent bias relies on within-agent repeated runs and between-condition contrasts (baseline vs biased prompt), and on contrast with human analysts' behavior.

Generalizability:
  • Single substantive dataset (immigration/social policy): may not generalize to other topics or data structures
  • Only two LLM systems and specific model versions: other models or updates may behave differently
  • Limited number of independent runs (20): sampling variability and rare behaviors may be missed
  • Results may depend on exact prompt design and implementation details
  • Human baseline specifics (who the human analysts were, incentives, instructions) may limit comparability across contexts
  • Time-bound: model behavior can change with updates or fine-tuning

Claims (11)

Claim | Direction | Confidence | Outcome | Details
We run 20 independent executions of Claude Code and Codex on a prominent immigration and social-policy problem and compare them against a many-analysts human baseline. [Research Productivity] | null_result | high | execution/sample of agent analyses compared to human many-analysts baseline | n=20; 0.48
At the design layer, Codex matches human methodological diversity. [Research Productivity] | null_result | high | methodological diversity (variety of model/specification choices) | n=20; 0.48
Claude Code produces nearly three times as many specifications as the human analysts. [Research Productivity] | positive | high | number of specifications / methodological choices produced | n=20; nearly three times as many specifications; 0.48
Both agents' effect estimates remain broadly aligned with the human consensus. [Research Productivity] | null_result | high | distribution of estimated effects (coefficients) relative to human consensus | n=20; 0.48
No agent model exactly matches any human model. [Research Productivity] | null_result | high | exact match count between agent models and human analyst models | n=20; 0.48
A prompt-induced anti-immigration researcher prior reorganizes each agent's methodological decisions. [Research Productivity] | positive | high | change/reorganization in methodological decisions in response to an anti-immigration prior prompt | n=20; 0.48
Unlike for biased human analysts in the same data, the anti-immigration prior prompt does not shift agents' aggregate estimates or final verdicts. [Decision Quality] | null_result | high | aggregate effect estimates and final verdict support rates under the anti-immigration prior prompt | n=20; 0.48
Agents do not reroute along the methodological axes humans use to bias their estimates. [Research Productivity] | negative | high | alignment of changed methodological axes between agents and biased human analysts | n=20; 0.48
At the verdict layer, an explicit confirmatory prompt flips Claude Code's verdicts from 10% to 90% support while leaving its coefficient distribution essentially unchanged, operating through rule omission rather than rule softening. [Decision Quality] | positive | high | final verdict support rate (%) and coefficient distribution | n=20; from 10% to 90% support; 0.48
AI agents can rival or exceed human methodological diversity at the design layer while remaining vulnerable at the verdict layer. [Research Productivity] | mixed | high | methodological diversity at design layer and vulnerability of final verdicts at verdict layer | n=20; 0.48
In our setting, the locus of AI bias is not estimation but interpretation. [Decision Quality] | null_result | high | whether bias manifests in estimation (coefficients) versus interpretation (verdicts) | n=20; 0.48

Notes