Large language models tilt toward government-intervention explanations: across 1,056 ideology-contested economic cases, 18 of 20 models are more accurate when the true effect aligns with intervention-friendly expectations and, when wrong, disproportionately predict intervention-oriented effects, a bias that one-shot prompting does not remove.

Ideological Bias in LLMs' Economic Causal Reasoning

Donggyu Lee, Hyeok Yun, Jungwon Kim, Junsik Min, Sungwon Park, Sangyoon Park, Jihee Kim · April 23, 2026

arXiv · descriptive · medium evidence · 8/10 relevance

LLMs are less accurate on ideologically contested economic causal questions and systematically skew toward intervention-oriented (pro-government) causal predictions, a pattern that persists across most models and resists one-shot prompting.

Do large language models (LLMs) exhibit systematic ideological bias when reasoning about economic causal effects? As LLMs are increasingly used in policy analysis and economic reporting, where directionally correct causal judgments are essential, this question has direct practical stakes. We present a systematic evaluation by extending the EconCausal benchmark with ideology-contested cases - instances where intervention-oriented (pro-government) and market-oriented (pro-market) perspectives predict divergent causal signs. From 10,490 causal triplets (treatment-outcome pairs with empirically verified effect directions) derived from top-tier economics and finance journals, we identify 1,056 ideology-contested instances and evaluate 20 state-of-the-art LLMs on their ability to predict empirically supported causal directions. We find that ideology-contested items are consistently harder than non-contested ones, and that across 18 of 20 models, accuracy is systematically higher when the empirically verified causal sign aligns with intervention-oriented expectations than with market-oriented ones. Moreover, when models err, their incorrect predictions disproportionately lean intervention-oriented, and this directional skew is not eliminated by one-shot in-context prompting. These results highlight that LLMs are not only less accurate on ideologically contested economic questions, but systematically less reliable in one ideological direction than the other, underscoring the need for direction-aware evaluation in high-stakes economic and policy settings.

Summary

Main Finding

Large language models systematically struggle more with economic causal questions that are ideologically contested, and their errors are directionally skewed toward intervention-oriented (pro-government) reasoning. Across 20 state-of-the-art LLMs evaluated on 1,056 "ideology-contested" treatment–outcome triplets drawn from a 10,490-instance EconCausal benchmark, models are both less accurate on contested items and more likely to (incorrectly) predict intervention-aligned causal signs than market-aligned ones. This asymmetry is robust to difficulty matching and persists under one-shot in‑context prompting.

Key Points

Data scope: EconCausal contains 10,490 context-annotated causal triplets (treatment, outcome, empirical sign) extracted from top-tier economics and finance journals. The authors identify 1,056 ideology-contested triplets (10.1%) where intervention- and market-oriented priors predict opposite effect signs.
Labeling of ideological expectations: For each triplet, four LLMs (GPT, Claude, Qwen, Grok) were queried, and majority-vote intervention- and market-oriented expected signs were formed. Expert validation on a representative subset showed high agreement (inter-rater agreement 93.3%).
Models evaluated: 20 LLMs spanning closed-source (OpenAI GPT family, Anthropic Claude, Google Gemini, xAI Grok) and open-source families (Llama, Qwen), across multiple sizes.
Main empirical patterns:
- Contested-item difficulty: All 20 models perform worse on ideology-contested items than on non-contested items (closed-source mean: 61.3% vs 74.5%; open-source mean: 52.5% vs 63.8%).
- Asymmetric accuracy: Models are systematically more accurate when the empirical (paper-based) sign aligns with intervention-oriented expectations. Closed-source mean ∆acc ≈ +9.7 percentage points (∆acc = Acc_intervention − Acc_market); open-source mean ∆acc ≈ +15.1 pp. After difficulty matching, the effect remains (matched-sample ∆acc ≈ +10–11 pp).
- Directional error bias: Errors disproportionately predict intervention-aligned signs. The directional bias metric B_dir is positive for the majority of models (closed-source mean ≈ +2.9; open-source mean ≈ +8.8). 18/20 models show a positive accuracy gap (∆acc > 0); 15/20 show B_dir > 0.
- Heterogeneity by subfield: The intervention advantage is strongest in domains such as healthcare and welfare/redistribution, and weaker or reversed in others (e.g., taxation).
In-context steering: Providing a one-shot intervention-aligned example tends to increase accuracy on intervention-truth targets more than providing a market-aligned example (closed-source mean Intervention–Market example gap ≈ +4.0 pp; open-source ≈ +1.8 pp). Steering does not reliably remove the intervention-leaning bias and can sometimes reinforce it.
Robustness: Findings hold after difficulty matching (using GPT-5-mini scoring) and in logistic regressions with difficulty, subfield, and model fixed effects.

Data & Methods

Base benchmark: EconCausal (10,490 triplets) — each instance includes a context paragraph and an empirical sign label from the original paper. Sign classes: {+, −, None, Mixed}.
Identification of ideology-contested triplets:
- For each triplet, four LLMs were asked (while blinded to the paper result) to give expected signs under (a) an intervention-oriented frame and (b) a market-oriented frame.
- A triplet is labeled ideology-contested if at least three of four models assigned different signs to the two perspectives; final s_intervention and s_market are majority votes.
- Result: 1,056 contested triplets; among directional contested items, 436 intervention-truth (58.1%) and 315 market-truth (41.9%).
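The labeling rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the input layout (one pair of expected signs per labeling model), and the tie handling (returning None when no sign has a strict plurality) are all assumptions made for the example.

```python
from collections import Counter

def majority_sign(votes):
    """Most common sign among the votes; None if the top two signs tie (assumed tie rule)."""
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]

def label_triplet(judgments, min_disagreeing=3):
    """judgments: list of (intervention_sign, market_sign), one pair per labeling model."""
    # Count models whose two frames yield different expected signs.
    disagreeing = sum(1 for s_int, s_mkt in judgments if s_int != s_mkt)
    return {
        "contested": disagreeing >= min_disagreeing,
        "s_intervention": majority_sign([s_int for s_int, _ in judgments]),
        "s_market": majority_sign([s_mkt for _, s_mkt in judgments]),
    }

# Hypothetical judgments from four labeling models for one triplet.
example = [("+", "-"), ("+", "-"), ("+", "-"), ("+", "+")]
result = label_triplet(example)
print(result)
```

Here three of the four hypothetical models separate the two frames, so the triplet counts as contested under the three-of-four threshold.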
Expert validation: Two economics professors reviewed a representative sample (126 triplets), showing high labeling accuracy and agreement.
Evaluation metrics:
- Accuracy on full, contested, and non-contested subsets.
- ∆acc = Acc_intervention − Acc_market (accuracy gap between intervention-truth and market-truth items).
- B_dir = (Errors_intervention − Errors_market) / Errors_total (directional error bias).
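As a worked illustration of the accuracy gap and the directional error bias, the sketch below computes both on a toy set of six contested items. The dictionary keys are hypothetical, and one counting choice is an assumption: an error is attributed to the intervention (or market) side when the wrong prediction equals that side's expected sign, while all wrong predictions count toward the total.

```python
def directional_metrics(items):
    """items: dicts with keys 'truth', 's_intervention', 's_market', 'pred' (hypothetical layout)."""
    int_truth = [it for it in items if it["truth"] == it["s_intervention"]]
    mkt_truth = [it for it in items if it["truth"] == it["s_market"]]

    def accuracy(subset):
        return sum(it["pred"] == it["truth"] for it in subset) / len(subset) if subset else float("nan")

    errors = [it for it in items if it["pred"] != it["truth"]]
    # Assumed attribution: a wrong prediction "leans" toward the frame whose expected sign it matches.
    err_int = sum(it["pred"] == it["s_intervention"] for it in errors)
    err_mkt = sum(it["pred"] == it["s_market"] for it in errors)

    delta_acc = accuracy(int_truth) - accuracy(mkt_truth)
    b_dir = (err_int - err_mkt) / len(errors) if errors else 0.0
    return delta_acc, b_dir

def item(truth, pred):
    # Toy items share the same frame expectations: intervention expects '+', market expects '-'.
    return {"truth": truth, "s_intervention": "+", "s_market": "-", "pred": pred}

toy = [item("+", "+"), item("+", "+"), item("+", "-"),
       item("-", "+"), item("-", "-"), item("-", "+")]
delta_acc, b_dir = directional_metrics(toy)
print(round(delta_acc, 3), round(b_dir, 3))
```

On the toy data, accuracy is 2/3 on intervention-truth items and 1/3 on market-truth items, and two of the three errors predict the intervention-expected sign, so both quantities equal 1/3.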
Models: 20 LLMs across families and sizes; closed- and open-source groups analyzed separately and together.
Robustness checks:
- Difficulty matching: triplets scored 1–5 by GPT-5-mini; intervention- and market-truth items matched within theme × difficulty cells (287 per side).
- Logistic regression controlling for difficulty, subfield, and model fixed effects.
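One way to picture the difficulty-matching step is as balanced sampling within theme × difficulty cells. The sketch below is an assumption about the mechanics, not the authors' procedure: it randomly downsamples the larger side in each cell to the size of the smaller side, and the field names (`theme`, `difficulty`, `side`) are hypothetical.

```python
import random
from collections import defaultdict

def match_within_cells(items, seed=0):
    """Keep equal numbers of intervention- and market-truth items in each (theme, difficulty) cell."""
    rng = random.Random(seed)
    cells = defaultdict(lambda: {"intervention": [], "market": []})
    for it in items:
        cells[(it["theme"], it["difficulty"])][it["side"]].append(it)

    matched = {"intervention": [], "market": []}
    for key in sorted(cells):
        sides = cells[key]
        k = min(len(sides["intervention"]), len(sides["market"]))
        for side in matched:
            # Downsample each side to the size of the smaller one (assumed matching rule).
            matched[side].extend(rng.sample(sides[side], k))
    return matched

# Toy data: (theme, difficulty 1-5, side whose expectation matches the empirical sign).
toy_items = [
    {"theme": "health", "difficulty": 2, "side": "intervention"},
    {"theme": "health", "difficulty": 2, "side": "intervention"},
    {"theme": "health", "difficulty": 2, "side": "intervention"},
    {"theme": "health", "difficulty": 2, "side": "market"},
    {"theme": "tax", "difficulty": 4, "side": "intervention"},
    {"theme": "tax", "difficulty": 4, "side": "market"},
    {"theme": "tax", "difficulty": 4, "side": "market"},
    {"theme": "tax", "difficulty": 5, "side": "intervention"},
]
matched = match_within_cells(toy_items)
print(len(matched["intervention"]), len(matched["market"]))
```

Cells with items on only one side contribute nothing, which is why a matched sample can be smaller than either original group.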
In-context steering experiments: four conditions per target (NONE, NON-CONTESTED example, INTERVENTION-EX, MARKET-EX) with example-target matching on JEL categories and context similarity.

Implications for AI Economics

Evaluation design: Causal-reasoning benchmarks for economics should explicitly account for ideologically contested cases and measure directional reliability, not just overall accuracy. Direction-aware metrics (∆acc, Bdir) reveal systematic asymmetries that aggregate accuracy masks.
Risk in high-stakes use: When LLMs are used for policy analysis, reporting, or decision support, their intervention-leaning errors can systematically tilt recommendations or summaries toward pro-intervention conclusions. This is especially consequential in domains (healthcare, welfare) where the asymmetry is largest.
Prompting is insufficient: One-shot in-context examples can nudge outputs but do not reliably eliminate asymmetric bias—and may sometimes reinforce it—so prompt engineering alone is not a robust mitigation.
Model development and auditing:
- Mitigation strategies should go beyond surface prompting: training- or fine-tuning-level interventions (calibration for directional balance, counterfactual data augmentation, debiasing objectives) and model auditing across ideological frames are needed.
- Developers and auditors should report directional performance breakdowns for causal tasks by ideological frame and subfield.
Best practices for practitioners:
- Treat LLM causal sign outputs as priors, not definitive empirical conclusions. Require provenance linking to empirical sources and quantify uncertainty.
- Use ensembles, cross-model checks, or explicit causal inference tools (instrumental variables, difference-in-differences, mechanistic models) to corroborate sign predictions.
- Include human expert review for ideologically contested policy analyses.
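A cross-model check of the kind suggested above could look like the following sketch. The model names, the 0.75 agreement threshold, and the return format are hypothetical. Because the reported skew appears across most models, agreement between models should not be read as evidence that the shared sign is unbiased; the check only flags disagreement for review.

```python
from collections import Counter

def cross_model_sign(predictions, min_agreement=0.75):
    """predictions: dict mapping a model name to its predicted causal sign."""
    counts = Counter(predictions.values())
    sign, votes = counts.most_common(1)[0]
    share = votes / len(predictions)
    agreed = share >= min_agreement
    return {
        "sign": sign if agreed else None,   # withhold a sign when models disagree too much
        "agreement": share,
        "needs_review": not agreed,          # route to a human expert
    }

# Hypothetical predictions from four models for one contested question.
consensus = cross_model_sign({"model_a": "+", "model_b": "+", "model_c": "+", "model_d": "-"})
split = cross_model_sign({"model_a": "+", "model_b": "-", "model_c": "+", "model_d": "-"})
print(consensus, split)
```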
Research directions:
- Investigate training-data origins of the intervention prior (corpus composition, instruction tuning, reward models).
- Develop targeted debiasing methods that preserve causal inference ability while removing directional ideological tilt.
- Expand analysis to magnitudes, heterogeneous effects, multi-step causal chains, and other domains beyond top-tier econ/finance publications.

Limitations noted by the authors
- The contested-label creation used other LLMs to elicit ideological expectations (possible circularity), though expert validation supports label stability.
- Ground-truth signs are paper-specific and may reflect publication practices; the task is sign prediction, not effect size or external causal validity.
- Dataset is concentrated in top-tier econ/finance journals; findings may not generalize to other domains or languages.

Overall, the paper demonstrates that LLMs' economic causal reasoning is not only less reliable on ideologically contested questions but also systematically skewed toward intervention-oriented conclusions—an important consideration for any AI deployment in economic policy or reporting.

Assessment

Paper Type: descriptive
Evidence Strength: medium. Large, systematic evaluation (10,490 triplets; 1,056 contested items) across 20 state-of-the-art models with consistent patterns (18/20 models), providing strong descriptive evidence of directional bias; however, conclusions depend on the benchmark's labeling of empirical signs and the assignment of intervention- vs market-oriented expectations, are sensitive to prompt design and model versions, and assess only sign (direction) rather than effect magnitude or downstream decision impact.
Methods Rigor: medium. The study extends an existing benchmark, uses a sizable, literature-verified dataset, evaluates many contemporary models, and tests one-shot prompting, which demonstrates methodological care; but potential concerns remain about subjectivity in classifying ideological expectations, limited robustness checks across alternative prompts, languages, or model updates, and absence of deeper causal attribution for why models skew toward intervention-oriented predictions.
Sample: 10,490 causal triplets (treatment-outcome pairs with empirically verified effect directions) derived from top-tier economics and finance journals via the EconCausal benchmark; 1,056 of these labeled as ideology-contested (intervention-oriented vs market-oriented expectations disagree); evaluated against 20 state-of-the-art LLMs under a standard prompting protocol including a one-shot in-context prompt.
Themes: human_ai_collab, governance
Identification: Compare LLM-predicted causal sign for treatment-outcome pairs to empirically established causal directions drawn from top-tier economics and finance journals (the EconCausal benchmark); classify items as ideology-contested when intervention-oriented and market-oriented perspectives predict opposite signs; measure accuracy, error-direction skew, and robustness to one-shot in-context prompting across 20 LLMs.
Generalizability:
- Findings limited to directional (sign) predictions, not effect magnitudes or confidence calibration.
- Classification of intervention- vs market-oriented expectations may involve subjective judgment or contextual nuances.
- Prompting format (including one-shot) and exact prompt wording affect results; other prompting strategies not exhaustively tested.
- Evaluated models are a snapshot in time; results may change with model updates or other architectures.
- Dataset drawn from top-tier econ/finance journals and likely English-dominant; may not generalize to other domains, languages, or lower-tier literature.
- Does not measure real-world deployment effects (how analysts use LLM outputs) or downstream policy decisions.

Claims (7)

Claim	Category	Direction	Confidence	Outcome	Details
From 10,490 causal triplets (treatment-outcome pairs with empirically verified effect directions) derived from top-tier economics and finance journals, we identify 1,056 ideology-contested instances.	Other	null_result	high	dataset_counts (number of causal triplets and contested instances)	n=10490 0.3
We evaluate 20 state-of-the-art LLMs on their ability to predict empirically supported causal directions.	Decision Quality	null_result	high	model accuracy at predicting empirically verified causal directions	n=20 0.18
Ideology-contested items are consistently harder than non-contested ones.	Decision Quality	negative	high	accuracy (difficulty of items measured by model error rate)	n=1056 0.18
Across 18 of 20 models, accuracy is systematically higher when the empirically verified causal sign aligns with intervention-oriented expectations than with market-oriented ones.	Decision Quality	positive	high	accuracy conditional on ideological alignment (intervention-oriented vs market-oriented)	n=20 0.18
When models err, their incorrect predictions disproportionately lean intervention-oriented.	Decision Quality	positive	high	directional bias in errors (proportion of errors that are intervention-oriented)	0.18
This directional skew is not eliminated by one-shot in-context prompting.	Training Effectiveness	negative	high	effectiveness of one-shot in-context prompting at reducing ideological directional bias	0.18
LLMs are not only less accurate on ideologically contested economic questions, but systematically less reliable in one ideological direction than the other, underscoring the need for direction-aware evaluation in high-stakes economic and policy settings.	Decision Quality	negative	high	overall model reliability and directional bias on ideologically contested causal questions	n=1056 0.18