Large language models tilt toward government-intervention explanations: across 1,056 ideology-contested economic cases, 18 of 20 models are more accurate when the true effect aligns with intervention-friendly expectations and, when wrong, disproportionately predict intervention-oriented effects, a bias that one-shot prompting does not remove.
Do large language models (LLMs) exhibit systematic ideological bias when reasoning about economic causal effects? As LLMs are increasingly used in policy analysis and economic reporting, where directionally correct causal judgments are essential, this question has direct practical stakes. We present a systematic evaluation by extending the EconCausal benchmark with ideology-contested cases: instances where intervention-oriented (pro-government) and market-oriented (pro-market) perspectives predict divergent causal signs. From 10,490 causal triplets (treatment-outcome pairs with empirically verified effect directions) derived from top-tier economics and finance journals, we identify 1,056 ideology-contested instances and evaluate 20 state-of-the-art LLMs on their ability to predict empirically supported causal directions. We find that ideology-contested items are consistently harder than non-contested ones, and that across 18 of 20 models, accuracy is systematically higher when the empirically verified causal sign aligns with intervention-oriented expectations than with market-oriented ones. Moreover, when models err, their incorrect predictions disproportionately lean intervention-oriented, and this directional skew is not eliminated by one-shot in-context prompting. These results highlight that LLMs are not only less accurate on ideologically contested economic questions, but systematically less reliable in one ideological direction than the other, underscoring the need for direction-aware evaluation in high-stakes economic and policy settings.
Summary
Main Finding
Large language models systematically struggle more with economic causal questions that are ideologically contested, and their errors are directionally skewed toward intervention-oriented (pro-government) reasoning. Across 20 state-of-the-art LLMs evaluated on 1,056 "ideology-contested" treatment–outcome triplets drawn from a 10,490-instance EconCausal benchmark, models are both less accurate on contested items and more likely to (incorrectly) predict intervention-aligned causal signs than market-aligned ones. This asymmetry is robust to difficulty matching and persists under one-shot in-context prompting.
Key Points
- Data scope: EconCausal contains 10,490 context-annotated causal triplets (treatment, outcome, empirical sign) extracted from top-tier economics and finance journals. The authors identify 1,056 ideology-contested triplets (10.1%) where intervention- and market-oriented priors predict opposite effect signs.
- Labeling of ideological expectations: For each triplet, four LLMs (GPT, Claude, Qwen, Grok) were queried, and majority-vote intervention- and market-oriented expected signs were formed. Expert validation on a representative subset showed 93.3% inter-rater agreement.
- Models evaluated: 20 LLMs spanning closed-source (OpenAI GPT family, Anthropic Claude, Google Gemini, xAI Grok) and open-source families (Llama, Qwen), across multiple sizes.
- Main empirical patterns:
  - Contested-item difficulty: All 20 models perform worse on ideology-contested items than on non-contested items (closed-source mean: 61.3% vs 74.5%; open-source mean: 52.5% vs 63.8%).
  - Asymmetric accuracy: Models are systematically more accurate when the empirical (paper-based) sign aligns with intervention-oriented expectations. Closed-source mean ∆acc ≈ +9.7 percentage points (∆acc = Acc_intervention − Acc_market); open-source mean ∆acc ≈ +15.1 pp. After difficulty matching, the effect remains (matched-sample ∆acc ≈ +10–11 pp).
  - Directional error bias: Errors disproportionately predict intervention-aligned signs. Directional bias metric Bdir is positive for the majority of models (closed-source mean ≈ +2.9; open-source mean ≈ +8.8). 18/20 models show a positive accuracy gap (∆acc > 0); 15/20 show Bdir > 0.
- Heterogeneity by subfield: The intervention advantage is strongest in domains like healthcare and welfare/redistribution, and weaker or reversed in others (e.g., taxation).
- In-context steering: Providing a one-shot intervention-aligned example tends to increase accuracy on intervention-truth targets more than providing a market-aligned example (closed-source mean Intervention–Market example gap ≈ +4.0 pp; open-source ≈ +1.8 pp). Steering does not reliably remove the intervention-leaning bias and can sometimes reinforce it.
- Robustness: Findings hold after difficulty matching (using GPT-5-mini scoring) and in logistic regressions with difficulty, subfield, and model fixed effects.
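The two direction-aware metrics reported above can be illustrated with a minimal Python sketch. It assumes ∆acc is the accuracy on intervention-truth items minus the accuracy on market-truth items, and that Bdir is the share of errors matching the intervention-expected sign minus the share matching the market-expected sign. Counting an error as intervention-aligned when the wrong prediction equals the intervention-expected sign is one reading of the summary, not a confirmed detail of the paper. The names `Item`, `delta_acc`, `b_dir`, and `toy` are hypothetical, the toy data is invented purely for illustration, and scaling both metrics by 100 is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    truth: str           # empirical sign reported by the paper, e.g. "+" or "-"
    s_intervention: str  # sign expected under the intervention-oriented frame
    s_market: str        # sign expected under the market-oriented frame
    pred: str            # sign predicted by the evaluated model


def accuracy(items):
    """Fraction of items whose predicted sign equals the empirical sign."""
    if not items:
        raise ValueError("accuracy is undefined on an empty subset")
    return sum(it.pred == it.truth for it in items) / len(items)


def delta_acc(items):
    """Accuracy gap in percentage points: intervention-truth minus market-truth."""
    interv_truth = [it for it in items if it.truth == it.s_intervention]
    market_truth = [it for it in items if it.truth == it.s_market]
    return 100.0 * (accuracy(interv_truth) - accuracy(market_truth))


def b_dir(items):
    """Directional error bias, scaled by 100 (assumed scaling).

    Errors that match neither expected sign (e.g. "Mixed") count only
    toward the total number of errors.
    """
    errors = [it for it in items if it.pred != it.truth]
    if not errors:
        return 0.0
    e_int = sum(it.pred == it.s_intervention for it in errors)
    e_mkt = sum(it.pred == it.s_market for it in errors)
    return 100.0 * (e_int - e_mkt) / len(errors)


# Invented toy predictions for one hypothetical model on eight contested items.
toy = [
    Item("+", "+", "-", "+"),      # intervention-truth, correct
    Item("+", "+", "-", "+"),      # intervention-truth, correct
    Item("-", "-", "+", "-"),      # intervention-truth, correct
    Item("+", "+", "-", "-"),      # intervention-truth, wrong (market-aligned)
    Item("-", "+", "-", "-"),      # market-truth, correct
    Item("-", "+", "-", "+"),      # market-truth, wrong (intervention-aligned)
    Item("+", "-", "+", "-"),      # market-truth, wrong (intervention-aligned)
    Item("-", "+", "-", "Mixed"),  # market-truth, wrong (neither direction)
]

print(delta_acc(toy))  # 50.0
print(b_dir(toy))      # 25.0
```

In the toy data, accuracy is 75% on intervention-truth items and 25% on market-truth items, and two of four errors are intervention-aligned against one market-aligned, so both metrics come out positive. A model with equal overall accuracy but symmetric errors would score zero on both, which is why aggregate accuracy alone cannot reveal the asymmetry.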
Data & Methods
- Base benchmark: EconCausal (10,490 triplets) — each instance includes a context paragraph and an empirical sign label from the original paper. Sign classes: {+, −, None, Mixed}.
- Identification of ideology-contested triplets:
  - For each triplet, four LLMs were asked (while blinded to the paper result) to give expected signs under (a) an intervention-oriented frame and (b) a market-oriented frame.
  - A triplet is labeled ideology-contested if at least three of four models assigned different signs to the two perspectives; final s_intervention and s_market are majority votes.
  - Result: 1,056 contested triplets; among directional contested items, 436 intervention-truth (58.1%) and 315 market-truth (41.9%).
- Expert validation: Two economics professors reviewed a representative sample (126 triplets), showing high labeling accuracy and agreement.
- Evaluation metrics:
  - Accuracy on full, contested, and non-contested subsets.
  - ∆acc = Acc_intervention − Acc_market (accuracy gap between intervention-truth and market-truth items).
  - Bdir = (Errors_intervention − Errors_market) / Errors_total (directional error bias).
- Models: 20 LLMs across families and sizes; closed- and open-source groups analyzed separately and together.
- Robustness checks:
  - Difficulty matching: triplets scored 1–5 by GPT-5-mini; intervention- and market-truth items matched within theme × difficulty cells (287 per side).
  - Logistic regression controlling for difficulty, subfield, and model fixed effects.
- In-context steering experiments: four conditions per target (NONE, NON-CONTESTED example, INTERVENTION-EX, MARKET-EX) with example-target matching on JEL categories and context similarity.
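The contested-triplet labeling rule described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the function names `majority` and `label_triplet`, the input layout (one `(intervention_sign, market_sign)` pair per labeling model), and the tie-breaking rule (first sign encountered wins) are all hypothetical, since the summary does not say how ties among four voters are resolved.

```python
from collections import Counter


def majority(signs):
    """Most frequent sign; ties go to the sign seen first (an assumed rule)."""
    return Counter(signs).most_common(1)[0][0]


def label_triplet(votes, min_disagree=3):
    """Apply the 'at least three of four' rule to one triplet.

    votes: list of (intervention_sign, market_sign) pairs, one per labeling
    model. Returns (contested, s_intervention, s_market).
    """
    # A labeling model "disagrees" when it expects different signs
    # under the two ideological frames.
    n_disagree = sum(s_int != s_mkt for s_int, s_mkt in votes)
    contested = n_disagree >= min_disagree
    s_intervention = majority([s_int for s_int, _ in votes])
    s_market = majority([s_mkt for _, s_mkt in votes])
    return contested, s_intervention, s_market


# Three of four hypothetical labelers expect opposite signs: contested.
print(label_triplet([("+", "-"), ("+", "-"), ("+", "-"), ("+", "+")]))
# (True, '+', '-')

# Only one labeler expects opposite signs: not contested.
print(label_triplet([("+", "+"), ("-", "-"), ("+", "-"), ("+", "+")])[0])
# False
```

Keeping the threshold as a parameter (`min_disagree`) makes it easy to check how sensitive the contested set is to a stricter rule such as unanimous disagreement.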
Implications for AI Economics
- Evaluation design: Causal-reasoning benchmarks for economics should explicitly account for ideologically contested cases and measure directional reliability, not just overall accuracy. Direction-aware metrics (∆acc, Bdir) reveal systematic asymmetries that aggregate accuracy masks.
- Risk in high-stakes use: When LLMs are used for policy analysis, reporting, or decision support, their intervention-leaning errors can systematically tilt recommendations or summaries toward pro-intervention conclusions. This is especially consequential in domains (healthcare, welfare) where the asymmetry is largest.
- Prompting is insufficient: One-shot in-context examples can nudge outputs but do not reliably eliminate asymmetric bias—and may sometimes reinforce it—so prompt engineering alone is not a robust mitigation.
- Model development and auditing:
  - Mitigation strategies should go beyond surface prompting: training- or fine-tuning-level interventions (calibration for directional balance, counterfactual data augmentation, debiasing objectives) and model auditing across ideological frames are needed.
  - Developers and auditors should report directional performance breakdowns for causal tasks by ideological frame and subfield.
- Best practices for practitioners:
  - Treat LLM causal sign outputs as priors, not definitive empirical conclusions. Require provenance linking to empirical sources and quantify uncertainty.
  - Use ensembles, cross-model checks, or explicit causal inference tools (instrumental variables, difference-in-differences, mechanistic models) to corroborate sign predictions.
  - Include human expert review for ideologically contested policy analyses.
- Research directions:
  - Investigate training-data origins of the intervention prior (corpus composition, instruction tuning, reward models).
  - Develop targeted debiasing methods that preserve causal inference ability while removing directional ideological tilt.
  - Expand analysis to magnitudes, heterogeneous effects, multi-step causal chains, and other domains beyond top-tier econ/finance publications.
Limitations noted by the authors
- The contested-label creation used other LLMs to elicit ideological expectations (possible circularity), though expert validation supports label stability.
- Ground-truth signs are paper-specific and may reflect publication practices; the task is sign prediction, not effect size or external causal validity.
- Dataset is concentrated in top-tier econ/finance journals; findings may not generalize to other domains or languages.
Overall, the paper demonstrates that LLMs' economic causal reasoning is not only less reliable on ideologically contested questions but also systematically skewed toward intervention-oriented conclusions—an important consideration for any AI deployment in economic policy or reporting.
Assessment
Claims (7)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| From 10,490 causal triplets (treatment-outcome pairs with empirically verified effect directions) derived from top-tier economics and finance journals, we identify 1,056 ideology-contested instances. [Other] | null_result | high | dataset_counts (number of causal triplets and contested instances) | n=10490; 0.3 |
| We evaluate 20 state-of-the-art LLMs on their ability to predict empirically supported causal directions. [Decision Quality] | null_result | high | model accuracy at predicting empirically verified causal directions | n=20; 0.18 |
| Ideology-contested items are consistently harder than non-contested ones. [Decision Quality] | negative | high | accuracy (difficulty of items measured by model error rate) | n=1056; 0.18 |
| Across 18 of 20 models, accuracy is systematically higher when the empirically verified causal sign aligns with intervention-oriented expectations than with market-oriented ones. [Decision Quality] | positive | high | accuracy conditional on ideological alignment (intervention-oriented vs market-oriented) | n=20; 0.18 |
| When models err, their incorrect predictions disproportionately lean intervention-oriented. [Decision Quality] | positive | high | directional bias in errors (proportion of errors that are intervention-oriented) | 0.18 |
| This directional skew is not eliminated by one-shot in-context prompting. [Training Effectiveness] | negative | high | effectiveness of one-shot in-context prompting at reducing ideological directional bias | 0.18 |
| LLMs are not only less accurate on ideologically contested economic questions, but systematically less reliable in one ideological direction than the other, underscoring the need for direction-aware evaluation in high-stakes economic and policy settings. [Decision Quality] | negative | high | overall model reliability and directional bias on ideologically contested causal questions | n=1056; 0.18 |