
Top-tier commercial LLMs can mimic official legislative reasoning for routine bills, while open-weight models lag by a full tier; however, all models occasionally fabricate plausible but unsupported rationales on novel or politically unusual proposals, creating a risk of contextual ignorance rather than stable bias.

Can Commercial LLMs Be Parliamentary Political Companions? Comparing LLM Reasoning Against Romanian Legislative Expuneri de Motive

Iulian Lucău, Adelin-George Voicu · March 31, 2026

Source: arXiv · Paper type: descriptive · Evidence strength: low · Relevance: 7/10

Frontier proprietary LLMs (GPT-5 variants and Claude Haiku 4.5) closely match official legislative rationales on routine templates, open-weight Llama models perform a full tier worse, and all models produce plausible but unfounded reasoning on politically idiosyncratic proposals.

This paper evaluates whether commercial large language models (LLMs) can function as reliable political advisory tools by comparing their outputs against official legislative reasoning. Using a dataset of 15 Romanian Senate law proposals paired with their official explanatory memoranda (expuneri de motive), we test six LLMs spanning three provider families and multiple capability tiers: GPT-5-mini, GPT-5-chat (OpenAI), Claude Haiku 4.5 (Anthropic), and Llama 4 Maverick, Llama 3.3 70B, and Llama 3.1 8B (Meta). Each model generates predicted rationales evaluated through a dual framework combining LLM-as-Judge semantic scoring and programmatic text similarity metrics. We frame the LLM-politician relationship through principal-agent theory and bounded rationality, conceptualizing the legislator as a principal delegating advisory tasks to a boundedly rational agent under structural information asymmetry. Results reveal a sharp two-tier structure: frontier models (Claude Haiku 4.5, GPT-5-chat, GPT-5-mini) achieve statistically indistinguishable semantic closeness scores above 4.6 out of 5.0, while open-weight models cluster a full tier below (Cohen's d larger than 1.4). However, all models exhibit task-dependent confabulation, performing well on standardized legislative templates (e.g., EU directive transpositions) but generating plausible yet unfounded reasoning for politically idiosyncratic proposals. We introduce the concept of cascading bounded rationality to describe how failures compound across bounded principals, agents, and evaluators, and argue that the operative risk for legislators is not stable ideological bias but contextual ignorance shaped by training data coverage.

Summary

Main Finding

Commercial LLMs can approximate official legislative reasoning in many routine cases, but performance splits sharply into two tiers. Frontier closed-source models (Claude Haiku 4.5, GPT-5-chat, GPT-5-mini) achieve high semantic closeness to Romanian Senate explanatory memoranda (mean closeness ≈ 4.63–4.64/5), while open-weight Llama models cluster a full tier below (means ≈ 3.75–4.00). The dominant failure mode is task-dependent confabulation driven by limited training-data coverage for politically idiosyncratic, domestic-context proposals—not a stable ideological bias. The paper frames these risks through principal–agent theory and introduces “cascading bounded rationality” to describe compounded failures when bounded principals, agents, and evaluators interact.

Key Points

Two-tier performance structure:
- Tier 1 (frontier): Claude Haiku 4.5, GPT-5-chat, GPT-5-mini — semantic closeness means ≈ 4.63–4.64/5; pairwise differences within tier not statistically significant.
- Tier 2 (open-weight): Llama 4 Maverick (MoE), Llama 3.3 70B, Llama 3.1 8B — means ≈ 3.75–4.00; between-tier effect sizes very large (Cohen’s d ≈ 1.45–1.83).
Failure modes:
- Good at template-like laws (EU directive transpositions, procedural amendments).
- Poorer on politically idiosyncratic or locally specific proposals: generate plausible but unfounded specifics (hallucinated article numbers, dates, institutions).
- Claude Haiku shows solid argument coverage but weaker factual grounding (coverage–factual gap), indicating structural reasoning with confabulated details.
Reasoning capability (chain-of-thought) has mixed effects:
- GPT-5-mini (reasoning) and GPT-5-chat (non-reasoning) tie on overall closeness, but GPT-5-mini yields better argument coverage.
- Improvements from chain-of-thought are mainly in completeness, not necessarily factual grounding.
Open-weight scaling plateau:
- Increasing Llama size (8B → 70B → MoE 17B/400B) yields only small gains; training data composition and alignment matter more than parameter count.
Metric divergence:
- Embedding/token-similarity metrics (e.g., cosine) can diverge from semantic judge scores; high embedding similarity does not guarantee better legislative reasoning.
Methodological risk:
- LLM-as-Judge can be a pragmatic scalable evaluator but is itself a bounded agent and may miss pervasive errors if evaluators share training biases.
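
The tier comparison above rests on two quantities: a rank-based test statistic and a standardized effect size. The sketch below shows one common way to compute each, assuming a pooled-standard-deviation form of Cohen's d and the pair-counting definition of the Mann–Whitney U statistic. The summary does not say which variants the authors used, and the score lists are hypothetical, not data from the paper.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d using a pooled standard deviation (sample variances)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


def mann_whitney_u(group_a, group_b):
    """U statistic for group_a: count pairs where a > b, ties count 0.5."""
    u = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


# Hypothetical 1-5 judge scores, for illustration only.
frontier_scores = [4.8, 4.6, 4.7, 4.5, 4.6]
open_weight_scores = [4.0, 3.6, 4.2, 3.8, 3.9]

d = cohens_d(frontier_scores, open_weight_scores)
u = mann_whitney_u(frontier_scores, open_weight_scores)
print(f"Cohen's d = {d:.2f}, U = {u:.1f}")
```

The U statistic alone is not a significance test; a p-value would come from its null distribution, which a statistics library normally provides.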

Data & Methods

Dataset:
- 15 real Romanian legislative change documents paired with official explanatory memoranda (expuneri de motive), in Romanian; topic span: labor, criminal, administrative law; ground-truth document lengths ~2k–35k chars.
Models:
- Six models accessed via OpenRouter: GPT-5-mini, GPT-5-chat (OpenAI); Claude Haiku 4.5 (Anthropic); Llama 4 Maverick (Meta MoE), Llama 3.3 70B, Llama 3.1 8B (Meta open-weight).
- Selection probes provider family, capability tier, chain-of-thought presence, and parameter/architecture scaling.
Generation:
- Single unified prompt instructing a Romanian legislative expert; Romanian outputs; temperature 0.3 (where supported); 5 independent runs per law–model (75 traces per model; 450 total).
Evaluation:
- Dual framework:
  - LLM-as-Judge semantic scoring (MLflow custom_prompt_judge) on 1–5 scales for: Argument Coverage, Factual Alignment, and Exposé des Motifs Closeness.
  - Programmatic metrics: ROUGE-1/2/L, Jaccard, trigram novelty, embedding cosine (paraphrase-multilingual-MiniLM-L12-v2), TF-IDF cosine, legal-entity overlap (pattern-matching for article numbers, law citations, dates), length ratio.
- Statistical tests: Mann–Whitney, effect sizes (Cohen’s d) reported for tier comparisons.
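
Some of the programmatic metrics listed above are simple enough to sketch with the standard library. The example below is a minimal illustration of Jaccard overlap, unigram recall in the style of ROUGE-1, and length ratio. The tokenizer, the function names, and the sample strings are assumptions made for illustration; the authors' actual implementation and preprocessing are not described in this summary.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercased word tokens; the word-character class is Unicode-aware
    # in Python 3, so Romanian diacritics are kept inside tokens.
    return re.findall(r"\w+", text.lower())


def jaccard(reference, candidate):
    """Size of the shared vocabulary divided by the combined vocabulary."""
    ref, cand = set(tokenize(reference)), set(tokenize(candidate))
    if not ref and not cand:
        return 1.0
    return len(ref & cand) / len(ref | cand)


def unigram_recall(reference, candidate):
    """Clipped unigram overlap divided by the reference token count."""
    ref, cand = Counter(tokenize(reference)), Counter(tokenize(candidate))
    total = sum(ref.values())
    return sum((ref & cand).values()) / total if total else 0.0


def length_ratio(reference, candidate):
    """Candidate length in characters relative to the reference."""
    return len(candidate) / len(reference) if reference else 0.0


# Hypothetical strings, not taken from the dataset.
reference = "Legea transpune Directiva 2019/1158"
candidate = "Proiectul transpune Directiva 2019/1158 in dreptul intern"

print(jaccard(reference, candidate))
print(unigram_recall(reference, candidate))
print(length_ratio(reference, candidate))
```

Lexical scores like these reward shared wording, which is one reason they can diverge from a judge's semantic rating of the same output.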

Implications for AI Economics

Principal–agent and procurement implications:
- Legislative offices choosing LLMs face strong adverse selection risks: model family and alignment matter more than raw parameter count. Market signals that ignore domain-specific evaluation will misprice advisory quality.
- Monitoring via automated LLM evaluators is useful but incomplete—evaluation itself is an agency layer with its own information asymmetries. Procurement should require independent, domain-specific audits.
Value of frontier vs open-weight models:
- Frontier closed-source models currently offer materially higher advisory value for legislative reasoning in low-resource languages/contexts. Economic tradeoffs include higher access costs and opaque training pipelines versus lower-cost open models with lower domain reliability.
Investment and specialization returns:
- Returns to local-data investment or targeted fine-tuning are likely high: contextual ignorance (training-data gaps) is the main failure driver. Funding localized legal corpora, supervised fine-tuning, or retrieval and grounding layers will yield outsized improvements relative to mere parameter scaling.
Externalities and policy risk:
- Confabulations can translate into misinformed policy drafts and flawed legislative decisions, producing negative social welfare externalities (inefficient allocation, legal ambiguity, reputational/political costs).
- These risks concentrate in jurisdictions and domains underrepresented in training data (non-Anglophone, local politics), suggesting unequal distribution of LLM advisory benefits and harms.
Market design and regulation:
- Recommendations include mandatory model cards, provenance/transparency for training data relevant to public-sector use, and minimum evaluation benchmarks for procurement in political settings.
- Economists modeling adoption should account for cascading bounded rationality: errors compound when bounded principals use bounded agents and bounded evaluators—this affects expected value of automation, optimal auditing investment, and equilibrium uptake of AI advisory services.
Operational recommendations for policymakers/legislators:
- Use frontier models for higher-quality drafts but always pair with human legal verification and entity-level grounding checks.
- Deploy ensemble or cross-model auditing rather than single-model reliance; prioritize models fine-tuned or augmented with local legal retrieval.
- Allocate budget toward domain-specific evaluation, localization datasets, and tools for detecting entity hallucinations (e.g., automated law-article verifiers).
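
An entity-level grounding check of the kind recommended above could start as simply as the sketch below. It extracts article numbers and law citations from a generated rationale and flags any that never appear in the bill text. The regular expressions are simplified assumptions about Romanian citation style, the function names are hypothetical, and the sample sentences are invented for illustration.

```python
import re

# Assumed, simplified citation patterns; real legal references vary more.
PATTERNS = {
    "article": re.compile(r"\bart\.\s*(\d+)", re.IGNORECASE),
    "law": re.compile(r"\bLegea\s+nr\.\s*(\d+/\d{4})", re.IGNORECASE),
}


def extract_entities(text):
    """Return a set of (kind, value) pairs for every pattern match."""
    found = set()
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.add((kind, match.group(1)))
    return found


def unsupported_entities(source_text, generated_text):
    """Entities cited in the generated text that are absent from the source."""
    return extract_entities(generated_text) - extract_entities(source_text)


# Invented example sentences, not drawn from the dataset.
source = "Se modifică art. 17 din Legea nr. 53/2003."
generated = "Propunerea modifică art. 17 și art. 42 din Legea nr. 53/2003."

flags = unsupported_entities(source, generated)
print(flags)
```

A flagged entity is only a candidate for review: a correct citation to a law outside the bill text would also be flagged, so human legal verification is still needed.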

Limitations noted by the authors (relevant for economic interpretation):
- Small dataset (15 laws) and single-country focus (Romania) limit generalizability.
- Ground-truth expuneri de motive are themselves political constructions; high closeness does not imply normative correctness.
- Closed-source training details are unavailable, so causal attribution to training data composition is inferential.
- LLM-as-Judge methodology has potential self-preference and shared-bias blind spots.

Suggested next steps for research and practice:
- Scale to larger, multi-jurisdictional datasets to quantify generality and distributional effects across languages.
- Combine LLM outputs with retrieval/verification pipelines anchored in legal databases to reduce hallucinations.
- Run cost–benefit analyses comparing frontier-model subscription (access cost + audit) with investment in open-weight fine-tuning and local datasets.
- Develop formal economic models of cascading bounded rationality to inform optimal auditing and oversight expenditure.

Assessment

Paper Type: descriptive
Evidence Strength: low. Findings come from a small (n=15), domain-specific dataset and rely on proxy outcome measures (LLM-as-judge semantic scores and text-similarity metrics) rather than behavioral or outcome-based validation; results are indicative but not strong causal evidence of real-world advisory reliability.
Methods Rigor: medium. The study compares multiple contemporary LLMs, uses a dual evaluation framework (semantic scoring plus programmatic similarity) and reports effect sizes (e.g., Cohen's d), but it is limited by small sample size, potential evaluation bias from using LLM-based judges, lack of external human-expert validation, and limited transparency on prompts and preprocessing.
Sample: 15 Romanian Senate law proposals paired with their official explanatory memoranda (expuneri de motive); six LLMs evaluated: GPT-5-mini, GPT-5-chat (OpenAI), Claude Haiku 4.5 (Anthropic), and Llama 4 Maverick, Llama 3.3 70B, Llama 3.1 8B (Meta); outputs scored via an LLM-as-Judge semantic scoring rubric (5-point scale) plus programmatic text similarity metrics; statistical comparisons reported (semantic closeness means, Cohen's d).
Themes: governance, human_ai_collab
Generalizability:
- Very small sample (15 bills) limits statistical power and heterogeneity of legal topics.
- Single-country focus (Romanian Senate): institutional and legal style may not generalize to other legislatures or languages.
- Evaluations compare to official memoranda rather than downstream legislative outcomes or real-world decision-making.
- LLM-based judges and similarity metrics may inherit model biases and are not equivalent to human expert judgment.
- Specific model snapshots were assessed; results may not hold for other model versions or future updates.
- Prompting choices, input formatting, and translation issues (if any) could materially affect results.

Claims (9)

Claim	Category	Direction	Confidence	Outcome	Details
The study uses a dataset of 15 Romanian Senate law proposals paired with their official explanatory memoranda (expuneri de motive).	Other	null_result	high	dataset size / data corpus	n=15 0.3
Six LLMs were evaluated: GPT-5-mini, GPT-5-chat (OpenAI), Claude Haiku 4.5 (Anthropic), and Llama 4 Maverick, Llama 3.3 70B, Llama 3.1 8B (Meta).	Other	null_result	high	models evaluated	n=6 0.3
Model outputs were evaluated using a dual framework combining LLM-as-Judge semantic scoring and programmatic text similarity metrics.	Decision Quality	null_result	high	evaluation method / scoring approach	0.3
Frontier models (Claude Haiku 4.5, GPT-5-chat, GPT-5-mini) achieve statistically indistinguishable semantic closeness scores above 4.6 out of 5.0.	Decision Quality	positive	high	semantic closeness score (LLM-as-Judge)	n=15 above 4.6 out of 5.0 0.18
Open-weight models cluster a full tier below the frontier models (Cohen's d larger than 1.4).	Decision Quality	negative	high	semantic closeness score difference (frontier vs open-weight)	n=6 Cohen's d larger than 1.4 0.18
All models exhibit task-dependent confabulation: they perform well on standardized legislative templates (e.g., EU directive transpositions) but generate plausible yet unfounded reasoning for politically idiosyncratic proposals.	Decision Quality	mixed	high	incidence of confabulation / faithfulness to official reasoning, stratified by task type (standardized vs idiosyncratic)	n=15 0.18
The paper frames the LLM-politician relationship through principal-agent theory and bounded rationality, conceptualizing the legislator as a principal delegating advisory tasks to a boundedly rational agent under structural information asymmetry.	Governance And Regulation	null_result	high	theoretical framing	0.03
The authors introduce the concept of 'cascading bounded rationality' to describe how failures compound across bounded principals, agents, and evaluators.	Governance And Regulation	negative	high	conceptual risk of compounded failures	0.03
The operative risk for legislators is not stable ideological bias in LLMs but contextual ignorance shaped by training data coverage.	Decision Quality	null_result	medium	source of systematic risk (ideological bias vs contextual ignorance)	n=15 0.11