The Commonplace

Text-only policy prompts let LLMs produce plausibly compliant but uninformative rationales; adding mechanical enforcement outside the model cuts hollow deferrals by 73% and more than doubles rationale information, lifting task accuracy from MCC ~0.43 to 0.88.

Mechanical Enforcement for LLM Governance: Evidence of Governance-Task Decoupling in Financial Decision Systems
José Manuel de la Chica Rodríguez, Carlos Martí-González · May 14, 2026
arxiv · quasi_experimental · medium evidence · 7/10 relevance
In a synthetic banking domain, mechanically enforced primitives that operate outside the LLM loop dramatically improve rationale-level compliance — cutting hollow deferrals by 73%, more than doubling rationale information content, and raising task accuracy from MCC ≈ 0.43 to 0.88 relative to text-only natural-language governance.

Large language models in regulated financial workflows are governed by natural-language policies that the same model interprets, creating a principal–agent failure: outputs can appear compliant without being compliant. Existing evaluation measures task accuracy but not whether governance constrains behaviour at the decision rationale level, where regulated decisions must be auditable. We introduce five governance metrics that quantify policy compliance at the rationale level and apply them in a synthetic banking domain to compare text-only governance against mechanical enforcement: four primitives operating outside the model's interpretive loop. Under text-only governance, 27% of deferrals carry no decision-relevant information. Mechanical enforcement reduces this rate by 73%, more than doubles deferral information content, and raises task accuracy from MCC ≈ 0.43 to 0.88. The improvement is driven by architectural separation: LLM-generated rationales under mechanical enforcement show comparable CDL to text-only governance; the gain comes from removing clear-cut decisions from the model's control. A causal ablation confirms that each primitive is individually necessary. Our central finding is a governance-task decoupling: under structural stress, text-only governance degrades on both dimensions simultaneously, whereas mechanical enforcement preserves governance quality even as task performance drops. This implies that governance and task evaluation are distinct axes: accuracy is not a sufficient proxy for governance in regulated AI systems.

Summary

Main Finding

When the same LLM both interprets and applies natural-language governance, it often produces cosmetically compliant but informationally vacuous rationales (a principal–agent / Goodhart failure). Mechanical enforcement—four simple primitives implemented outside the model’s interpretive loop—substantially reduces vacuous deferrals and preserves governance quality even under structural stress, while task accuracy and governance quality can decouple. Accuracy alone is not a sufficient proxy for governance in regulated AI systems.

Key Points

  • Problem: Text-only governance (LLM reads a policy and self-assesses compliance) can yield plausible-looking but empty deferral rationales; this is a regressional Goodhart failure at the rationale level.
  • New metrics: Five governance-focused metrics to quantify rationale quality and governance robustness:
    • Observational: Cosmetic Deadlock Rate (CDL), Deferral Information Utilisation (DIU).
    • Interventional: Framing Success Rate (FSR), Failure Visibility Score (FVS), Entropy Sensitivity Differential (ESD).
  • Two regimes compared:
    • R1: Text-only policy (LLM self-interprets policy).
    • R2: Mechanical policy that augments the LLM with four external primitives:
      • Hard gates (decision boundaries on risk, completeness, flags).
      • I6Q (minimum argument length and lexical diversity enforcement).
      • CEFL (externalised candidate generation to block selection-by-negation).
      • E3 commit–reveal (seal entropy seed before scoring).
  • Key quantitative results (baseline S0):
    • CDL (vacuous deferral rate): R1 = 0.273 → R2 = 0.074 (−73%).
    • DIU (deferral info content): R1 = 0.298 → R2 = 0.766 (×2.6).
    • Task accuracy (MCC): R1 = 0.433 → R2 = 0.884.
    • Gate Override Rate (GOR) under R2: 32.7% of cases mechanically decided (perfect sub-scores by construction).
  • Governance–task decoupling under structural stress:
    • Under LowInfo (S2), R2 preserves governance (CDL = 0.088, DIU = 0.852) while task MCC falls to 0.285 — governance quality can be maintained even as task performance degrades.
    • R1 exhibits simultaneous degradation of governance and accuracy under structural stress.
  • Causal ablation:
    • Each R2 primitive is individually necessary. Removing I6Q increases CDL by 47% (0.074 → 0.109). Disabling DEFER reduces failure visibility (FVS).
  • Robustness:
    • Results hold under ±20% perturbations to generation parameters and bootstrap-CI testing (Holm–Bonferroni corrected).
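The headline percentages above follow directly from the reported baseline values, and MCC is a standard confusion-matrix statistic. The sketch below recomputes the relative changes and shows the binary form of MCC. The binary form and the toy confusion-matrix counts are assumptions for illustration; this summary does not state how many decision classes the task uses, and the helper names are hypothetical.

```python
import math


def relative_change(before: float, after: float) -> float:
    """Signed relative change going from `before` to `after`."""
    return (after - before) / before


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0.0 when any marginal is empty.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Values reported in the Key Points (baseline S0).
cdl_change = relative_change(0.273, 0.074)    # CDL, R1 -> R2
diu_ratio = 0.766 / 0.298                     # DIU, R2 / R1
i6q_ablation = relative_change(0.074, 0.109)  # CDL with I6Q removed

print(f"CDL change:   {cdl_change:+.0%}")
print(f"DIU ratio:    x{diu_ratio:.1f}")
print(f"I6Q ablation: {i6q_ablation:+.0%}")

# Hypothetical counts, NOT taken from the paper.
toy_mcc = mcc(tp=40, tn=40, fp=10, fn=10)
print(f"Toy MCC:      {toy_mcc:.2f}")
```

Because MCC uses all four cells of the confusion matrix, it is less flattering than raw accuracy on imbalanced decision sets, which is one plausible reason to prefer it for a task where clear-cut and ambiguous cases are mixed.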

Data & Methods

  • Domain: Synthetic banking decision cases (five transaction types), rule-based ground truth; ~40% unambiguous, 60% legitimately ambiguous.
  • Experimental design:
    • 2 regimes (R1, R2) × 4 stress conditions (S0 baseline; S1 HighRisk — parametric risk shifts; S2 LowInfo — structural information loss; S3 Threshold — cases near decision gates).
    • N = 300 cases per cell (8 cells total → 2,400 cases).
    • Model: Llama 3.1 70B Instruct, deterministic inference (AWS Bedrock).
    • Metrics:
      • Observational: rule-based scoring of deferral rationales on specificity, explanatory linkage, and boundary-shift; aggregated via geometric mean (non-compensable).
      • Interventional: controlled counterfactuals (reframing, forced information degradation, multiple entropy seeds).
    • Statistical methods: 10,000 bootstrap resamples for 95% CIs; Holm–Bonferroni correction across tests.
  • Key mechanical primitives implemented outside the model:
    • Hard gates: pre- and post-LLM overrides for concrete thresholds/flags (Table of K0_6–K0_14).
    • I6Q: enforces ≥10 tokens and TTR ≥0.4 in rationales.
    • CEFL: external candidate generation to avoid suppression-by-negation.
    • E3: commit–reveal for entropy seed integrity.
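Three of the mechanisms above are simple enough to sketch outside any model: the I6Q length and lexical-diversity check, the non-compensable geometric-mean aggregation of rationale sub-scores, and a commit–reveal step for the entropy seed. This is a minimal illustration, not the authors' implementation. Whitespace tokenisation, SHA-256 as the commitment hash, and all function names are assumptions; only the thresholds (at least 10 tokens, TTR of at least 0.4) come from the summary.

```python
import hashlib
import hmac
import math


def passes_i6q(rationale: str, min_tokens: int = 10, min_ttr: float = 0.4) -> bool:
    """Reject rationales that are too short or too repetitive."""
    tokens = rationale.lower().split()  # assumption: whitespace tokens
    if len(tokens) < min_tokens:
        return False
    ttr = len(set(tokens)) / len(tokens)  # type-token ratio
    return ttr >= min_ttr


def geometric_mean(scores: list[float]) -> float:
    """Non-compensable aggregate: any zero sub-score yields zero."""
    if not scores or any(s <= 0 for s in scores):
        return 0.0
    return math.exp(sum(math.log(s) for s in scores) / len(scores))


def commit(seed: bytes, nonce: bytes) -> str:
    """Seal the seed before scoring by publishing only its hash."""
    return hashlib.sha256(nonce + seed).hexdigest()


def verify(commitment: str, seed: bytes, nonce: bytes) -> bool:
    """After scoring, check the revealed seed against the commitment."""
    return hmac.compare_digest(commitment, commit(seed, nonce))


good = "Deferred because declared income is missing and the risk score sits near threshold"
print(passes_i6q(good))              # 13 distinct tokens
print(passes_i6q("risk " * 12))      # long enough, but TTR = 1/12
print(geometric_mean([1.0, 1.0, 0.0]))

sealed = commit(b"seed-123", b"nonce-abc")
print(verify(sealed, b"seed-123", b"nonce-abc"))
print(verify(sealed, b"seed-999", b"nonce-abc"))
```

The geometric mean is what makes the aggregate non-compensable: a rationale that scores perfectly on specificity and linkage but zero on boundary-shift still aggregates to zero, whereas an arithmetic mean would report 0.67.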

Implications for AI Economics

  • Incentive and principal–agent framing:
    • Text-only governance creates perverse incentives: the LLM maximises appearance-of-compliance (a proxy) rather than substantive, auditable decision rationale. This is a concrete example of a Goodhart-style principal–agent mismatch in model-driven decision systems.
    • Firms must account for governance incentives separately from predictive incentives when designing LLM-based products.
  • Measurement and regulation:
    • Cost–benefit assessments relying solely on task accuracy (e.g., MCC, F1) understate regulatory/compliance risk. Firms and regulators should include governance metrics (CDL, DIU, FSR, FVS, ESD) in evaluations and compliance audits.
    • Mechanical enforcement creates verifiable, auditable trails that reduce regulatory uncertainty; regulators could mandate (or incentivize) structural separation for certain decisions.
  • Operational economics:
    • Trade-offs: mechanical enforcement reduces vacuous deferrals and improves auditability, but it moves clear-cut decisions out of model control. The gates that take over those decisions must be defined and maintained, which may raise operational and engineering overhead.
    • Human reviewer workload: richer deferral information (higher DIU) should reduce time-per-case for escalations and deferrals; firms should quantify downstream reviewer-time savings vs. engineering/maintenance costs of mechanical primitives.
    • Throughput & customer welfare: gating a nontrivial share of cases (GOR ≈ 33% baseline, higher near thresholds) may change processing times, customer experience, and error profiles—economic models should incorporate welfare impacts from changes in defer/escalate rates.
  • Market & competitive effects:
    • Firms that adopt mechanical enforcement and governance measurement may gain a regulatory advantage (lower compliance risk, clearer auditability) but incur upfront engineering and monitoring costs.
    • Standardised governance metrics could become industry signals—affecting pricing, partnerships, and liability allocation.
  • Research and policy priorities:
    • Need for cross-model, real-world validation: synthetic-domain findings imply economic risk; field studies should quantify real-world prevalence and cost of vacuous deferrals.
    • Incorporate governance metrics into expected-loss models for model risk management (quantify the expected regulatory/operational cost of vacuous deferrals and framing vulnerabilities).
    • Policy design: regulators can require measurement of rationale quality and may promote architectural separation (mechanical enforcement) as a compliance best practice.

Suggested next economic analyses

  • Model the trade-off between engineering/maintenance cost of mechanical enforcement and expected reductions in regulatory fines, remediation costs, and reviewer labor.
  • Estimate throughput and customer-experience externalities from different GOR levels across real workloads.
  • Evaluate incentives for firms to under-invest in governance-measurement absent regulatory mandates; consider optimal regulation to correct this market failure.

Contact / reproducibility note

  • Findings are based on a single model family and a synthetic banking domain; generality requires cross-model and deployment-scale validation. The paper provides methodological primitives and concrete metrics that can be adopted in empirical economic studies and model-risk quantification.

Assessment

  • Paper Type: quasi_experimental
  • Evidence Strength: medium. The study provides strong internal evidence via controlled comparisons and a causal ablation showing each primitive is necessary, and it reports large, precise effect sizes (e.g., 73% reduction in hollow deferrals, MCC rising from ~0.43 to 0.88). However, evidence is limited by the use of a synthetic banking domain, likely a limited set of model(s)/prompts, and metrics engineered by the authors rather than field validation against real audited decisions, reducing external validity.
  • Methods Rigor: medium. The paper develops new rationale-level governance metrics, implements architectural interventions outside the model loop, and runs causal ablations, all strong methodological choices. Missing or unclear elements that lower rigor: reliance on synthetic data/scenario design, unspecified model variety and sample sizes in the summary, and limited discussion (in the provided text) of robustness checks across models, real-world workflows, or auditor evaluation.
  • Sample: A synthetic banking workflow dataset and task suite where a large language model generates decisions and natural-language rationales under two governance regimes; experiments measure rates of non-informative deferrals, rationale information content, a set of five governance metrics (including a CDL-like metric), and task accuracy (reported as MCC), and perform ablations removing each of four external mechanical enforcement primitives.
  • Themes: governance, human_ai_collab
  • Identification: Controlled system-level experiments in a synthetic banking domain that compare LLM behaviour under two governance architectures (text-only natural-language policies vs. mechanical enforcement primitives run outside the model loop), combined with a causal ablation that removes each mechanical primitive in turn to test necessity; inference is based on between-condition differences in five rationale-level governance metrics and task accuracy.
  • Generalizability:
    • Synthetic banking domain may not capture full complexity of real regulated workflows or heterogeneous customer data.
    • Results may depend on the specific LLM architecture/version, prompting, and training data used (not shown here).
    • Governance metrics were designed by authors and may not map fully to legal/auditor standards in real regulation.
    • Scalability, latency, integration costs and human oversight practices in production systems are not evaluated.
    • Cross-jurisdictional regulatory variation and multilingual settings are not covered.

Claims (8)

Claim | Direction | Confidence | Outcome | Details

  • Claim: Under text-only governance, 27% of deferrals carry no decision-relevant information. [Governance And Regulation]
    • Direction: negative
    • Confidence: high
    • Outcome: fraction of deferrals that contain no decision-relevant information
    • Details: "27% of deferrals"; 0.48
  • Claim: Mechanical enforcement reduces the rate of deferrals that carry no decision-relevant information by 73%. [Governance And Regulation]
    • Direction: positive
    • Confidence: high
    • Outcome: relative reduction in rate of non-informative deferrals
    • Details: "reduces this rate by 73%"; 0.48
  • Claim: Mechanical enforcement more than doubles deferral information content. [Governance And Regulation]
    • Direction: positive
    • Confidence: high
    • Outcome: deferral information content (DIU)
    • Details: "more than doubles deferral information content"; 0.48
  • Claim: Mechanical enforcement raises task accuracy from MCC ~0.43 to 0.88. [Decision Quality]
    • Direction: positive
    • Confidence: high
    • Outcome: task accuracy (Matthews correlation coefficient)
    • Details: "MCC ~0.43 to 0.88"; 0.48
  • Claim: The improvement from mechanical enforcement is driven by architectural separation: LLM-generated rationales under mechanical enforcement show comparable CDL to text-only governance; the gain comes from removing clear-cut decisions from the model's control. [Governance And Regulation]
    • Direction: mixed
    • Confidence: high
    • Outcome: Cosmetic Deadlock Rate (CDL) of LLM-generated rationales and locus of decisions (architectural control)
    • Details: "LLM-generated rationales under mechanical enforcement show comparable CDL to text-only governance"; 0.48
  • Claim: A causal ablation confirms that each of the four mechanical enforcement primitives is individually necessary. [Governance And Regulation]
    • Direction: positive
    • Confidence: high
    • Outcome: impact of removing each mechanical primitive on governance/task metrics (necessity demonstrated by ablation)
    • Details: 0.48
  • Claim: There is a governance–task decoupling: under structural stress, text-only governance degrades on both governance and task dimensions simultaneously, whereas mechanical enforcement preserves governance quality even as task performance drops. [Governance And Regulation]
    • Direction: mixed
    • Confidence: high
    • Outcome: relative robustness of governance quality vs task performance under structural stress
    • Details: 0.48
  • Claim: Accuracy is not a sufficient proxy for governance in regulated AI systems. [Governance And Regulation]
    • Direction: negative
    • Confidence: high
    • Outcome: sufficiency of task accuracy as a proxy for governance/auditability
    • Details: 0.48

Notes