Behavioural testing and red‑teaming cannot guarantee the absence of hidden objectives or long‑horizon agentic behaviour, creating an 'audit gap' between what regulators demand and what can be verified; regulators should limit reliance on behavioural evidence and require deeper mechanistic access such as probes, activation patching, and pre/post‑training comparisons.
This position paper argues that behavioural assurance, even when carefully designed, is being asked to carry safety claims it cannot verify. AI governance frameworks enacted between 2019 and early 2026 require reviewable evidence of properties such as the absence of hidden objectives, resistance to loss-of-control precursors, and bounded catastrophic capability; current assurance methodologies (primarily behavioural evaluations and red-teaming) are epistemically limited to observable model outputs and cannot verify the latent representations or long-horizon agentic behaviours these frameworks presume to regulate. We formalize this structural mismatch as the audit gap, the divergence between required and achievable verification access, and introduce the concept of fragile assurance to describe cases where the evidential structure does not support the asserted safety claim. Through an analysis of a 21-instrument inventory, we identify an incentive gradient where geopolitical and industrial pressures systematically reward surface-level behavioral proxies over deep structural verification. Finally, we propose a technical pivot: bounding the weight of behavioral evidence in legal text and extending voluntary pre-deployment access with mechanistic-evidence classes, specifically linear probes, activation patching, and before/after-training comparisons.
Summary
Main Finding
Behavioural assurance (behavioural evaluations, red-teaming, documentation) cannot epistemically verify the high-consequence absence claims that contemporary AI governance now often requires (absence of hidden objectives, resistance to loss-of-control precursors, bounded catastrophic capability). The paper formalizes this structural mismatch as the "audit gap" and the evidential weakness as "fragile assurance." Mechanistic interpretability is necessary to close that gap but is not yet mature enough on its own; the authors therefore propose a pragmatic, structured-access pivot to extend voluntary pre-deployment access with mechanistic evidence classes (e.g., linear probes, activation patching, before/after-training comparisons) and to legally limit the evidential weight of purely behavioural proof.
Key Points
- Audit gap — defined formally: for an instrument i, Ai = access level implicitly assumed by its accepted evidence, Vi = access level achievable by independent verifiers; the audit gap is the interval [Vi, Ai]. In practice Vi < Ai for key governance instruments.
- Fragile assurance — a safety claim is fragile if (i) it cannot be reproducibly checked by an independent party under comparable conditions, or (ii) the inferential gap between evidence and asserted claim is unsupported by the evidence’s structure. Fragile claims can be true but lack reproducible verification.
- Representative anchors and inventory: the argument is anchored in seven governance instruments (EU AI Act Art. 55, California SB-53, Singapore AI Verify, South Korea AI Basic Act, India AI Governance Guidelines, Council of Europe AI Convention, OECD Recommendation) and extended to a 21-row inventory. In none of the seven anchors does the access level assumed by the instrument uniformly match the access level achievable by independent verifiers.
- Access taxonomy and coding — uses the behavioural → outside-the-box → grey-box → white-box → state-embedded access taxonomy. Many statutes presume grey-/white-box evidence for absence claims while independent verifiers routinely have behavioural or outside-the-box access only.
- Behavioural evidence fit — behavioural assurance is well-matched to decomposable claims (bias, narrow capability bounds) but is epistemically insufficient for absence claims about latent properties (hidden objectives, deception, long-horizon agentic goals).
- Incentive gradient — three uncoordinated pressures (speed/first-mover advantage, sovereignty/industrial policy, and pragmatic tractability of behavioural proxies) bias institutions and firms toward surface-level behavioural proxies, reinforcing fragile assurance and widening the audit gap.
- Verification retreat — despite growth in external behavioural evaluation ecosystems (UK AISI, METR, Apollo, FAR.AI), mechanistic interpretability capacity at frontier labs and institutional emphasis on pre-deployment catastrophic-risk verification have diminished, making the audit gap structural rather than transient.
- Agentic deployment exacerbates the problem: agentic systems introduce long-horizon compounding, tool use, attribution complexity, and evaluation awareness. Empirical results from other works cited in the paper show that safety-relevant internal states (e.g., intention/refusal cliffs, deception markers) may be invisible to output-only tests yet detectable via probes and activation analysis. Such internal access, however, is not generally available to independent verifiers.
- Practical proposal (technical pivot) — do not mandate immediate universal mechanistic access; instead (a) bound the weight of behavioural evidence in law/text for certain absence claims, and (b) extend voluntary pre-deployment structured-access arrangements to include mechanistic-evidence classes (linear probes, activation patching, before/after-training comparisons) implemented in enclave-style verification settings and incorporated into safety case / Claims-Arguments-Evidence (CAE) frameworks.
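To make the "linear probe" evidence class concrete, the following is a minimal sketch of a difference-of-means probe fitted on synthetic vectors. It is not the paper's protocol: the data generator, the dimensions, the `shift` parameter, and every function name are illustrative assumptions, and real probes would be fitted on activations extracted from a model under structured access.

```python
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def fit_probe(pos, neg):
    """Difference-of-means linear probe; returns (weights, bias).

    The decision boundary passes through the midpoint of the two class means.
    """
    mu_pos, mu_neg = mean_vector(pos), mean_vector(neg)
    w = [p - q for p, q in zip(mu_pos, mu_neg)]
    mid = [(p + q) / 2 for p, q in zip(mu_pos, mu_neg)]
    b = -sum(wi * mi for wi, mi in zip(w, mid))
    return w, b


def probe_predict(w, b, x):
    """True if the probe assigns x to the 'feature present' class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b > 0


def synthetic_activations(n, dim, shift, rng):
    """Toy stand-in for activations (assumption): unit Gaussian noise,
    with the latent feature encoded as a shift along coordinate 0."""
    rows = []
    for _ in range(n):
        row = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        row[0] += shift
        rows.append(row)
    return rows


def probe_accuracy(seed=0, dim=8, n=200, shift=6.0):
    """Fit on one synthetic sample and score on a fresh held-out sample."""
    rng = random.Random(seed)
    w, b = fit_probe(
        synthetic_activations(n, dim, shift, rng),
        synthetic_activations(n, dim, 0.0, rng),
    )
    test_pos = synthetic_activations(n, dim, shift, rng)
    test_neg = synthetic_activations(n, dim, 0.0, rng)
    correct = sum(probe_predict(w, b, x) for x in test_pos)
    correct += sum(not probe_predict(w, b, x) for x in test_neg)
    return correct / (2 * n)


if __name__ == "__main__":
    print(f"held-out probe accuracy: {probe_accuracy():.3f}")
```

The probe reads a property directly from the internal vectors rather than from outputs, which is the kind of evidence the proposal would add to behavioural results. High accuracy on this toy data says nothing about how well probes work on frontier models.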
Data & Methods
- Documentary analysis: coded seven anchor governance instruments (plus a 21-row extended inventory) against an access taxonomy from prior work [behavioural, outside-the-box, grey-box, white-box, state-embedded] to assess whether statutory/accepted evidence implicitly requires verifier access beyond what independent verifiers currently obtain.
- Operationalized audit gap: for each instrument i, determined Ai (assumed evidence access implied by statute) and judged Vi (highest access level routinely available to independent verifiers under 2024–25 voluntary access ecosystems). The gap [Vi, Ai] was used to colour-code severity (green/amber/red).
- Literature synthesis: drew on empirical and theoretical literature in mechanistic interpretability, external evaluation/structured-access programs, and agentic misalignment reports (citing work showing internal-state detection via linear probes and activation analyses, plus empirical reports of insider-threat behaviours in frontier models).
- Conceptual arguments and case summaries: combined the matrix coding, incentive-gradient analysis, and qualitative institutional trend evidence (names/renamings of institutes, summit framing shifts, regulatory drafting changes like SB-1047→SB-53) to argue the audit gap is structural and growing for high-consequence absence claims.
- Proposed pilot: the paper specifies an actionable, contract-level mechanistic-evidence pilot (P1–P6). Full protocol details are in Section 7 of the paper and are not all reproduced in this excerpt. The proposal builds on structured-access/enclave architectures and CAE safety-case frameworks.
Limitations of methods noted by authors:
- The claim applies principally to high-consequence absence claims (not to decomposable, moderate-risk properties where behavioural evaluation is often adequate).
- Coding judgements (Vi assignments) are normative and sensitivity-tested; borderline cases are marked amber where structured-access protocols could close the gap in principle.
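The audit-gap coding described above can be sketched as a comparison on an ordered access scale. This is an illustration under stated assumptions, not the authors' coding scheme: measuring the gap as a count of taxonomy levels, the exact green/amber/red rule, the `closable` flag, and the example instruments are all hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum


class Access(IntEnum):
    """Access taxonomy from the summary, ordered lowest to highest."""
    BEHAVIOURAL = 0
    OUTSIDE_THE_BOX = 1
    GREY_BOX = 2
    WHITE_BOX = 3
    STATE_EMBEDDED = 4


@dataclass(frozen=True)
class Instrument:
    name: str
    assumed: Access     # Ai: access implied by the accepted evidence
    achievable: Access  # Vi: access independent verifiers routinely get


def audit_gap(inst):
    """Width of the interval [Vi, Ai] in taxonomy levels (0 if Vi >= Ai).

    Counting levels is an assumption; the summary defines only the interval.
    """
    return max(0, inst.assumed - inst.achievable)


def severity(inst, closable=False):
    """Hypothetical colour rule: green if there is no gap, amber if a gap
    exists but structured access could close it in principle, red otherwise."""
    if audit_gap(inst) == 0:
        return "green"
    return "amber" if closable else "red"


if __name__ == "__main__":
    # Placeholder instruments, not codings taken from the paper.
    examples = [
        (Instrument("hypothetical-A", Access.WHITE_BOX, Access.BEHAVIOURAL), False),
        (Instrument("hypothetical-B", Access.GREY_BOX, Access.OUTSIDE_THE_BOX), True),
        (Instrument("hypothetical-C", Access.BEHAVIOURAL, Access.BEHAVIOURAL), False),
    ]
    for inst, closable in examples:
        print(inst.name, audit_gap(inst), severity(inst, closable))
```

Using an `IntEnum` keeps the ordering of access levels explicit, so the gap is a plain subtraction and any new level only needs to be placed in the ordering.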
Implications for AI Economics
- Market and investment incentives
  - Current incentives (speed, sovereignty, tractability) bias investment toward behavioural-eval tooling and faster deployment rather than deeper mechanistic verification. That can produce systematic underinvestment in technologies and teams needed to provide verifiable mechanistic evidence, potentially raising systemic-tail risk.
  - If regulators bound behavioural evidence (as recommended), firms will face higher compliance costs for high-consequence claims, shifting R&D budgets toward interpretability tooling and secure enclave infrastructures. This may advantage better-funded incumbents and raise barriers to entry.
- Industrial policy and geopolitics
  - Sovereign competition (the incentive gradient) could drive regulatory arbitrage: jurisdictions emphasizing permissive evidentiary standards may attract faster deployments, while jurisdictions requiring mechanistic access (or valuing it in procurement) may shift the locus of frontier lab activity.
  - Coordinated international frameworks or harmonized procurement standards could reduce harmful regulatory race-to-the-bottom incentives, but coordination itself has political-economy costs.
- Firm valuation, liability, and insurance
  - Fragile assurance increases ambiguity in firm risk profiles. Insurers and investors will need to price the risk that behavioural-based safety claims are fragile and may not forestall catastrophic externalities. This can increase insurance premia for agentic-system deployments or require new forms of liability-sharing and indemnification.
  - Firms that can credibly provide structured mechanistic evidence (via enclaves or trusted third parties) may command a premium in procurement and a valuation uplift due to lower tail-risk exposure.
- Compliance costs and economic efficiency
  - Requiring mechanistic evidence or bounding behavioural evidence increases compliance costs (setup of secure enclaves, compute and personnel costs for probes/activation analyses, auditing services). In the short run, this may slow deployment and raise product prices; in the medium term, it could reallocate resources toward public-good interpretability research.
- Market for verification services
  - If structured mechanistic access becomes the norm (even as voluntary or bounded by law), a specialized market for mechanistic audits and enclave-hosted verification will grow. This creates positive economic opportunities (audit firms, secure compute providers) but also concentrated power in verifiers and enclave operators.
- Public goods and funding rationale
  - The authors’ diagnosis implies under-provision of mechanistic verification capability relative to societal need. This provides an economic argument for public funding or subsidies for interpretability research, shared verification infrastructure, and international standards, to correct market failure driven by the incentive gradient.
- Policy levers economists should consider
  - Quantify the audit gap in cost/risk terms for assessments (e.g., expected-value models of tail risk under behavioural-only assurance).
  - Design subsidies or procurement preferences to internalize social benefits of deep verification (funding frontier interpretability research, tax credits for mechanistic audits).
  - Structure liability and insurance regimes to reflect evidential friction: require higher evidential standards for claims used in liability shields; allow safe-harbour reductions only with verifiable mechanistic evidence.
  - Encourage interoperable structured-access standards (enclave APIs, audit data formats) to reduce verification transaction costs and lower entry barriers for trusted auditors.
  - Track and model international regulatory heterogeneity to evaluate competitive effects and the likelihood of regulatory arbitrage.
- Short-run vs long-run trade-offs
  - Short run: imposing mechanistic verification raises costs and delays, risks concentrating capacity among large firms, and could slow beneficial deployments.
  - Long run: better mechanistic verification could reduce systemic-tail risk, stabilize insurance markets, and create durable competitive advantages for firms that invest in verifiable safety, potentially improving welfare by reducing catastrophic-risk externalities.
- Practical next steps for economists and policymakers
  - Incorporate the audit gap into risk assessments and cost–benefit analyses of AI regulation.
  - Model the economic effects of bounding behavioural evidence versus phased adoption of structured mechanistic audits (e.g., pilot programs, subsidies).
  - Design procurement and subsidy mechanisms to accelerate shared mechanistic-verification capacity without unduly privileging incumbents.
Summary recommendation for economists: treat the audit gap as an externality and market-failure problem. Economic analysis should quantify verification costs and their distributional effects across firm sizes and jurisdictions, and should inform the design of policy tools (procurement, subsidies, liability rules, standards) that realign incentives toward economically efficient levels of mechanistic verification for high-consequence AI systems.
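The expected-value framing suggested among the policy levers can be sketched with a deliberately simple model. The paper does not specify one, so the functional form (expected loss = hazard probability × miss rate × loss), the parameter names, and all numbers below are hypothetical assumptions chosen only to show the shape of the calculation.

```python
def expected_tail_loss(p_hazard, detection_rate, loss):
    """Expected loss from a hazard that escapes assurance: p * (1 - d) * L."""
    if not (0.0 <= p_hazard <= 1.0 and 0.0 <= detection_rate <= 1.0):
        raise ValueError("probabilities must lie in [0, 1]")
    return p_hazard * (1.0 - detection_rate) * loss


def net_benefit_of_verification(p_hazard, loss, d_behavioural,
                                d_mechanistic, extra_cost):
    """Reduction in expected tail loss from the higher detection rate,
    minus the additional verification cost."""
    baseline = expected_tail_loss(p_hazard, d_behavioural, loss)
    improved = expected_tail_loss(p_hazard, d_mechanistic, loss)
    return (baseline - improved) - extra_cost


def break_even_cost(p_hazard, loss, d_behavioural, d_mechanistic):
    """Largest extra verification cost that still pays for itself."""
    return p_hazard * (d_mechanistic - d_behavioural) * loss


if __name__ == "__main__":
    # Purely illustrative inputs, not estimates from the paper.
    params = dict(p_hazard=0.01, loss=1_000.0,
                  d_behavioural=0.2, d_mechanistic=0.6)
    print("break-even cost:", break_even_cost(**params))
    print("net benefit at cost 1.0:",
          net_benefit_of_verification(extra_cost=1.0, **params))
```

A model this simple ignores correlated failures, the distribution of costs across firm sizes, and uncertainty in the detection rates themselves, all of which the recommendation above asks economists to quantify.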
Assessment
Claims (7)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| AI governance frameworks enacted between 2019 and early 2026 require reviewable evidence of properties such as the absence of hidden objectives, resistance to loss-of-control precursors, and bounded catastrophic capability. | positive | high | governance_and_regulation | 0.06 |
| Current assurance methodologies (primarily behavioural evaluations and red-teaming) are epistemically limited to observable model outputs and cannot verify latent representations or long-horizon agentic behaviours. | negative | high | ai_safety_and_ethics | 0.06 |
| Behavioural assurance, even when carefully designed, is being asked to carry safety claims it cannot verify. | negative | high | ai_safety_and_ethics | 0.06 |
| We formalize the structural mismatch between required and achievable verification access as the 'audit gap' (the divergence between required and achievable verification access). | positive | high | governance_and_regulation | 0.01 |
| We introduce the concept of 'fragile assurance' to describe cases where the evidential structure does not support the asserted safety claim. | positive | high | ai_safety_and_ethics | 0.01 |
| An analysis of a 21-instrument inventory identifies an incentive gradient where geopolitical and industrial pressures systematically reward surface-level behavioral proxies over deep structural verification. | negative | high | governance_and_regulation | n=21; 0.06 |
| The paper proposes a technical and regulatory pivot: bounding the evidentiary weight of behavioral evidence in legal text and extending voluntary pre-deployment access with mechanistic-evidence classes (specifically linear probes, activation patching, and before/after-training comparisons). | positive | high | governance_and_regulation | 0.01 |