The Commonplace

Behavioural testing and red‑teaming cannot guarantee the absence of hidden objectives or long‑horizon agentic behaviour, creating an 'audit gap' between what regulators demand and what can be verified; regulators should limit reliance on behavioural evidence and require deeper mechanistic access such as probes, activation patching, and pre/post‑training comparisons.

Position: Behavioural Assurance Cannot Verify the Safety Claims Governance Now Demands
Pratinav Seth, Vinay Kumar Sankarapu · May 14, 2026
Source: arXiv · Type: commentary · Evidence: n/a · Relevance: 7/10
The paper argues behavioural assurance (tests and red-teaming) cannot reliably verify latent objectives or long-horizon agentic behaviours demanded by recent AI governance frameworks, formalizes this shortfall as the 'audit gap' and 'fragile assurance', analyzes a 21-instrument inventory showing incentives favor surface-level proxies, and recommends legally bounding behavioural evidence while extending mechanistic pre-deployment access (linear probes, activation patching, before/after-training comparisons).

This position paper argues that behavioural assurance, even when carefully designed, is being asked to carry safety claims it cannot verify. AI governance frameworks enacted between 2019 and early 2026 require reviewable evidence of properties such as the absence of hidden objectives, resistance to loss-of-control precursors, and bounded catastrophic capability; current assurance methodologies (primarily behavioural evaluations and red-teaming) are epistemically limited to observable model outputs and cannot verify the latent representations or long-horizon agentic behaviours these frameworks presume to regulate. We formalize this structural mismatch as the audit gap, the divergence between required and achievable verification access, and introduce the concept of fragile assurance to describe cases where the evidential structure does not support the asserted safety claim. Through an analysis of a 21-instrument inventory, we identify an incentive gradient where geopolitical and industrial pressures systematically reward surface-level behavioral proxies over deep structural verification. Finally, we propose a technical pivot: bounding the weight of behavioral evidence in legal text and extending voluntary pre-deployment access with mechanistic-evidence classes, specifically linear probes, activation patching, and before/after-training comparisons.

Summary

Main Finding

Behavioural assurance (behavioural evaluations, red-teaming, documentation) cannot epistemically verify the high-consequence absence claims that contemporary AI governance now often requires (absence of hidden objectives, resistance to loss-of-control precursors, bounded catastrophic capability). The paper formalizes this structural mismatch as the "audit gap" and the evidential weakness as "fragile assurance." Mechanistic interpretability is necessary to close that gap but is not yet mature enough on its own; the authors therefore propose a pragmatic, structured-access pivot to extend voluntary pre-deployment access with mechanistic evidence classes (e.g., linear probes, activation patching, before/after-training comparisons) and to legally limit the evidential weight of purely behavioural proof.

Key Points

  • Audit gap — defined formally: for an instrument i, A_i is the access level implicitly assumed by its accepted evidence and V_i is the access level achievable by independent verifiers; the audit gap is the interval [V_i, A_i]. In practice V_i < A_i for key governance instruments.
  • Fragile assurance — a safety claim is fragile if (i) it cannot be reproducibly checked by an independent party under comparable conditions, or (ii) the inferential gap between evidence and asserted claim is unsupported by the evidence’s structure. Fragile claims can be true but lack reproducible verification.
  • Representative anchors and inventory — the argument is anchored in seven governance instruments (EU AI Act Art. 55, California SB-53, Singapore AI Verify, South Korea AI Basic Act, India AI Governance Guidelines, Council of Europe AI Convention, OECD Recommendation) and extended to a 21-row inventory; none of the anchors uniformly aligns the access that regulators assume with the access that verifiers can actually achieve.
  • Access taxonomy and coding — uses the behavioural → outside-the-box → grey-box → white-box → state-embedded access taxonomy. Many statutes presume grey-/white-box evidence for absence claims while independent verifiers routinely have behavioural or outside-the-box access only.
  • Behavioural evidence fit — behavioural assurance is well-matched to decomposable claims (bias, narrow capability bounds) but is epistemically insufficient for absence claims about latent properties (hidden objectives, deception, long-horizon agentic goals).
  • Incentive gradient — three uncoordinated pressures (speed/first-mover advantage, sovereignty/industrial policy, and pragmatic tractability of behavioural proxies) bias institutions and firms toward surface-level behavioural proxies, reinforcing fragile assurance and widening the audit gap.
  • Verification retreat — despite growth in external behavioural evaluation ecosystems (UK AISI, METR, Apollo, FAR.AI), mechanistic interpretability capacity at frontier labs and institutional emphasis on pre-deployment catastrophic-risk verification have diminished, making the audit gap structural rather than transient.
  • Agentic deployment exacerbates the problem — agentic systems introduce long-horizon compounding, tool-use, attribution complexity, and evaluation-awareness; empirical results (from other cited works) show internal states relevant to safety (e.g., intention/refusal cliffs, deception markers) may be invisible to output-only tests but detectable via probes and activation analysis — yet such internal access is not generally available to independent verifiers.
  • Practical proposal (technical pivot) — do not mandate immediate universal mechanistic access; instead (a) bound the weight of behavioural evidence in law/text for certain absence claims, and (b) extend voluntary pre-deployment structured-access arrangements to include mechanistic-evidence classes (linear probes, activation patching, before/after-training comparisons) implemented in enclave-style verification settings and incorporated into safety case / Claims-Arguments-Evidence (CAE) frameworks.
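The linear-probe evidence class named in the key points can be illustrated with a minimal sketch. It trains a logistic-regression probe on synthetic "activations" in which a hidden binary property is encoded along one direction. Everything here is an assumption made for illustration: the data is random noise plus a planted direction, the function names are hypothetical, and nothing is taken from the paper's own protocol or from any real model.

```python
import numpy as np


def make_synthetic_activations(n=400, dim=16, seed=0):
    """Hypothetical stand-in for model activations with a planted latent property."""
    rng = np.random.default_rng(seed)
    latent = rng.integers(0, 2, size=n)            # hidden binary property (assumed)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)         # unit vector encoding the property
    signs = 2 * latent - 1                         # map {0, 1} -> {-1, +1}
    acts = rng.normal(size=(n, dim)) + 3.0 * np.outer(signs, direction)
    return acts, latent


def train_linear_probe(acts, labels, lr=0.1, steps=500):
    """Fit a logistic-regression probe with plain gradient descent."""
    w = np.zeros(acts.shape[1])
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
        grad = probs - labels                      # gradient of the log loss w.r.t. logits
        w -= lr * acts.T @ grad / len(labels)
        b -= lr * grad.mean()
    return w, b


def probe_accuracy(w, b, acts, labels):
    """Fraction of examples where the probe's sign matches the label."""
    preds = (acts @ w + b > 0).astype(int)
    return float((preds == labels).mean())


acts, latent = make_synthetic_activations()
train, test = slice(0, 300), slice(300, 400)       # simple held-out split
w, b = train_linear_probe(acts[train], latent[train])
accuracy = probe_accuracy(w, b, acts[test], latent[test])
print(f"held-out probe accuracy: {accuracy:.2f}")
```

The sketch shows only the mechanics of a probe: it needs read access to internal activations, which is the grey-/white-box access the summary says independent verifiers usually lack. A high accuracy on planted synthetic data says nothing about whether probes detect hidden objectives in real systems.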

Data & Methods

  • Documentary analysis: coded seven anchor governance instruments (plus a 21-row extended inventory) against an access taxonomy from prior work [behavioural, outside-the-box, grey-box, white-box, state-embedded] to assess whether statutory/accepted evidence implicitly requires verifier access beyond what independent verifiers currently obtain.
  • Operationalized audit gap: for each instrument i, determined A_i (the evidence access implied by the statute) and judged V_i (the highest access level routinely available to independent verifiers under 2024–25 voluntary access ecosystems). The gap [V_i, A_i] was used to colour-code severity (green/amber/red).
  • Literature synthesis: drew on empirical and theoretical literature in mechanistic interpretability, external evaluation/structured-access programs, and agentic misalignment reports (citing work showing internal-state detection via linear probes and activation analyses, plus empirical reports of insider-threat behaviours in frontier models).
  • Conceptual arguments and case summaries: combined the matrix coding, incentive-gradient analysis, and qualitative institutional trend evidence (names/renamings of institutes, summit framing shifts, regulatory drafting changes like SB-1047→SB-53) to argue the audit gap is structural and growing for high-consequence absence claims.
  • Proposed pilot: an actionable mechanistic-evidence pilot at contract-level is specified (P1–P6 referenced in the paper), though full protocol details are in section 7 (not all reproduced in this excerpt). The proposal builds on structured-access/enclave architectures and CAE safety-case frameworks.

Limitations of methods noted by authors:
  • The claim applies principally to high-consequence absence claims (not to decomposable, moderate-risk properties where behavioural evaluation is often adequate).
  • Coding judgements (V_i assignments) are normative and sensitivity-tested; borderline cases are marked amber where structured-access protocols could close the gap in principle.
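The operationalized audit gap described in this section can be sketched as a comparison over the ordered access taxonomy. The ordering follows the taxonomy as listed in this summary. The colour rule is an assumption: the summary confirms only that amber marks borderline cases that structured access could close in principle, so treating "no gap" as green and every other gap as red is a guess, and the instruments below are hypothetical placeholders rather than the paper's actual codings.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional, Tuple


class Access(IntEnum):
    """Access taxonomy, ordered from least to most access as listed in the summary."""
    BEHAVIOURAL = 0
    OUTSIDE_THE_BOX = 1
    GREY_BOX = 2
    WHITE_BOX = 3
    STATE_EMBEDDED = 4


@dataclass(frozen=True)
class Instrument:
    name: str
    assumed: Access                                # A_i: access implied by accepted evidence
    achievable: Access                             # V_i: access independent verifiers get
    closable_by_structured_access: bool = False    # assumed flag for borderline cases


def audit_gap(inst: Instrument) -> Optional[Tuple[Access, Access]]:
    """Return the interval (V_i, A_i), or None when V_i >= A_i (no gap)."""
    if inst.achievable >= inst.assumed:
        return None
    return (inst.achievable, inst.assumed)


def severity(inst: Instrument) -> str:
    """Assumed colour rule: green = no gap, amber = closable gap, red = otherwise."""
    if audit_gap(inst) is None:
        return "green"
    return "amber" if inst.closable_by_structured_access else "red"


# Hypothetical instruments, for illustration only.
examples = [
    Instrument("hypothetical-A", Access.WHITE_BOX, Access.BEHAVIOURAL),
    Instrument("hypothetical-B", Access.GREY_BOX, Access.OUTSIDE_THE_BOX, True),
    Instrument("hypothetical-C", Access.BEHAVIOURAL, Access.BEHAVIOURAL),
]
for inst in examples:
    print(inst.name, audit_gap(inst), severity(inst))
```

Encoding the taxonomy as an `IntEnum` makes the comparison V_i < A_i a plain integer comparison, which is all the interval definition requires. The coding judgements themselves remain normative, as the authors note.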

Implications for AI Economics

  • Market and investment incentives
    • Current incentives (speed, sovereignty, tractability) bias investment toward behavioural-eval tooling and faster deployment rather than deeper mechanistic verification. That can produce systematic underinvestment in technologies and teams needed to provide verifiable mechanistic evidence, potentially raising systemic-tail risk.
    • If regulators bound behavioural evidence (as recommended), firms will face higher compliance costs for high-consequence claims, shifting R&D budgets toward interpretability tooling and secure enclave infrastructures. This may advantage better-funded incumbents and raise barriers to entry.
  • Industrial policy and geopolitics
    • Sovereign competition (the incentive gradient) could drive regulatory arbitrage: jurisdictions emphasizing permissive evidentiary standards may attract faster deployments, while jurisdictions requiring mechanistic access (or valuing it in procurement) may shift the locus of frontier lab activity.
    • Coordinated international frameworks or harmonized procurement standards could reduce harmful regulatory race-to-the-bottom incentives, but coordination itself has political-economy costs.
  • Firm valuation, liability, and insurance
    • Fragile assurance increases ambiguity in firm risk profiles. Insurers and investors will need to price the risk that behavioural-based safety claims are fragile and may not forestall catastrophic externalities. This can increase insurance premia for agentic-system deployments or require new forms of liability-sharing and indemnification.
    • Firms that can credibly provide structured mechanistic evidence (via enclaves or trusted third parties) may command a premium in procurement and a valuation uplift due to lower tail-risk exposure.
  • Compliance costs and economic efficiency
    • Requiring mechanistic evidence or bounding behavioural evidence increases compliance costs (setup of secure enclaves, compute and personnel costs for probes/activation analyses, auditing services). In the short run, this may slow deployment and raise product prices; in the medium term, it could reallocate resources toward public-good interpretability research.
  • Market for verification services
    • If structured mechanistic access becomes the norm (even as voluntary or bounded by law), a specialized market for mechanistic audits and enclave-hosted verification will grow. This creates positive economic opportunities (audit firms, secure compute providers) but also concentrated power in verifiers and enclave operators.
  • Public goods and funding rationale
    • The authors’ diagnosis implies under-provision of mechanistic verification capability relative to societal need. This provides an economic argument for public funding or subsidies for interpretability research, shared verification infrastructure, and international standards, to correct market failure driven by the incentive gradient.
  • Policy levers economists should consider
    • Quantify the audit gap in cost/risk terms for assessments (e.g., expected-value models of tail risk under behavioural-only assurance).
    • Design subsidies or procurement preferences to internalize social benefits of deep verification (funding frontier interpretability research, tax credits for mechanistic audits).
    • Structure liability and insurance regimes to reflect evidential friction: require higher evidential standards for claims used in liability shields; allow safe-harbour reductions only with verifiable mechanistic evidence.
    • Encourage interoperable structured-access standards (enclave APIs, audit data formats) to reduce verification transaction costs and lower entry barriers for trusted auditors.
    • Track and model international regulatory heterogeneity to evaluate competitive effects and the likelihood of regulatory arbitrage.
  • Short-run vs long-run trade-offs
    • Short run: imposing mechanistic verification raises costs and delays, risks concentrating capacity among large firms, and could slow beneficial deployments.
    • Long run: better mechanistic verification could reduce systemic-tail risk, stabilize insurance markets, and create durable competitive advantages for firms that invest in verifiable safety, potentially improving welfare by reducing catastrophic-risk externalities.
  • Practical next steps for economists and policymakers
    • Incorporate the audit gap into risk assessments and cost–benefit analyses of AI regulation.
    • Model the economic effects of bounding behavioural evidence versus phased adoption of structured mechanistic audits (e.g., pilot programs, subsidies).
    • Design procurement and subsidy mechanisms to accelerate shared mechanistic-verification capacity without unduly privileging incumbents.
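The suggestion above to quantify the audit gap with expected-value models of tail risk can be made concrete with a toy calculation. This is a sketch under strong assumptions, not a model from the paper: it treats a hazard as a single event with probability p, assumes each assurance regime catches it with some detection rate d, and prices the undetected hazard at a fixed cost C, giving an expected loss of p · (1 − d) · C. All parameter values are invented for illustration.

```python
def expected_tail_loss(p_hazard: float, detection_rate: float, harm_cost: float) -> float:
    """Expected loss from an undetected hazard: p * (1 - d) * C (toy model)."""
    if not 0.0 <= p_hazard <= 1.0:
        raise ValueError("p_hazard must be a probability")
    if not 0.0 <= detection_rate <= 1.0:
        raise ValueError("detection_rate must be a probability")
    return p_hazard * (1.0 - detection_rate) * harm_cost


def break_even_verification_cost(p_hazard: float,
                                 d_behavioural: float,
                                 d_mechanistic: float,
                                 harm_cost: float) -> float:
    """Largest extra verification spend justified by the reduction in expected loss."""
    loss_behavioural = expected_tail_loss(p_hazard, d_behavioural, harm_cost)
    loss_mechanistic = expected_tail_loss(p_hazard, d_mechanistic, harm_cost)
    return loss_behavioural - loss_mechanistic


# Illustrative numbers only; none come from the paper.
budget = break_even_verification_cost(
    p_hazard=0.01, d_behavioural=0.2, d_mechanistic=0.7, harm_cost=1000.0
)
print(f"break-even verification spend: {budget:.2f}")
```

In this framing the break-even figure is the most a regulator or firm should pay for the added mechanistic evidence. The real difficulty, which the sketch does not address, is that the detection rates for absence claims are exactly what behavioural evidence cannot establish.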

Summary recommendation for economists: treat the audit gap as an economic externality and market-failure problem. Economic analysis should quantify verification costs and the distributional effects across firm sizes and jurisdictions, and should design policy tools (procurement, subsidies, liability rules, standards) that realign incentives toward economically efficient levels of mechanistic verification for high-consequence AI systems.

Assessment

  • Paper type: commentary
  • Evidence strength: n/a. This is a normative/position paper that advances conceptual arguments and a qualitative inventory rather than presenting empirical causal evidence or statistical inference.
  • Methods rigor: medium. The paper formalizes concepts (audit gap, fragile assurance), systematically analyzes a curated 21-instrument inventory, and links technical and policy literatures; however, it lacks empirical validation, formal proofs of feasibility, and quantitative evaluation of proposed measures.
  • Sample: Qualitative review of AI governance frameworks enacted between 2019 and early 2026, plus a curated inventory of 21 assurance instruments and an assessment of prevailing assurance techniques (behavioral evaluations, red-teaming) and proposed mechanistic evidence classes (linear probes, activation patching, pre/post-training comparisons); no new experimental or econometric datasets are used.
  • Themes: governance, adoption, human_ai_collab, innovation
  • Generalizability:
    • Time-bounded analysis (2019–early 2026) that may not capture later regulatory developments or technical advances.
    • Conceptual and qualitative focus limits applicability to specific model architectures, proprietary deployment arrangements, or sectoral contexts.
    • Proposed mechanistic access methods are presented at a high level and their operational, legal, and cost feasibility is not empirically demonstrated.
    • Jurisdictional and institutional differences mean the identified incentive gradient may vary across countries and industry sectors.
    • Does not provide direct measurement of economic outcomes (productivity, employment, wages), limiting transferability to economic impact assessments.

Claims (7)

Claim · Direction · Confidence · Outcome · Details

  • AI governance frameworks enacted between 2019 and early 2026 require reviewable evidence of properties such as the absence of hidden objectives, resistance to loss-of-control precursors, and bounded catastrophic capability.
    Category: Governance And Regulation · Direction: positive · Confidence: high · Outcome: governance_and_regulation · Details: 0.06
  • Current assurance methodologies (primarily behavioural evaluations and red-teaming) are epistemically limited to observable model outputs and cannot verify latent representations or long-horizon agentic behaviours.
    Category: Ai Safety And Ethics · Direction: negative · Confidence: high · Outcome: ai_safety_and_ethics · Details: 0.06
  • Behavioural assurance, even when carefully designed, is being asked to carry safety claims it cannot verify.
    Category: Ai Safety And Ethics · Direction: negative · Confidence: high · Outcome: ai_safety_and_ethics · Details: 0.06
  • We formalize the structural mismatch between required and achievable verification access as the 'audit gap' (the divergence between required and achievable verification access).
    Category: Governance And Regulation · Direction: positive · Confidence: high · Outcome: governance_and_regulation · Details: 0.01
  • We introduce the concept of 'fragile assurance' to describe cases where the evidential structure does not support the asserted safety claim.
    Category: Ai Safety And Ethics · Direction: positive · Confidence: high · Outcome: ai_safety_and_ethics · Details: 0.01
  • An analysis of a 21-instrument inventory identifies an incentive gradient where geopolitical and industrial pressures systematically reward surface-level behavioral proxies over deep structural verification.
    Category: Governance And Regulation · Direction: negative · Confidence: high · Outcome: governance_and_regulation · Details: n=21, 0.06
  • The paper proposes a technical and regulatory pivot: bounding the evidentiary weight of behavioral evidence in legal text and extending voluntary pre-deployment access with mechanistic-evidence classes (specifically linear probes, activation patching, and before/after-training comparisons).
    Category: Governance And Regulation · Direction: positive · Confidence: high · Outcome: governance_and_regulation · Details: 0.01

Notes