Blaming model internals for unsafe AI deployments misplaces the gatekeeper: regulators should require domain-scoped, independently verifiable 'verification coverage'—including monitoring, accountability and revocation—because mechanistic understanding often fails to translate into safer outputs and post-market oversight is rare.

The Open-Box Fallacy: Why AI Deployment Needs a Calibrated Verification Regime

Phongsakon Mark Konrad, Tim Lukas Adam, Ane Cathrine Holst Merrild, Riccardo Terrenzi, Rebecca De Rosa, Toygar Tanyel, Serkan Ayvaz · May 11, 2026

arXiv · commentary · low evidence · 7/10 relevance

The paper argues that authorization for AI in sensitive domains should be based on domain-scoped, independently verifiable, monitorable, and revocable 'verification coverage' rather than demanding mechanistic interpretability of model internals.

AI deployment in sensitive domains such as health care, credit, employment, and criminal justice is often treated as unsafe to authorize until model internals can be explained. This often leads to an excessive reliance on mechanistic interpretability to address a deployment challenge beyond its intended scope. We argue that the gate should instead be calibrated verification: authorization should be domain-scoped, independently checkable, monitored after release, accountable, contestable, and revocable. The reason is twofold. First, model capability is uneven across nearby tasks, so authorization must attach to a specific use rather than to a model in general. Second, societies have long governed opaque expertise through credentials, monitoring, liability, appeal, and revocation rather than mechanism-level explanation. Recent evidence reinforces this distinction between mechanistic understanding and deployment authority: a 53-percentage-point gap between internal representations and output correction shows that understanding may not translate into action, while one scoping review found that only 9.0% of FDA-approved AI/ML device documents contained a prospective post-market surveillance study. We propose Verification Coverage, a six-component reportable standard with a minimum-composition rule, as the metric that should sit beside capability scores in model cards, leaderboards, and regulatory disclosures.

Summary

Main Finding

The paper argues that mechanistic interpretability ("open-box") is useful but insufficient as the sole gate for high-stakes AI deployment. Instead of demanding model-level explanation as the decisive authorization condition (the "open-box fallacy"), regulators and institutions should adopt a calibrated verification regime that authorizes specific deployments (not models) based on a plural set of verifiable properties. The authors operationalize this with "Verification Coverage": a six-component, reportable profile (with a minimum-composition rule) that must be present for a deployment to be considered for authorization.

Key Points

Open-box principle vs open-box fallacy
- Principle: mechanistic evidence is deployment-relevant.
- Fallacy: treating mechanistic evidence as the decisive, standalone deployment gate is misguided.
Two central reasons to reject the fallacy
- Jagged capability: model performance is uneven across closely related tasks, so authorization must be scoped to specific uses rather than to a model wholesale.
- Institutional precedent: societies routinely authorize opaque expertise (e.g., professional practice) via credentials, monitoring, liability, appeal, and revocation, not full mechanistic access.
Four evidence streams (inputs to verification)
- Mechanistic evidence (internal inspection)
- Behavioural evaluation (benchmarks, trials, red teams)
- Independent review (audits, regulators, reproducibility checks)
- Domain-expert and stakeholder input (contestability, context-specific harms)
Three verifier classes (modes of checking)
- Formal verifiers (proofs, type checks, cryptographic checks)
- Empirical verifiers (clinical trials, surveillance, test suites)
- Social-normative verifiers (procedures for rights-, liberty-, or opportunity-affecting decisions)
Six required regime properties → Verification Coverage components
- Domain Coverage (scope of authorized use; proxy: log-share inside scope)
- Verifier Strength (presence and quality of checks; proxy: per-output check & evaluator-accuracy curves)
- Monitoring Maturity (post-deployment surveillance; proxy: statistically valid surveillance plan)
- Accountability Clarity (named accountable actor; proxy: identifiable party)
- Contestability (appeal/reason-giving process; proxy: documented path, response metrics)
- Revocation Readiness (predeclared triggers/procedures to withdraw or restrict; proxy: thresholds, time-to-action)
Reporting rule
- Verification Coverage is first reported as the 6-component binary/profile vector v(d).
- Minimum-composition rule: VCmin(d) = min_i v_i(d). Any required rail missing (0) withholds authorization.
- Optional aggregate VCw,D can be reported with institution-specific weights but must not replace the profile.
Empirical/illustrative evidence cited
- Basu et al.: a 53-percentage-point gap between internal representation detection and corrective output behaviour in a clinical setting (illustrates internal understanding need not translate to safe behaviour).
- Scoping review: only 9.0% of FDA-approved AI/ML devices had a prospective post-market surveillance study (illustrates weak monitoring rail in medicine).
- Dell’Acqua et al.: AI improved many consulting tasks but reduced correctness by 19 percentage points on a complex managerial task (illustrates jagged capability).
- Additional citations on evaluation brittleness and gaming: evaluation-detection ability, sandbagging, sycophancy, unfaithful chain-of-thought rationales, and a real-world rollback (GPT-4o April 2025) as examples of evaluation limits and deployment surprises.
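
The reporting rule above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the class and function names are hypothetical, the components are treated as strictly binary, and the weighted aggregate is assumed to be a normalized weighted sum, since the summary does not give the exact form of VCw,D.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class VerificationProfile:
    """Six-component profile v(d) for one deployment d (1 = rail present, 0 = missing)."""
    domain_coverage: int
    verifier_strength: int
    monitoring_maturity: int
    accountability_clarity: int
    contestability: int
    revocation_readiness: int

    def vc_min(self) -> int:
        # Minimum-composition rule: VCmin(d) = min_i v_i(d).
        return min(asdict(self).values())

    def missing_rails(self) -> list[str]:
        # Names of the components that are reported as 0.
        return [name for name, value in asdict(self).items() if value == 0]


def vc_weighted(profile: VerificationProfile, weights: dict[str, float]) -> float:
    # ASSUMPTION: a normalized weighted sum stands in for the optional
    # aggregate VCw,D; the source does not specify its exact form.
    values = asdict(profile)
    total_weight = sum(weights[name] for name in values)
    return sum(weights[name] * values[name] for name in values) / total_weight


# Hypothetical deployment with every rail except post-deployment monitoring.
profile = VerificationProfile(
    domain_coverage=1,
    verifier_strength=1,
    monitoring_maturity=0,
    accountability_clarity=1,
    contestability=1,
    revocation_readiness=1,
)
equal_weights = {name: 1.0 for name in asdict(profile)}

print(profile.vc_min())           # 0: authorization is withheld
print(profile.missing_rails())    # ['monitoring_maturity']
print(vc_weighted(profile, equal_weights))
```

Under equal weights the aggregate is five sixths even though VCmin is 0, which shows why the aggregate may accompany the profile but must not replace it.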

Data & Methods

Paper type: position / conceptual synthesis, not a primary empirical study.
Methods:
- Literature synthesis across mechanistic interpretability, empirical evaluations, regulatory practice, and domain case studies (medicine, credit, employment, criminal justice, autonomous systems).
- Conceptual argument mapping known failure modes to institutional properties needed for safe deployment.
- Formalization of a reporting metric: a 6-dimensional binary/profile v(d) and aggregation rules (VCmin and weighted VCw,D), with concrete proxies for each component to guide reporting and enforcement.
Empirical inputs are drawn from prior studies and scoping reviews cited in the paper (examples listed above); no new dataset or randomized experiment is presented.
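
Two of the proxies named for the profile, log-share inside scope (Domain Coverage) and time-to-action (Revocation Readiness), lend themselves to a short sketch. The function names, the use-case labels, and the timestamps below are hypothetical, and the paper may define these proxies differently.

```python
from datetime import datetime


def in_scope_log_share(logged_uses: list[str], authorized_uses: set[str]) -> float:
    """Fraction of logged requests whose use-case label is inside the authorized scope."""
    if not logged_uses:
        raise ValueError("no logged requests to assess")
    inside = sum(1 for use in logged_uses if use in authorized_uses)
    return inside / len(logged_uses)


def hours_to_action(trigger_time: datetime, action_time: datetime) -> float:
    """Hours between a predeclared revocation trigger firing and the restriction taking effect."""
    return (action_time - trigger_time).total_seconds() / 3600


# Hypothetical log: three requests inside the authorized use, one outside it.
logged = [
    "triage_note_summary",
    "triage_note_summary",
    "dosage_advice",
    "triage_note_summary",
]
share = in_scope_log_share(logged, {"triage_note_summary"})

# Hypothetical incident: trigger at 09:00, restriction in force at 15:00.
delay = hours_to_action(datetime(2026, 1, 1, 9, 0), datetime(2026, 1, 1, 15, 0))

print(share)  # 0.75
print(delay)  # 6.0
```

How such measurements map onto a binary component value (for example, which log-share counts as adequate) is a threshold an institution would have to set; none is assumed here.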

Implications for AI Economics

Market structure and firm incentives
- Authorization-by-deployment (not by-model) raises transaction and compliance costs for integrators, increasing demand for third-party verifiers, auditors, and certification services — a new market for verification infrastructure.
- Firms may compete along two dimensions: raw capability (benchmarks) and Verification Coverage (readiness to deploy in regulated settings). VC may become a competitive differentiator and affect pricing, contracting, and procurement.
- The minimum-composition rule implies non-compensable requirements: strong capability cannot substitute for missing contestability, monitoring, or accountability. This changes cost–benefit calculations of deploying models in high-stakes domains.
Adoption, diffusion, and productivity
- Jagged capability implies heterogeneous substitution effects: some tasks will see rapid AI adoption with productivity gains, others will lag due to verification gaps. Aggregate productivity estimates should account for verification costs and domain-specific authorization lags.
- Authorization frictions (assessments, monitoring obligations, liability exposure) can slow diffusion of powerful general models into regulated domains, preserving incumbent roles or creating demand for specialized domain-specific models and services.
Regulation, liability, and insurance
- Clear, reportable Verification Coverage metrics enable insurers and purchasers to price deployment risk more granularly and could support liability regimes tied to VC outcomes (e.g., higher premiums for low monitoring maturity).
- Regulators can operationalize VC components in procurement rules, conditional approvals, or post-market surveillance mandates; that creates demand for standard-setting and accredited verification bodies.
Measurement and empirical research agenda
- Economists should incorporate verification costs, verifier supply and credibility, and VC-compliance probabilities into models of firm adoption, investment in AI capabilities, and welfare analyses.
- Empirical work can measure how VC presence affects outcomes: adoption rates, error incidence, consumer trust, litigation rates, and market concentration.
- Laboratory and field experiments could quantify the marginal value of each VC component (e.g., how much contestability reduces consumer harms or how monitoring maturity affects drift-related failures).
Policy design and public goods
- Public policy levers: require VC reporting in model cards and procurement; subsidize independent auditors; mandate minimum VCmin thresholds for certain rights-affecting domains; fund public-interest monitoring to lower verification asymmetries.
- Standardization of VC proxies (the paper gives concrete proxies) would reduce information asymmetries between deployers, regulators, and affected parties.
Long-run R&D incentives
- Shifts emphasis from purely mechanistic interpretability research to verification engineering: surveillance design, formal/empirical verifier development, contestability mechanisms, and revocation protocols.
- Open-source vs closed-source trade-offs change: open distribution increases the number of deployment contexts (hence verification duties travel with each integrator), which may reduce welfare from unfettered openness in certain regulated domains unless paired with accessible verification services.
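
The non-compensability point above can be made concrete with a stylized adoption rule. This is an illustrative sketch only: the decision rule, the function name, and all figures are hypothetical and do not come from the paper.

```python
def should_deploy(
    vc_profile: list[int],
    expected_benefit: float,
    verification_cost: float,
    expected_liability: float,
) -> bool:
    """Stylized adoption decision for one deployment in a regulated domain."""
    # Non-compensable requirement: a missing rail (any 0) blocks deployment
    # regardless of how large the expected benefit is.
    if min(vc_profile) == 0:
        return False
    # ASSUMPTION: once authorized, the firm deploys when the net expected value is positive.
    return expected_benefit - verification_cost - expected_liability > 0


# Hypothetical cases, in arbitrary monetary units.
high_value_missing_rail = should_deploy([1, 1, 0, 1, 1, 1], 1_000.0, 30.0, 20.0)
full_rails_profitable = should_deploy([1, 1, 1, 1, 1, 1], 100.0, 30.0, 20.0)
full_rails_unprofitable = should_deploy([1, 1, 1, 1, 1, 1], 40.0, 30.0, 20.0)

print(high_value_missing_rail)   # False
print(full_rails_profitable)     # True
print(full_rails_unprofitable)   # False
```

The first case shows capability failing to offset a missing rail; the third shows verification cost and liability exposure deterring an otherwise authorized deployment, which is the kind of friction the adoption bullets describe.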

Suggested next steps for AI economists and policymakers
- Empirically measure prevalence of the six VC components across deployments and domains.
- Model adoption decisions that include verification cost schedules and liability exposure.
- Design pilot programs where VC reporting is tied to procurement or insurance discounts to estimate behavioral responses and social benefits.
- Study markets for independent verification, accreditation, and contest resolution to understand supply, pricing, and potential for market failure.

Summary takeaway: The paper reframes the deployment question away from a single demand for mechanistic explanation toward a plural, deployment-scoped verification regime. For AI economics, that implies new markets, altered adoption dynamics, measurable regulatory instruments (Verification Coverage), and a research agenda integrating verification costs and institutional design into models of AI diffusion and welfare.

Assessment

Paper Type: commentary
Evidence Strength: low. Primarily a normative and conceptual argument relying on selective prior findings (two empirical citations are used illustratively) rather than new or systematic empirical analysis demonstrating the proposed approach's effectiveness or causal impact.
Methods Rigor: n/a. This is a policy/argumentative paper without original empirical methods, experimental design, or formal identification strategy to evaluate causal claims.
Sample: No original dataset; the paper cites prior empirical work (e.g., a study reporting a 53-percentage-point gap between internal representations and output correction, and a scoping review of FDA-approved AI/ML device documents reporting 9.0% with prospective post-market surveillance) and otherwise builds a conceptual proposal.
Themes: governance, adoption, human_ai_collab
Generalizability:
- Argument is normative and regulatory; applicability depends on institutional and legal contexts (e.g., FDA vs other regulators).
- Cited empirical evidence is limited and domain-specific, so illustrative findings may not generalize across sectors or model architectures.
- Feasibility of independent verification, monitoring, and revocation varies by firm resources, jurisdiction, and technical access to models and logs.
- Does not provide empirical tests of the proposed Verification Coverage standard across industries, so operational outcomes are uncertain.

Claims (8)

Claim	Category	Direction	Confidence	Outcome	Details
A 53-percentage-point gap between internal representations and output correction shows that understanding may not translate into action.	AI Safety and Ethics	negative	high	gap between internal model representations and ability to correct outputs	53-percentage-point gap 0.06
A scoping review found that only 9.0% of FDA-approved AI/ML device documents contained a prospective post-market surveillance study.	Regulatory Compliance	negative	high	presence of prospective post-market surveillance study in FDA AI/ML device documents	9.0% 0.06
AI deployment in sensitive domains (health care, credit, employment, criminal justice) is often treated as unsafe to authorize until model internals can be explained.	Governance and Regulation	negative	high	authorization policy stance toward AI in sensitive domains (requirement for internal explanation before deployment)	0.03
This often leads to an excessive reliance on mechanistic interpretability to address a deployment challenge beyond its intended scope.	AI Safety and Ethics	negative	high	appropriateness of mechanistic interpretability as a gate for deployment	0.06
Model capability is uneven across nearby tasks, so authorization must attach to a specific use rather than to a model in general.	Governance and Regulation	positive	high	appropriateness of use-scoped authorization vs model-wide authorization	0.01
Societies have long governed opaque expertise through credentials, monitoring, liability, appeal, and revocation rather than mechanism-level explanation.	Governance and Regulation	positive	medium	prevalent governance mechanisms for opaque expertise	0.04
The gate to deploy should be 'calibrated verification': authorization should be domain-scoped, independently checkable, monitored after release, accountable, contestable, and revocable.	Governance and Regulation	positive	high	recommended features of deployment authorization regime	0.01
Verification Coverage, a six-component reportable standard with a minimum-composition rule, should sit beside capability scores in model cards, leaderboards, and regulatory disclosures.	Adoption Rate	positive	high	inclusion of 'Verification Coverage' standard alongside capability scores in reporting artifacts	0.01