Regulatory audit metrics can be gamed by routing content through equivalent variants, undermining safety claims; a provably minimal 'semantic-envelope' repair restores certificate guarantees in the authors' formal model and synthetic tests, though real-world validation remains needed.

Gaming the Metric, Not the Harm: Certifying Safety Audits against Strategic Platform Manipulation

Florian A. D. Burnat, Brittany I. Davidson · May 07, 2026

arXiv · theoretical · medium evidence · relevance 7/10

Scalar audit metrics that score content variants directly are manipulable by platforms; a classwise 'semantic-envelope' repair is the minimally conservative fix that yields a provable class-stratified certificate against gaming in the formal model and synthetic checks.

Online-safety regulation under the UK Online Safety Act and the EU Digital Services Act increasingly treats scalar metrics as compliance evidence. Once announced, such a metric also becomes an optimization target: a strategic platform can improve its score by routing recommendations through semantically equivalent content variants, without reducing true harm. We ask when such an audit metric can still certify a genuine reduction in harm. The protocol is modeled as a published transformation graph whose connected components form semantic classes, and the metric itself is treated as a security object. Three results follow. First, any metric that scores variants directly is manipulable as soon as two equivalent variants in a harmful class disagree in score. Second, the semantic-envelope lift, which assigns each variant the maximum score in its class, is the unique pointwise minimum among conservative classwise-constant repairs. Third, a class-stratified certificate, $H^\star(x) \le (1/\hat{\alpha}) M_{\mathrm{Env}(m)}(x) + \bar{\eta}$, holds for every platform strategy, with $\bar{\eta}$ absorbing annotation and protocol error. We check the claims at three levels: exhaustive enumeration on a finite-state grid of mixed strategies, an SMT encoding in Z3 cross-replayed in cvc5, and a bounded single-player MDP encoded in PRISM-games. The fragile metric fails manipulation invariance and cannot support the same useful predeclared class-coverage certificate; under the envelope-level certificate, it produces large violations at every tested instance, with a large mean gaming gap across random catalogs at a fixed audit budget. The semantic-envelope metric exhibits no such violation in the tested instances.

Summary

Main Finding

Auditors who publish scalar safety metrics for recommender systems risk Goodhart-style gaming: platforms can rearrange which semantically equivalent content variant is served to reduce the measured score without reducing actual harm. The paper shows that (1) any metric that scores variants directly is manipulable whenever within-class variant scores differ; (2) a conservative repair—the semantic-envelope, which assigns each variant the maximum score observed in its semantic class—is the unique pointwise-minimal classwise-constant fix that restores invariance to within-class manipulation; and (3) under this envelope one can derive a class-stratified, strategy-agnostic certificate bounding true harmful exposure up to published coverage and annotation error terms. Synthetic and formal-method stress tests show the fragile (direct-scoring) metric routinely violates useful certificates, while the semantic-envelope does not in tested instances.

Key Points

Policy context: Ofcom (UK) and EU/DSA guidance increasingly treat scalar exposure metrics as evidence of compliance for recommender systems; this converts metrics into optimization targets for platforms.
Threat model (narrow, explicit): platform-side within-class representation choice only — the auditor publishes a transformation graph (candidate variant pairs, attribute checklist, validation scores, threshold) that induces semantic classes; platform can choose distribution over variants but cannot change the published protocol.
Definitions:
- Manipulation invariance: metric depends only on semantic-class mass, not on which variant within a class is served.
- Certification: a (γ, β)-certificate guarantees H(x) ≤ γ M(x) + β for every admissible platform strategy x.
- Class-coverage certificate: auditor publishes per-class lower bounds (P_c) and a strictness ε; useful when τ/ε + β < 1 (τ = audit budget).
Fragility result: direct per-variant scoring (the “fragile metric”) fails manipulation invariance whenever two variants in the same class have different scores — i.e., trivial to game.
Semantic-envelope repair: Env(m)(v) = max_{u in cl(v)} m(u). Properties:
- Restores manipulation invariance (metric depends only on class mass).
- Is the unique pointwise-minimal conservative classwise-constant repair (it raises scores only as much as necessary to guarantee classwise constancy and never above the observed class max).
Certification theorem: given harm-pure classes and per-class coverage α̂ (min over harmful classes of the maximal observed score in each class), Env yields a certificate H*(x) ≤ (1/α̂) M_{Env(m)}(x) + η̄, where η̄ absorbs annotation and protocol disagreement. The paper extends this to imperfect (non-pure) classes via a disagreement mass Δ(x) that is published/upper-bounded.
Protocol sensitivity: tightening validation threshold or pruning edges refines the partition and weakly decreases envelope scores pointwise; auditors should publish sensitivity analyses (e.g., max path length, weakest-edge confidence) so transitive merging costs are visible.
Experiments & verification: the authors ran synthetic stress tests and cross-checked them using exhaustive enumeration on finite-state mixed-strategy grids, SMT encodings in Z3 and cvc5, and bounded single-player MDPs in PRISM-games. They compared the semantic envelope against fragile (direct) scoring and a class-mean repair.
- Empirical outcome: fragile metric repeatedly violated envelope-style certificates with large gaming gaps under fixed audit budgets; class-mean gave looser certificates; semantic-envelope showed no violations in tested instances.
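
The fragility result and the envelope repair can be illustrated with a minimal Python sketch. The three-variant catalog, its class names, and its scores are hypothetical values chosen for illustration, not data from the paper; the sketch also assumes nonnegative scores.

```python
from collections import defaultdict

# Hypothetical catalog: variant -> (semantic class, detector score m(v)).
CATALOG = {
    "a1": ("harmful_A", 0.9),
    "a2": ("harmful_A", 0.2),  # equivalent to a1, but scored lower
    "b1": ("benign_B", 0.0),
}

def envelope(catalog):
    """Env(m)(v): the maximum of m(u) over all u in the class of v.

    Assumes nonnegative scores, so 0.0 is a safe starting maximum.
    """
    class_max = defaultdict(float)
    for cls, score in catalog.values():
        class_max[cls] = max(class_max[cls], score)
    return {v: class_max[cls] for v, (cls, _) in catalog.items()}

def metric(strategy, scores):
    """M(x) = sum over variants of x_v * score(v)."""
    return sum(p * scores[v] for v, p in strategy.items())

direct = {v: s for v, (_, s) in CATALOG.items()}
env = envelope(CATALOG)

# Two strategies with identical class mass (60% harmful_A, 40% benign_B);
# they differ only in which harmful variant is served.
honest = {"a1": 0.6, "a2": 0.0, "b1": 0.4}
gamed = {"a1": 0.0, "a2": 0.6, "b1": 0.4}

print("direct  :", metric(honest, direct), metric(gamed, direct))
print("envelope:", metric(honest, env), metric(gamed, env))
```

In this toy instance the direct score drops when the platform switches from `a1` to `a2`, although the harmful class mass is unchanged, while the envelope score is the same for both strategies.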

Data & Methods

Modeling:
- Finite variant set V; published transformation family T0; attribute-preservation predicate A(v,u); human validation scores s(v,u); acceptance threshold ρ; admissible edge set E_ρ; semantic classes C = connected components of closure(E_ρ).
- Platform strategy x ∈ Δ(V) (probability distribution over variants), utility u(v) arbitrary; auditor reports metric M_m(x) = Σ_v x_v m(v) and enforces budget M_m(x) ≤ τ.
- Harm labels: ideal-case per-class harm h(c) (harm-pure classes) and more realistic latent per-variant harm h(v); disagreement mass Δ(x) = Σ_v x_v |h(v) − ĥ(cl(v))| captures labeling/protocol error.
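
As a rough sketch of how the published protocol induces semantic classes, the following Python computes connected components with a union-find structure and then applies the envelope. It assumes an edge is admissible when its validation score satisfies s ≥ ρ and, for brevity, ignores the attribute-preservation predicate A(v,u); the variants, scores, and edges are hypothetical.

```python
def semantic_classes(variants, scored_edges, rho):
    """Map each variant to a class representative.

    Classes are the connected components of the graph that keeps
    only edges with validation score >= rho (an assumed acceptance rule).
    """
    parent = {v: v for v in variants}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for (u, v), s in scored_edges.items():
        if s >= rho:
            parent[find(u)] = find(v)
    return {v: find(v) for v in variants}

def envelope(scores, cls):
    """Assign each variant the maximum score found in its class."""
    best = {}
    for v, c in cls.items():
        best[c] = max(best.get(c, scores[v]), scores[v])
    return {v: best[c] for v, c in cls.items()}

# Hypothetical variants, detector scores, and validated edges.
variants = ["v1", "v2", "v3", "v4"]
scores = {"v1": 0.9, "v2": 0.3, "v3": 0.1, "v4": 0.0}
scored_edges = {("v1", "v2"): 0.95, ("v2", "v3"): 0.70}

loose = envelope(scores, semantic_classes(variants, scored_edges, rho=0.6))
strict = envelope(scores, semantic_classes(variants, scored_edges, rho=0.9))
print(loose)
print(strict)
```

With the looser threshold, `v3` is merged transitively into the class of `v1` and inherits its score; with the stricter threshold the weak edge is pruned and the envelope score of `v3` falls back to its own, which matches the protocol-sensitivity monotonicity described in the Key Points.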
Theoretical results:
- Propositions and theorems proving manipulability of direct scoring, optimality/minimality of the semantic-envelope among conservative classwise-constant repairs, derivation of class-stratified certification inequalities, and monotonicity under protocol refinement.
- Extensions that quantify slack from imperfect annotation and transitive-closure risks (publishable η̄ and Δ(x) terms).
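
A short derivation sketch may help connect the envelope to the certificate. It rests on assumptions this summary does not spell out: scores are nonnegative, class-level harm labels are binary ($\hat{h}(c) \in \{0,1\}$), and $\hat{\alpha} > 0$. The symbols $\mathcal{C}_{\mathrm{harm}}$ (the set of classes labeled harmful) and $\hat{H}(x)$ (class-level harmful mass) are notation introduced here for illustration, and the paper's own proofs may proceed differently.

Minimality: let $m'$ be classwise constant with $m' \ge m$ pointwise. For any $u \in \mathrm{cl}(v)$,

$$m'(v) = m'(u) \ge m(u) \quad\Longrightarrow\quad m'(v) \ge \max_{u \in \mathrm{cl}(v)} m(u) = \mathrm{Env}(m)(v).$$

Class-level bound: write $\hat{H}(x) = \sum_{v} x_v\, \hat{h}(\mathrm{cl}(v)) = \sum_{v:\, \mathrm{cl}(v) \in \mathcal{C}_{\mathrm{harm}}} x_v$. Since $\mathrm{Env}(m)(v) \ge \hat{\alpha}$ on harmful classes and $\mathrm{Env}(m)(v) \ge 0$ elsewhere,

$$M_{\mathrm{Env}(m)}(x) = \sum_{v} x_v\, \mathrm{Env}(m)(v) \;\ge\; \hat{\alpha} \sum_{v:\, \mathrm{cl}(v) \in \mathcal{C}_{\mathrm{harm}}} x_v \;=\; \hat{\alpha}\, \hat{H}(x).$$

Imperfect classes: with latent harm $h(v)$ and $\Delta(x) = \sum_v x_v\, |h(v) - \hat{h}(\mathrm{cl}(v))|$, the triangle inequality gives

$$H^\star(x) = \sum_v x_v\, h(v) \;\le\; \hat{H}(x) + \Delta(x) \;\le\; \frac{1}{\hat{\alpha}}\, M_{\mathrm{Env}(m)}(x) + \bar{\eta} \quad \text{whenever } \Delta(x) \le \bar{\eta}.$$

None of these steps depends on which variant within a class the platform serves, which is why the bound holds for every strategy $x$.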
Empirical / formal verification:
- Reproducible synthetic catalogs with harmful classes containing variants with different detector scores.
- Solved adversarial platform best-responses (LP/MDP) under audit budget constraints to produce worst-case violations.
- Cross-checked encodings: exhaustive enumeration on finite grids, SMT in Z3 and cvc5, bounded-MDP via PRISM-games.
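
The exhaustive-enumeration check can be sketched on a toy scale. The following Python enumerates every mixed strategy on a grid with step 1/10 and finds the worst-case harmful mass a platform can reach under a fixed audit budget. The instance (variants, scores, budget, grid resolution) is hypothetical, the classes are assumed harm-pure so that η̄ = 0, and this is not the authors' encoding.

```python
from fractions import Fraction
from itertools import product

# Hypothetical harm-pure instance (exact rational arithmetic throughout).
CLASS_OF = {"a1": "A", "a2": "A", "b1": "B"}
HARMFUL = {"A"}
SCORE = {"a1": Fraction(9, 10), "a2": Fraction(1, 10), "b1": Fraction(0)}
TAU = Fraction(1, 10)   # audit budget
STEPS = 10              # grid resolution: probabilities are multiples of 1/10

def envelope(score, class_of):
    """Assign each variant the maximum score in its class."""
    best = {}
    for v, c in class_of.items():
        best[c] = max(best.get(c, score[v]), score[v])
    return {v: best[class_of[v]] for v in class_of}

def grid_strategies(variants, steps):
    """Yield all distributions whose probabilities are multiples of 1/steps."""
    for counts in product(range(steps + 1), repeat=len(variants)):
        if sum(counts) == steps:
            yield {v: Fraction(k, steps) for v, k in zip(variants, counts)}

def metric(x, score):
    return sum(x[v] * score[v] for v in x)

def harm(x):
    """Harmful exposure: total mass placed on harmful classes."""
    return sum(p for v, p in x.items() if CLASS_OF[v] in HARMFUL)

def worst_case_harm(score, tau):
    """Largest harmful mass among grid strategies with M(x) <= tau."""
    return max(harm(x) for x in grid_strategies(list(CLASS_OF), STEPS)
               if metric(x, score) <= tau)

ENV = envelope(SCORE, CLASS_OF)
ALPHA_HAT = min(ENV[v] for v in CLASS_OF if CLASS_OF[v] in HARMFUL)

fragile_worst = worst_case_harm(SCORE, TAU)
envelope_worst = worst_case_harm(ENV, TAU)
bound = TAU / ALPHA_HAT  # certificate level (1/alpha_hat) * tau, with eta_bar = 0

print("fragile worst-case harm :", fragile_worst)
print("envelope worst-case harm:", envelope_worst)
print("certificate bound       :", bound)
```

In this instance the platform can satisfy the budget under direct scoring while placing all of its mass on the low-scoring harmful variant, so the worst-case harm exceeds the bound; under the envelope, the same budget keeps the worst-case harm below the bound. Grid enumeration only covers strategies on the grid, so a check like this complements rather than replaces the formal proofs.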

Implications for AI Economics

Measurement-as-instrument creates incentives: treating a scalar metric as compliance evidence makes it an instrument that firms will optimize against, not just a passive report. This alters platform incentives and creates a regulatory-design externality that must be internalized by auditors and regulators.
Audits are strategic games; measurement robustness matters economically:
- Direct-scoring metrics lower audit enforcement effectiveness and can render regulatory thresholds vacuous, allowing platforms to maintain utility while masking harm via representation choices.
- The semantic-envelope is economically attractive because it prevents simple low-cost gaming (variant switching) and yields stronger pre-declared certificates that regulators can credibly rely on.
Trade-offs and costs:
- Conservatism vs. looseness: the envelope raises measured scores to the per-class maximum, so a fixed audit budget τ binds more tightly and reported exposure is higher. This can impose greater compliance costs on platforms, or require regulators to recalibrate thresholds if the same allowed exposures are to remain attainable.
- Audit design costs: building and validating transformation families, running human validation studies, and publishing sensitivity statistics (path lengths, weakest-edge confidences) are non-trivial. Regulators must budget for these institutional costs.
- Enforcement/operational cost on platforms: platforms may invest in representation engineering to circumvent audits where possible (if protocols are weak), or conversely invest in reducing truly harmful exposure if audits are robust.
Market structure and competition:
- Platforms with superior control over representation (better paraphrasing, thumbnail selection, localized renderings) can game fragile metrics more easily, potentially distorting competition if regulators do not adopt robust repairs.
- Robust auditing raises barriers to low-cost gaming and may favor platforms that actually reduce harm, shifting market incentives toward safer designs.
Regulatory capture and protocol bargaining risk:
- The model presumes a published protocol; in practice, platforms may lobby to shape T0, A, s, ρ. The paper explicitly excludes protocol renegotiation from its core model; economically, this is a key vulnerability—protocol design itself is a bargaining object with potential for capture.
Policy recommendations with economic interpretation:
- Treat audit metrics as security objects and publish the full protocol (transformations, validation scores, threshold) so that the audit surface is explicit and not manipulable by secrecy.
- Use semantic-envelope (max-over-class) repairs for classwise invariance; publish per-class coverage α̂ and an upper bound on disagreement mass η̄ so certificates are meaningful and auditable.
- Require auditors to publish sensitivity analyses (vary ρ, report weakest edge/confidence and max path length) so the marginal cost of transitive merging and annotation noise is transparent—this reduces hidden uncertainty that platforms could exploit.
Open economic research directions:
- Modeling multi-platform competition where platforms can strategically relocate harmful content across platforms or catalogs.
- Endogenizing protocol design: studying regulator–platform bargaining over T0, A, s, ρ and implications for welfare, enforcement efficacy, and potential capture.
- Quantifying costs of audit construction and human validation, and optimizing audit budgets τ versus statistical power/robustness.
- Dynamic arms races: platforms may invest in automated paraphrase generation to produce low-scoring admissible variants; dynamic models of investment and enforcement would illuminate long-run equilibria.
Practical takeaway for regulators and economists: scalar safety metrics must be hardened against obvious optimization channels; conservative, publishable repairs like the semantic-envelope provide a principled, minimally invasive way to restore manipulation invariance and produce credible predeclared certificates, but they change compliance incentives and impose measurable costs that regulators and policymakers should budget for and analyze.

Assessment

Paper Type: theoretical
Evidence Strength: medium. The paper presents formal theoretical results (provable manipulability and a unique conservative repair) and cross-validated simulation checks (exhaustive finite-state enumeration, SMT encodings in Z3/cvc5, and PRISM-games MDP traces). However, it lacks real-world field or observational data tying the metric behavior to deployed platform actions and real harms, so empirical external validity is limited.
Methods Rigor: high. The authors supply formal proofs, characterize uniqueness properties, and validate claims using three independent verification modalities (complete enumeration on finite grids, SMT solver cross-validation, and bounded MDP model checking), which together provide strong internal rigor and reproducibility for the formal model and synthetic experiments.
Sample: Synthetic instances and models: finite-state grid of mixed strategies and random content catalogs, SMT encodings replayed in Z3 and cvc5, and a bounded single-player MDP encoded in PRISM-games; no observational field data or live platform logs were used.
Themes: governance adoption
Generalizability:
- Results are proven for the formal model and tested on synthetic/small-state instances; behavior on real-world, large-scale recommendation systems is untested.
- Assumes a published transformation graph and known semantic classes; real semantic equivalence may be noisy or unknown.
- Annotation and protocol errors are absorbed into a bounded term (η̄), but real-world annotation noise distributions may violate assumptions.
- Platform strategic behavior in practice may include richer dynamics (multi-agent competition, temporal adaptation, or unmodeled incentives) absent from the bounded MDP and finite-grid checks.

Claims (8)

Claim	Category	Direction	Confidence	Outcome	Details
Online-safety regulation under the UK Online Safety Act and the EU Digital Services Act increasingly treats scalar metrics as compliance evidence.	Governance And Regulation	null_result	high	regulatory_compliance	0.06
Once announced, such a metric becomes an optimization target: a strategic platform can improve its score by routing recommendations through semantically equivalent content variants, without reducing true harm.	Governance And Regulation	negative	high	regulatory_compliance	0.12
Any metric that scores variants directly is manipulable as soon as two equivalent variants in a harmful class disagree in score.	Governance And Regulation	negative	high	regulatory_compliance	0.2
The semantic-envelope lift, which assigns each variant the maximum score in its class, is the unique pointwise minimum among conservative classwise-constant repairs.	Governance And Regulation	positive	high	regulatory_compliance	0.2
A class-stratified certificate H*(x) ≤ (1/α̂) M_{Env(m)}(x) + η̄ holds for every platform strategy, with η̄ absorbing annotation and protocol error.	Governance And Regulation	positive	high	regulatory_compliance	H*(x) ≤ (1/α̂) M_{Env(m)}(x) + η̄ 0.2
The paper checks the claims at three levels: exhaustive enumeration on a finite-state grid of mixed strategies; an SMT encoding in Z3 cross-replayed in cvc5; and a bounded single-player MDP encoded in PRISM-games.	Research Productivity	positive	high	research_productivity	0.12
The fragile metric fails manipulation invariance and cannot support the same useful predeclared class-coverage certificate; under the envelope-level certificate, it produces large violations at every tested instance, with a large mean gaming gap across random catalogs at a fixed audit budget.	Governance And Regulation	negative	medium	regulatory_compliance	large mean gaming gap (not numerically specified in abstract) 0.07
The semantic-envelope metric exhibits no such violation in the tested instances.	Governance And Regulation	positive	high	regulatory_compliance	0.12