The Commonplace

New York’s mandated bias audits for hiring algorithms are weakened by major demographic data gaps—missingness in audit reports ranges from under 3% to over 50%—undermining the validity of reported fairness metrics. The authors argue many audits risk being symbolic and recommend using audit outputs for red‑teaming, improved data quality, and stronger oversight to make audits meaningful.

Towards Using AI Bias Audits As Inputs For Red Teaming And Performance
Obi Ogbanufe · Fetched May 31, 2026 · Journal of the Association for Information Systems
openalex · descriptive · low evidence · 7/10 relevance
Analysis of New York City LL144 bias audits reveals wide variation in demographic data missingness (under 3% to over 50%), which undermines calculated fairness metrics and suggests many audits may amount to symbolic compliance rather than effective accountability.

AI-enabled hiring systems are widely adopted, yet their fairness remains uncertain. New York City’s Local Law 144 mandates annual bias audits to increase transparency. However, the effectiveness of these audits remains unclear. An analysis of LL144 audit reports reveals demographic missingness, from under 3% to over 50%, which reduces the applicant pool used for fairness calculation and undermines the metrics. Using institutional theory, we argue that such limitations reflect symbolic compliance, while stewardship theory highlights the potential for deeper accountability. We propose leveraging audit outputs as red-teaming inputs to stress-test fairness robustness and strengthen AI governance through improved data quality and oversight.

Summary

Main Finding

Public bias-audit reports mandated by New York City’s Local Law 144 frequently omit large shares of applicant demographic data (from under 3% up to >50%). This demographic missingness can materially undermine fairness metrics reported in audits. The paper shows that treating these audit outputs as inputs for red-teaming (adversarial sensitivity testing) reveals fragility in compliance conclusions and can be used to move organizations from symbolic compliance toward stewardship-oriented, substantive AI governance.

Key Points

  • Empirical observation: Demographic missingness rates (DMR = unknown / total assessed) vary widely across LL144 audit reports (examples: Burlington ~0.5–3.9%; several ADP audits 27–59%; RippleMatch 2024 race DMR = 54.6%).
  • Consequence: LL144 calculations exclude applicants with unknown demographics, so high missingness reduces effective sample sizes and can mask disparate impact.
  • Sensitivity example: RippleMatch 2024 — reported Asian impact ratio 0.854 (passes 4/5ths rule). If 20,000 of the unknown applicants were Asian with a lower scoring rate, the recalculated ratio falls to 0.787 (fails). Thus audit conclusions can flip under plausible reassignments of missing data.
  • Theoretical framing:
    • Institutional theory: Audits can become symbolic artifacts (decoupling) used to claim compliance without addressing underlying data or governance weaknesses.
    • Stewardship theory: Organizations motivated as stewards will use audits proactively (e.g., red-teaming) to find and mitigate hidden harms.
  • Proposal: Use audit outputs (especially measures of missingness) as adversarial inputs for red-teaming / stress testing to (a) surface fragile fairness claims, (b) trigger data-quality and governance improvements, and (c) improve stakeholder trust.
  • Propositions for future testing:
    • Red-teaming audit artifacts improves robustness of fairness outcomes.
    • Red-teaming reduces demographic missingness by motivating better data collection/reporting.
    • Red-teaming–driven improvements enhance governance and applicant trust.
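
The sensitivity example in the Key Points can be sketched in a few lines. This is a minimal illustration, not the authors' code. It assumes the impact ratio is a group's selection (or scoring) rate divided by the highest group rate, and that the 4/5ths rule flags ratios below 0.8. The group names, the `Group` class, the helper functions, and all counts are hypothetical; they are not the RippleMatch figures.

```python
from dataclasses import dataclass

FOUR_FIFTHS = 0.8  # threshold of the 4/5ths rule


@dataclass
class Group:
    assessed: int  # applicants known to belong to this group
    selected: int  # of those, how many were selected (or scored above the cutoff)

    @property
    def rate(self) -> float:
        return self.selected / self.assessed


def impact_ratios(groups: dict[str, Group]) -> dict[str, float]:
    """Each group's rate divided by the highest group rate."""
    best = max(g.rate for g in groups.values())
    return {name: g.rate / best for name, g in groups.items()}


def reassign_unknowns(group: Group, n_unknown: int, unknown_rate: float) -> Group:
    """Red-team scenario: treat n_unknown 'unknown' applicants as members of
    this group, selected at an assumed rate unknown_rate."""
    extra_selected = round(n_unknown * unknown_rate)
    return Group(group.assessed + n_unknown, group.selected + extra_selected)


# Hypothetical counts, chosen only to show a pass turning into a fail.
groups = {
    "A": Group(assessed=10_000, selected=5_000),  # rate 0.500 (reference)
    "B": Group(assessed=4_000, selected=1_700),   # rate 0.425
}
before = impact_ratios(groups)["B"]

# Assume 2,000 unknown applicants were in group B and were selected at 30%.
scenario = dict(groups, B=reassign_unknowns(groups["B"], 2_000, 0.30))
after = impact_ratios(scenario)["B"]

print(f"before: {before:.3f} passes={before >= FOUR_FIFTHS}")
print(f"after:  {after:.3f} passes={after >= FOUR_FIFTHS}")
```

With these made-up inputs the ratio for group B moves from 0.850 to about 0.767, so the audit conclusion flips once the excluded applicants are counted. The same mechanism drives the paper's RippleMatch example.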

Data & Methods

  • Data source: Public bias-audit reports filed under NYC Local Law 144.
  • Initial corpus: 30 independent bias-audits covering hiring-related AEDTs (examples: ADP, SmartAssistant, RippleMatch, Paradox, PLUM, HireVue, Burlington).
  • Inclusion: Removed 17 reports due to use of synthetic data or missing applicant-pool reporting → final sample of 16 audits analyzed.
  • Key metric: Demographic missingness rate (DMR) computed separately for sex and race (DMR = Unknown / Total Assessed).
  • Descriptive analysis: Tabulated total assessed, unknown sex, unknown race, and computed DMRs for each report (range: ~0.5% to >59%).
  • Sensitivity analysis / red-team experiment: Reassign plausible shares of the unknown-demographic pool to specific groups and recompute impact ratios (e.g., RippleMatch 2024 scenario where adding 20,000 Asians with a lower scoring rate changes impact ratio from pass to fail under the 4/5ths rule).
  • Interpretive lens: Institutional and stewardship theories used to infer organizational incentives behind audit quality and to motivate using audits for adversarial governance testing.
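
The descriptive step above (DMR = Unknown / Total Assessed, computed separately for sex and race) can be illustrated as follows. This is a sketch under stated assumptions: the report names, field names, and counts are hypothetical placeholders rather than values from the audited reports, and `effective_n_race` is an illustrative name for the pool left after unknowns are excluded.

```python
def dmr(unknown: int, total_assessed: int) -> float:
    """Demographic missingness rate: unknown / total assessed."""
    if total_assessed <= 0:
        raise ValueError("total_assessed must be positive")
    return unknown / total_assessed


# Hypothetical audit-report rows (not taken from any real LL144 filing).
reports = [
    {"report": "hypothetical-report-1", "total_assessed": 20_000,
     "unknown_sex": 400, "unknown_race": 700},
    {"report": "hypothetical-report-2", "total_assessed": 50_000,
     "unknown_sex": 15_000, "unknown_race": 27_500},
]


def tabulate(rows: list[dict]) -> list[dict]:
    """Compute sex and race DMR plus the pool left for race-based metrics."""
    out = []
    for r in rows:
        out.append({
            "report": r["report"],
            "dmr_sex": dmr(r["unknown_sex"], r["total_assessed"]),
            "dmr_race": dmr(r["unknown_race"], r["total_assessed"]),
            # Applicants with unknown race are excluded from the calculation.
            "effective_n_race": r["total_assessed"] - r["unknown_race"],
        })
    return out


table = tabulate(reports)
for row in table:
    print(f"{row['report']}: sex DMR {row['dmr_sex']:.1%}, "
          f"race DMR {row['dmr_race']:.1%}, "
          f"effective n (race) {row['effective_n_race']:,}")
```

In the second hypothetical row, more than half of the assessed applicants drop out of the race-based calculation. That is the situation the paper flags as undermining the reported metrics.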

Implications for AI Economics

  • Compliance vs substantive governance costs:
    • Audits are a regulatory cost; if produced symbolically (high missingness, synthetic datasets), they produce weak social value. Regulators and firms face a trade-off between the marginal cost of higher-quality audits/red-teaming and the social benefits of more reliable fairness assessments.
    • Investing in red-teaming and improved demographic collection raises short-run costs (data-collection, privacy management, independent testing) but reduces longer-run legal, reputational, and labor-market frictions.
  • Incentives and market signaling:
    • Audit reports are a credibility signal in the market for hiring tools. If audits are routinely fragile, vendors can obtain market advantage via superficially compliant reports (adverse selection). Firms that adopt stewardship (transparent missingness, sensitivity testing) may earn a trust premium, differentiating on governance quality.
    • Vendors might prefer reporting modalities (e.g., synthetic datasets, ambiguous reporting) that minimize visible risk—regulatory design should counteract this by standardizing required disclosures (including DMR) and sensitivity testing.
  • Data quality externalities and privacy trade-offs:
    • Improving demographic completeness typically requires collecting more applicant data, which imposes privacy and compliance costs (consent management, storage, legal risk). There is an economic trade-off between the value of more precise fairness measures and the costs/risks of collecting sensitive attributes.
  • Labor-market allocation and productivity:
    • Biased or poorly-assessed hiring systems can misallocate labor (e.g., rejecting qualified candidates from certain groups), reducing aggregate productivity and increasing social welfare losses. Reliable audits and red-teaming can reduce these allocational inefficiencies.
  • Policy and regulatory design implications:
    • Mandates should require disclosure of demographic missingness and standardized sensitivity analyses (or require red-teaming outputs) so stakeholders can assess robustness, reducing information asymmetry in the market for AEDTs.
    • Regulators could lower long-term enforcement costs and market inefficiencies by incentivizing stewardship behaviors (e.g., safe-harbor for firms that publish robust red-team results; subsidies/credits for small firms to perform independent red-teaming).
  • Vendor competition and contracting:
    • Contract terms between employers and vendors should internalize incentives for high-quality audits (e.g., warranty clauses, penalties for undisclosed high missingness, provisions for third-party red-team testing). Firms that internalize these governance investments lower expected liability and may access higher-quality applicant pools.
  • Measurement / research implications for AI economics:
    • Published audit artifacts can be used as low-cost observables to study market behavior, investment in governance, and the incidence of symbolic compliance versus substantive mitigation—enabling further empirical work on the economic impacts of AI governance policies.
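
One possible form for the standardized sensitivity analysis mentioned under policy design is a single "fragility" number: the smallest count of unknown-demographic applicants that, if reassigned to a group, would push that group below the 4/5ths threshold. The sketch below is an assumption-laden illustration, not a method from the paper. The function name, parameters, and counts are hypothetical, the reference group's rate is held fixed, and exact fractions are used so the boundary case at exactly 0.8 is not misclassified by floating-point rounding.

```python
from fractions import Fraction
from typing import Optional


def min_unknowns_to_fail(
    assessed: int,
    selected: int,
    reference_rate: Fraction,
    unknown_rate: Fraction,
    n_unknown: int,
    threshold: Fraction = Fraction(4, 5),
) -> Optional[int]:
    """Smallest k <= n_unknown such that adding k unknown applicants,
    selected at unknown_rate, drops the impact ratio below threshold.
    Returns 0 if the group already fails and None if no k achieves it."""
    target = threshold * reference_rate
    for k in range(n_unknown + 1):
        # Expected selections among the k reassigned applicants may be
        # fractional; that is acceptable for a sensitivity bound.
        rate = (selected + k * unknown_rate) / (assessed + k)
        if rate < target:
            return k
    return None


# Hypothetical inputs: group rate 0.425 versus reference rate 0.5 (ratio 0.85).
k_flip = min_unknowns_to_fail(
    assessed=4_000, selected=1_700,
    reference_rate=Fraction(1, 2),
    unknown_rate=Fraction(3, 10),
    n_unknown=3_000,
)
print(k_flip)  # 1001

# If the unknowns were selected at a rate above the target, no flip occurs.
no_flip = min_unknowns_to_fail(
    assessed=4_000, selected=1_700,
    reference_rate=Fraction(1, 2),
    unknown_rate=Fraction(9, 20),
    n_unknown=3_000,
)
print(no_flip)  # None
```

Reporting such a number alongside the DMR would let readers see how much of the unknown pool needs to differ from the known pool before a "pass" becomes a "fail". In the first hypothetical case it is about a third of the 3,000 unknowns.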


Assessment

Paper Type: descriptive

Evidence Strength: low. The paper presents descriptive analysis of self-reported LL144 audit documents documenting demographic missingness, but does not establish causal links between audits and outcomes (e.g., improved fairness or enforcement), lacks access to underlying applicant-level data, and is vulnerable to reporting bias and limited scope.

Methods Rigor: medium. The study systematically inspects official audit reports and quantifies demographic missingness (with reported ranges), and applies institutional and stewardship theory to interpret findings; however, it relies on heterogeneous, self-reported documents without clear sampling/frame details, lacks verification against raw data or external benchmarks, and does not employ stronger quantitative identification or robustness checks.

Sample: Employer-submitted Local Law 144 bias audit reports for AI-enabled hiring systems in New York City (aggregate-level audit outputs analyzed; reported demographic missingness across reports ranged from under 3% to over 50%); exact number of reports and time window not specified in summary.

Themes: governance, labor_markets

Generalizability:

  • Limited to New York City employers subject to LL144 and their self-reported audit filings
  • Findings may not generalize to jurisdictions without mandatory audits or with different reporting formats/standards
  • Heterogeneity across firms (size, sector, vendor vs. in-house systems) may limit representativeness
  • Results describe audit reports, not direct evaluation of underlying hiring models or real-world hiring outcomes
  • Reporting bias and varying audit quality reduce ability to extrapolate to all AI hiring systems

Claims (6)

Claim · Direction · Confidence · Outcome · Details

AI-enabled hiring systems are widely adopted. · Adoption Rate · positive · high · adoption of AI-enabled hiring systems · 0.09

The fairness of AI-enabled hiring systems remains uncertain. · AI Safety And Ethics · null_result · high · fairness of AI-enabled hiring systems · 0.09

New York City’s Local Law 144 mandates annual bias audits to increase transparency. · Governance And Regulation · null_result · high · annual bias audit mandate (LL144) · 0.3

An analysis of LL144 audit reports reveals demographic missingness ranging from under 3% to over 50%, which reduces the applicant pool used for fairness calculation and undermines the metrics. · AI Safety And Ethics · negative · high · demographic data completeness (missingness) and its impact on fairness metric reliability · under 3% to over 50% missingness · 0.18

The limitations in the audit reports reflect symbolic compliance (per institutional theory), while stewardship theory highlights potential for deeper accountability. · Governance And Regulation · mixed · high · interpretation of organizational motives (symbolic compliance vs. stewardship/accountability) · 0.03

Audit outputs can be leveraged as red-teaming inputs to stress-test fairness robustness and strengthen AI governance through improved data quality and oversight (proposed intervention). · Governance And Regulation · positive · high · robustness of fairness assessments and strength of AI governance · 0.03

Notes