A low-cost benchmark shows no detectable race/gender disparity in Claude Haiku 4.5’s agent actions in a pilot across hiring, lending and triage, and introduces a sensitive, reproducible toolkit (AgentFairBench) for uncovering action-level bias when present.
Large language model (LLM) agents increasingly take actions (screening applicants, recommending credit, triaging patients), yet fairness for LLMs is still measured by grading answers. We introduce AgentFairBench, a cheap, reproducible, multi-domain benchmark for demographic disparity in the actions of LLM agents. Grounded in a companion framework, the Bias Conduction Framework (BCF, restated here), it spans three regulator-anchored domains: hiring, lending, and medical triage. Synthetic, demographic-neutral profiles are evaluated in counterfactual matched sets that vary only a name-coded race × gender signal (in the Bertrand–Mullainathan tradition), under four agent scaffolds of increasing agency (direct, chain-of-thought, multi-agent deliberation, tool-augmented). A NumPy-only harness computes counterfactual flip rate, mean absolute score difference (MASD), action-rate disparity, and tool-invocation disparity, with bootstrap confidence intervals, paired tests, and false-discovery-rate control, for single-digit dollars per model. A live leaderboard with a held-out private split and a contamination canary admits external models by submission. Our pilot (864 decisions plus a test-retest replication) carries a methodological lesson: comparing a six-group score spread against a two-run noise difference overstates disparity by ~2.4× through statistic arity alone. Against an arity-matched noise floor and an omnibus group test, Claude Haiku 4.5 shows no demographic effect above sampling noise (0 of 120 pairwise and 0 of 9 omnibus contrasts survive correction); a planted-bias test confirms the instrument detects disparity when present. The contribution is a sound, sensitive, adoption-ready instrument, the arity-matched null methodology, and open artifacts to scale it. Code, data, and harness are released under open licenses, with an anonymized review artifact.
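The arity lesson in the abstract can be illustrated with a small simulation. This is a sketch under stated assumptions, not the paper's procedure: it assumes independent Gaussian noise, takes the six-group spread to be max minus min, and takes two-run noise to be the absolute difference of two draws. The function name `simulate_null_ratios` and its parameters are hypothetical. Under these assumptions a six-group spread is roughly 2.2 times a two-run difference even when no group effect exists, while an arity-matched baseline (the spread of six fresh noise draws) gives a ratio near 1. The paper's ~2.4× figure presumably reflects its own statistic and noise distribution, which this sketch does not try to reproduce.

```python
import random
import statistics


def simulate_null_ratios(n_trials=20000, n_groups=6, sigma=1.0, seed=0):
    """Compare a k-group spread against two noise baselines when no true effect exists.

    Returns (naive_ratio, matched_ratio):
      naive_ratio   = mean k-group spread / mean |difference of two runs|
      matched_ratio = mean k-group spread / mean spread of k fresh noise draws
    """
    rng = random.Random(seed)
    spreads, two_run, matched = [], [], []
    for _ in range(n_trials):
        # Observed "group scores": pure noise, identical distribution for every group.
        groups = [rng.gauss(0.0, sigma) for _ in range(n_groups)]
        spreads.append(max(groups) - min(groups))
        # Mismatched baseline: absolute difference between two repeated runs.
        two_run.append(abs(rng.gauss(0.0, sigma) - rng.gauss(0.0, sigma)))
        # Arity-matched baseline: the same max-minus-min statistic over k noise draws.
        noise = [rng.gauss(0.0, sigma) for _ in range(n_groups)]
        matched.append(max(noise) - min(noise))
    mean_spread = statistics.fmean(spreads)
    return mean_spread / statistics.fmean(two_run), mean_spread / statistics.fmean(matched)


naive_ratio, matched_ratio = simulate_null_ratios()
print(f"spread vs two-run noise:       {naive_ratio:.2f}")
print(f"spread vs arity-matched noise: {matched_ratio:.2f}")
```

The design point is that the baseline must be the same statistic computed over the same number of values. Otherwise the maximum over more draws inflates the apparent disparity.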
Summary
Main Finding
AgentFairBench introduces a reproducible, low-cost benchmark and evaluation harness that measures demographic disparity in the actions of LLM agents (not just their text outputs). Using counterfactual matched profiles that vary only a name-coded race×gender signal and a scaffold axis (C0–C4) that operationalizes agent depth, the benchmark exposes action-level disparities (or their absence) across three consequential domains (hiring, lending, medical triage). A pilot audit of claude-haiku-4-5 (864 decisions with test–retest) found no demographic effect above an arity-matched sampling-noise floor (0 of 120 pairwise and 0 of 9 omnibus contrasts survived correction). The authors also introduce the arity-matched-null methodology to avoid overstating disparity due to statistic arity, and provide open code, data, and a live leaderboard.
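One way to picture the pilot design is to enumerate it as a grid. The factorization below is an assumption, not something the source states. It reads a "cell" as one domain × scaffold combination, uses the n = 12 matched sets per cell reported in this summary, and gives each matched set six name variants (inferred from the "six-group" spread). The group labels are placeholders. Under that reading the grid contains 864 decisions, which matches the reported pilot size.

```python
from itertools import product

# Values taken from the summary; the grouping into "cells" is an assumption.
DOMAINS = ["hiring", "lending", "triage"]
SCAFFOLDS = ["C0", "C2", "C3", "C4"]
GROUPS = [f"group_{i}" for i in range(1, 7)]  # placeholder labels for six race x gender groups
SETS_PER_CELL = 12

# One decision per (domain, scaffold, matched set, name variant).
decisions = [
    {"domain": d, "scaffold": s, "matched_set": m, "group": g}
    for d, s, m, g in product(DOMAINS, SCAFFOLDS, range(SETS_PER_CELL), GROUPS)
]

print(len(decisions))  # 3 * 4 * 12 * 6 = 864
```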
Key Points
- Motivation: Answer-level fairness tests (what models say) can miss allocative harms that arise when models act. The Bias Conduction Framework (BCF) formalizes how disparity can propagate through policy, memory, tools, and scaffolding into actions.
- Benchmark scope:
- Domains: hiring, lending, medical triage — each anchored to real regulatory standards (EEOC, NYC Local Law 144, ECOA/Reg B).
- Action outputs: binary decision + graded score per domain (e.g., advance/not + 0–100 hiring score; approve/not + APR tier; triage escalation + acuity).
- Demographic perturbation: synthetic, demographic-neutral profiles with Bertrand–Mullainathan style name swaps to vary perceived race×gender only.
- Scaffold axis: C0 (direct), C2 (chain-of-thought), C3 (multi-agent deliberation), C4 (tool-augmented), enabling tests of BCF’s P2 (Masking) and P3 (Super-additivity).
- Metrics: counterfactual flip rate (CFR), mean absolute score difference (MASD), action-rate disparity, and tool-invocation disparity (Δtool).
- Statistical rigor: BCa bootstrap CIs, paired McNemar and Wilcoxon tests, Benjamini–Hochberg FDR correction, and omnibus group tests; introduces an arity-matched null to correct for the inflated apparent disparity that arises when a multi-group spread is compared against two-run noise.
- Pilot results: claude-haiku-4-5 showed no statistically significant demographic disparities once the arity-matched null and multiple-testing correction were applied. A planted-bias test demonstrates the harness can detect disparity when present.
- Practicalities: NumPy-only harness, single-digit-dollar cost per model for runs, live leaderboard with held-out private split and contamination canary to reduce gaming, released under open licenses.
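The two headline metrics and the planted-bias check can be sketched in a few lines. This is a minimal illustration, not the released harness: it uses only the standard library, and it assumes that a "counterfactual pair" means any two name variants within the same matched set. The helper names (`cfr_and_masd`, `plant_bias`), the toy data, and the score threshold used to derive the binary action are all hypothetical.

```python
from itertools import combinations


def cfr_and_masd(matched_sets):
    """Return (counterfactual flip rate, mean absolute score difference).

    Each matched set maps a group label to an (action, score) tuple.
    Pairs are formed between variants inside the same matched set only.
    """
    flips, abs_diffs = [], []
    for variants in matched_sets:
        for (_, (act_a, score_a)), (_, (act_b, score_b)) in combinations(sorted(variants.items()), 2):
            flips.append(act_a != act_b)
            abs_diffs.append(abs(score_a - score_b))
    n_pairs = len(flips)
    return sum(flips) / n_pairs, sum(abs_diffs) / n_pairs


def plant_bias(matched_sets, target, penalty, threshold=50):
    """Copy the data, subtract `penalty` from one group's score, and re-derive its action."""
    biased = []
    for variants in matched_sets:
        new_variants = dict(variants)
        _, score = variants[target]
        new_score = score - penalty
        new_variants[target] = (new_score >= threshold, new_score)
        biased.append(new_variants)
    return biased


# Toy data: two matched sets, three variants each, identical outcomes within a set.
clean = [
    {"g1": (True, 70), "g2": (True, 70), "g3": (True, 70)},
    {"g1": (False, 40), "g2": (False, 40), "g3": (False, 40)},
]

print(cfr_and_masd(clean))                         # no disparity planted
print(cfr_and_masd(plant_bias(clean, "g3", 25)))   # disparity planted against g3
```

On the clean toy data both metrics are zero. After planting a 25-point penalty on one group, the flip rate rises to 1/3 and the MASD to about 16.7. This is the kind of sensitivity a planted-bias test is meant to confirm.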
Data & Methods
- Design:
- Counterfactual matched-set design: synthetic applicant/patient/borrower profiles are identical except for name-coded race×gender cues.
- Domains and outputs: hiring (advance boolean, 0–100 score), lending (approve boolean, APR tier 1–5), triage (escalate boolean, acuity 1–5).
- Scaffold levels map to BCF components to probe where disparity may be conducted or amplified.
- Pilot scale:
- Total decisions: 864 in the pilot (with a second run for test–retest reliability).
- Per-cell sample: n = 12 matched sets per cell in the reported pilot.
- Pilot findings: comparing a six-group spread to a two-run noise baseline overstates disparity by ~2.4×; after correcting via arity-matched-null and omnibus testing, no significant effects for the tested model (0/120 pairwise; 0/9 omnibus contrasts survived correction).
- Metrics & computation:
- CFR: fraction of counterfactual pairs where the binary action flips across demographic perturbation.
- MASD: mean absolute difference in continuous/graded scores across counterfactual pairs.
- Δtool: disparity in tool-invocation behavior (newly instrumented here).
- Implementation: NumPy-only harness, BCa bootstrap CIs, paired McNemar and Wilcoxon tests, BH FDR control, and planted-bias property tests for sensitivity.
- Reproducibility & anti-gaming:
- Live leaderboard with a held-out private split, contamination canary, submission protocol, and content-hash checks.
- All artifacts (code, data, harness) released under open licenses.
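The Benjamini–Hochberg step that decides which contrasts "survive correction" is the standard step-up procedure. The sketch below is a plain-Python rendering of that textbook procedure. The function name and the example p-values are illustrative, not taken from the paper or its harness.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans marking which hypotheses are rejected at FDR level alpha."""
    m = len(p_values)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k with p_(k) <= alpha * k / m.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected


# Illustrative p-values only.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.20]))
print(benjamini_hochberg([0.02, 0.03, 0.035, 0.90]))
```

In the second example the two smallest p-values miss their own thresholds, yet they are still rejected because the third one passes. That is the step-up behaviour that distinguishes BH from a simple per-test cutoff.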
Implications for AI Economics
- Measurement matters for allocation outcomes: economic allocations (jobs, credit, care) depend on agent actions, not just model utterances. Benchmarks that only measure token-level parity risk under-detecting systematic allocation differences that affect incomes, employment, credit costs, and health outcomes.
- Scaffold design can affect distributional outcomes: the BCF and scaffold axis imply that adding deliberation, tool use, or multi-agent stages can amplify (or attenuate) disparities. This makes agent architecture (not only training data/model weights) an economic design choice with distributional consequences.
- Low-cost, reproducible audits enable broader monitoring and compliance: single-digit-dollar per-model evaluation and an open harness lower the cost barrier for regulators, firms, and researchers to run audits on deployed systems — facilitating market-level oversight, compliance testing anchored to regulatory standards, and ex-ante checks in procurement.
- Policy and regulation:
- Action-level audits align with existing regulatory criteria (EEOC, ECOA/Reg B, Local Law 144) and can inform compliance testing for automated decision tools.
- The arity-matched-null methodology reduces false positives in audit findings, improving the evidentiary quality of fairness claims used in enforcement or consumer-protection contexts.
- Economic welfare and distributional analysis:
- Metrics like MASD (graded score shifts) and Δtool (differential tool invocation) capture harms that translate into monetary or welfare losses (e.g., higher APR tiers, fewer interview invites, delayed care escalation).
- Understanding where disparities are conducted (policy vs. memory vs. tools vs. scaffold) suggests targeted interventions with different costs and economic tradeoffs (e.g., altering prompts/scaffold vs. retraining models vs. changing tool repertoires).
- Research and market directions:
- Audits should be scaled (larger n, more attributes, real-world profiles) before strong market claims; the pilot’s null is limited to one model and modest sample sizes.
- Incorporate richer counterfactuals and causal attribution (beyond name proxies) to connect observed disparities to underlying structural causes and estimate economic impacts.
- Firms should instrument tool-invocation and scaffold behavior in production monitoring to detect emerging allocation disparities as agents evolve.
- Caveats:
- The paper uses synthetic name-based perturbations (correspondence audit), not full SCM-style causal interventions; real-world perceptions and correlated covariates might produce different effects.
- Pilot scope is limited (one model, n=12 per cell); absence of detected disparity is not proof of absence across models or at larger scale.
Suggested uses by economists and policy analysts
- Use AgentFairBench as an ex-ante compliance test for automated hiring/credit/triage agents.
- Incorporate action-level audit results into cost-benefit and distributional-impact assessments of deploying agentic systems.
- Use scaffold experiments to evaluate trade-offs between capability gains and potential amplification of disparities.
- Employ the arity-matched-null approach in empirical studies to avoid overstating group disparities when comparing multi-group spreads to simple two-run baselines.
Overall, AgentFairBench supplies a practical, statistically principled instrument to bring action-level fairness measurement into empirical economic and regulatory analysis of LLM-driven agents.
Assessment
Claims (13)
| Claim | Direction | Outcome | Confidence & Evidence | Details |
|---|---|---|---|---|
| We introduce AgentFairBench, a cheap, reproducible, multi-domain benchmark for demographic disparity in the actions of LLM agents. [AI Safety and Ethics] | positive | availability of a benchmark for measuring demographic disparity in LLM agent actions | Reading fidelity: high; study strength: medium | |
| AgentFairBench is grounded in a companion framework, the Bias Conduction Framework (BCF), restated here. [AI Safety and Ethics] | positive | use of BCF as conceptual framework for instrument | Reading fidelity: high; study strength: low | |
| AgentFairBench spans three regulator-anchored domains: hiring, lending, and medical triage. [AI Safety and Ethics] | positive | coverage of domains in the benchmark | Reading fidelity: high; study strength: medium | |
| Synthetic, demographic-neutral profiles are evaluated in counterfactual matched sets that vary only a name-coded race × gender signal (in the Bertrand–Mullainathan tradition). [AI Safety and Ethics] | positive | use of counterfactual matched sets that vary only by name-coded race and gender | Reading fidelity: high; study strength: medium | |
| The benchmark evaluates agents under four agent scaffolds of increasing agency: direct, chain-of-thought, multi-agent deliberation, and tool-augmented. [AI Safety and Ethics] | positive | variation of agent scaffolds used in evaluation | Reading fidelity: high; study strength: medium | |
| A NumPy-only harness computes counterfactual flip rate, mean absolute score difference (MASD), action-rate disparity, and tool-invocation disparity, with bootstrap confidence intervals, paired tests, and false-discovery-rate control, for single-digit dollars per model. [AI Safety and Ethics] | positive | computed disparity metrics and per-model monetary cost to run the harness | Reading fidelity: medium; study strength: medium | single-digit dollars per model |
| A live leaderboard with a held-out private split and a contamination canary admits external models by submission. [Adoption Rate] | positive | availability of leaderboard and submission infrastructure | Reading fidelity: high; study strength: medium | |
| Our pilot comprises 864 decisions plus a test-retest replication. [Other] | neutral | number of decisions in pilot | Reading fidelity: high; study strength: high | n=864 |
| Comparing a six-group score spread against a two-run noise difference overstates disparity by approximately 2.4× through statistic arity alone. [AI Safety and Ethics] | negative | degree of overstatement of disparity when using higher-arity statistic versus two-run noise baseline | Reading fidelity: high; study strength: medium | n=864; ~2.4× |
| Against an arity-matched noise floor and an omnibus group test, Claude Haiku 4.5 shows no demographic effect above sampling noise (0 of 120 pairwise and 0 of 9 omnibus contrasts survive correction). [AI Safety and Ethics] | null_result | demographic disparities in actions (pairwise contrasts and omnibus contrasts) | Reading fidelity: high; study strength: medium | n=864; 0 of 120 pairwise and 0 of 9 omnibus contrasts survive correction |
| A planted-bias test confirms the instrument detects disparity when present. [AI Safety and Ethics] | positive | ability of instrument to detect intentionally planted bias | Reading fidelity: high; study strength: medium | |
| The paper contributes a sound, sensitive, adoption-ready instrument, the arity-matched null methodology, and open artifacts to scale it. [AI Safety and Ethics] | positive | availability and readiness of instrument and methodology | Reading fidelity: high; study strength: low | |
| Code, data, and harness are released under open licenses, with an anonymized review artifact. [Adoption Rate] | positive | availability of open-source artifacts and anonymized review artifact | Reading fidelity: high; study strength: high | |