AgentFairBench: Do LLM Agents Discriminate When They Act?

Large language model (LLM) agents increasingly take actions (screening applicants, recommending credit, triaging patients), yet fairness for LLMs is still measured by grading answers. We introduce AgentFairBench, a cheap, reproducible, multi-domain benchmark for demographic disparity in the actions of LLM agents. Grounded in a companion framework, the Bias Conduction Framework (BCF, restated here), it spans three regulator-anchored domains: hiring, lending, and medical triage. Synthetic, demographic-neutral profiles are evaluated in counterfactual matched sets that vary only a name-coded race×gender signal (in the Bertrand–Mullainathan tradition), under four agent scaffolds of increasing agency (direct, chain-of-thought, multi-agent deliberation, tool-augmented). A NumPy-only harness computes counterfactual flip rate, mean absolute score difference (MASD), action-rate disparity, and tool-invocation disparity, with bootstrap confidence intervals, paired tests, and false-discovery-rate control, for single-digit dollars per model. A live leaderboard with a held-out private split and a contamination canary admits external models by submission. Our pilot (864 decisions plus a test-retest replication) carries a methodological lesson: comparing a six-group score spread against a two-run noise difference overstates disparity by ~2.4× through statistic arity alone. Against an arity-matched noise floor and an omnibus group test, Claude Haiku 4.5 shows no demographic effect above sampling noise (0 of 120 pairwise and 0 of 9 omnibus contrasts survive correction); a planted-bias test confirms the instrument detects disparity when present. The contribution is a sound, sensitive, adoption-ready instrument, the arity-matched null methodology, and open artifacts to scale it. Code, data, and harness are released under open licenses, with an anonymized review artifact.

Summary

Main Finding

AgentFairBench introduces a reproducible, low-cost benchmark and evaluation harness that measures demographic disparity in the actions of LLM agents (not just their text outputs). Using counterfactual matched profiles that vary only a name-coded race×gender signal and a scaffold axis (C0–C4) that operationalizes agent depth, the benchmark exposes action-level disparities (or their absence) across three consequential domains (hiring, lending, medical triage). A pilot audit of claude-haiku-4-5 (864 decisions with test–retest) found no demographic effect above an arity-matched sampling-noise floor (0 of 120 pairwise and 0 of 9 omnibus contrasts survived correction). The authors also introduce the arity-matched-null methodology to avoid overstating disparity due to statistic arity, and provide open code, data, and a live leaderboard.

Key Points

Motivation: Answer-level fairness tests (what models say) can miss allocative harms that arise when models act. The Bias Conduction Framework (BCF) formalizes how disparity can propagate through policy, memory, tools, and scaffolding into actions.
Benchmark scope:
- Domains: hiring, lending, medical triage — each anchored to real regulatory standards (EEOC, NYC Local Law 144, ECOA/Reg B).
- Action outputs: binary decision + graded score per domain (e.g., advance/not + 0–100 hiring score; approve/not + APR tier; triage escalation + acuity).
- Demographic perturbation: synthetic, demographic-neutral profiles with Bertrand–Mullainathan style name swaps to vary perceived race×gender only.
- Scaffold axis: C0 (direct), C2 (chain-of-thought), C3 (multi-agent deliberation), C4 (tool-augmented), enabling tests of BCF’s P2 (Masking) and P3 (Super-additivity).
Metrics: counterfactual flip rate (CFR), mean absolute score difference (MASD), action-rate disparity, and tool-invocation disparity (Δtool).
Statistical rigor: BCa bootstrap CIs, paired McNemar and Wilcoxon tests, Benjamini–Hochberg FDR correction, and omnibus group tests. The paper also introduces an arity-matched null, which corrects for the inflated apparent disparity that arises when a multi-group spread is compared against a two-run noise difference.
Pilot results: claude-haiku-4-5 showed no statistically significant demographic disparities once the arity-matched null and multiple-testing correction were applied. A planted-bias test demonstrates that the harness can detect disparity when it is present.
Practicalities: NumPy-only harness, single-digit-dollar cost per model for runs, live leaderboard with held-out private split and contamination canary to reduce gaming, released under open licenses.
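
The Benjamini–Hochberg correction named under statistical rigor can be sketched in a few lines. This is a minimal, standard-library illustration of the textbook step-up procedure, not the benchmark's own code; the function name `benjamini_hochberg` and the default level `q=0.05` are assumptions made for the example.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at FDR level q.

    Hypothetical helper for illustration; not taken from the AgentFairBench harness.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the p-values in ascending order of p.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k with p_(k) <= (k / m) * q.
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            largest_k = rank
    # Reject every hypothesis whose rank is at or below that k.
    rejected = [False] * m
    for idx in order[:largest_k]:
        rejected[idx] = True
    return rejected
```

Because the rule is step-up, a p-value that misses its own threshold can still be rejected when a larger p-value further down the ranking meets its threshold. A result such as "0 of 120 pairwise contrasts survive correction" corresponds to `largest_k` being zero for that family of tests.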

Data & Methods

Design:
- Counterfactual matched-set design: synthetic applicant/patient/borrower profiles are identical except for name-coded race×gender cues.
- Domains and outputs: hiring (advance boolean, 0–100 score), lending (approve boolean, APR tier 1–5), triage (escalate boolean, acuity 1–5).
- Scaffold levels map to BCF components to probe where disparity may be conducted or amplified.
Pilot scale:
- Total decisions: 864 in the pilot (with a second run for test–retest reliability).
- Per-cell sample: n = 12 matched sets per cell in the reported pilot.
- Pilot findings: comparing a six-group spread to a two-run noise baseline overstates disparity by ~2.4×; after correcting via arity-matched-null and omnibus testing, no significant effects for the tested model (0/120 pairwise; 0/9 omnibus contrasts survived correction).
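
The arity effect can be illustrated with a toy simulation. The sketch below assumes independent standard-normal noise and no true group effect, which is a simplification and not the paper's data or procedure; `mean_spread` and `spread_ratio` are hypothetical names. It compares the expected spread (max minus min) of six noise draws with the expected spread of two.

```python
import random
import statistics


def mean_spread(n_draws, n_trials, rng):
    """Average max-minus-min over n_trials sets of n_draws pure-noise values."""
    spreads = []
    for _ in range(n_trials):
        draws = [rng.gauss(0.0, 1.0) for _ in range(n_draws)]
        spreads.append(max(draws) - min(draws))
    return statistics.fmean(spreads)


def spread_ratio(n_groups, n_noise, n_trials=20000, seed=0):
    """Ratio of the expected n_groups spread to the expected n_noise spread.

    Both quantities are computed on noise alone, so any ratio above 1
    is produced by the number of values compared, not by a real effect.
    """
    rng = random.Random(seed)
    return mean_spread(n_groups, n_trials, rng) / mean_spread(n_noise, n_trials, rng)


mismatched = spread_ratio(n_groups=6, n_noise=2)  # six-group spread vs two-run difference
matched = spread_ratio(n_groups=6, n_noise=6)     # arity-matched comparison
```

Under this Gaussian assumption the mismatched ratio comes out at roughly 2.2, in the same range as the ~2.4× reported for the pilot, while the matched ratio sits near 1. The pilot's figure comes from its own score distributions, so exact agreement is not expected.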
Metrics & computation:
- CFR: fraction of counterfactual pairs where the binary action flips across demographic perturbation.
- MASD: mean absolute difference in continuous/graded scores across counterfactual pairs.
- Δtool: disparity in tool-invocation behavior (newly instrumented here).
- Implementation: NumPy-only harness, BCa bootstrap CIs, paired McNemar and Wilcoxon tests, BH FDR control, and planted-bias property tests for sensitivity.
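
As an illustration of the CFR and MASD definitions above, the following standard-library sketch computes both over every within-set pair of demographic variants. Pairing all variants against each other is an assumption; the paper's harness may pair against a reference variant instead, and it uses NumPy. The name `cfr_and_masd` and the `(action, score)` tuple layout are hypothetical.

```python
from itertools import combinations


def cfr_and_masd(matched_sets):
    """Compute (counterfactual flip rate, mean absolute score difference).

    matched_sets: list of matched sets; each set is a list of
    (binary_action, graded_score) tuples, one per demographic variant
    of the same underlying profile.
    """
    flips = 0
    abs_diff_total = 0.0
    n_pairs = 0
    for variants in matched_sets:
        # Every unordered pair of variants within one matched set.
        for (action_a, score_a), (action_b, score_b) in combinations(variants, 2):
            flips += int(action_a != action_b)
            abs_diff_total += abs(score_a - score_b)
            n_pairs += 1
    if n_pairs == 0:
        raise ValueError("need at least one counterfactual pair")
    return flips / n_pairs, abs_diff_total / n_pairs


# Two toy matched sets with three name variants each (values invented for the example).
example_sets = [
    [(1, 80), (1, 78), (0, 70)],
    [(1, 60), (1, 60), (1, 62)],
]
cfr, masd = cfr_and_masd(example_sets)
```

In the toy data, two of the six pairs flip the binary action, so CFR is 1/3, and the absolute score differences (2, 10, 8, 0, 2, 2) average to a MASD of 4.0.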
Reproducibility & anti-gaming:
- Live leaderboard with a held-out private split, contamination canary, submission protocol, and content-hash checks.
- All artifacts (code, data, harness) released under open licenses.
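
The summary does not say how the content-hash checks are implemented. One plausible form, sketched here purely as an assumption, is a SHA-256 digest over a canonical JSON serialization of the profile set, so that a submission can be tied to the exact data it was run on. The function name `content_hash` and the profile fields are hypothetical.

```python
import hashlib
import json


def content_hash(profiles):
    """SHA-256 hex digest of a canonical JSON encoding of the profiles.

    Sorting keys and fixing separators makes the digest independent of
    dictionary key order and whitespace. Hypothetical helper, not the
    benchmark's actual protocol.
    """
    canonical = json.dumps(
        profiles, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Invented profile records, used only to exercise the function.
profiles_a = [{"id": 1, "domain": "hiring", "years_experience": 5}]
profiles_b = [{"years_experience": 5, "domain": "hiring", "id": 1}]  # same content, new key order
profiles_c = [{"id": 1, "domain": "hiring", "years_experience": 6}]  # one value changed
```

With this scheme, `content_hash(profiles_a)` equals `content_hash(profiles_b)`, while `profiles_c` produces a different digest, which is the property a tamper or drift check relies on.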

Implications for AI Economics

Measurement matters for allocation outcomes: economic allocations (jobs, credit, care) depend on agent actions, not just model utterances. Benchmarks that only measure token-level parity risk under-detecting systematic allocation differences that affect incomes, employment, credit costs, and health outcomes.
Scaffold design can affect distributional outcomes: the BCF and scaffold axis imply that adding deliberation, tool use, or multi-agent stages can amplify (or attenuate) disparities. This makes agent architecture (not only training data/model weights) an economic design choice with distributional consequences.
Low-cost, reproducible audits enable broader monitoring and compliance: single-digit-dollar per-model evaluation and an open harness lower the cost barrier for regulators, firms, and researchers to run audits on deployed systems — facilitating market-level oversight, compliance testing anchored to regulatory standards, and ex-ante checks in procurement.
Policy and regulation:
- Action-level audits align with existing regulatory criteria (EEOC, ECOA/Reg B, Local Law 144) and can inform compliance testing for automated decision tools.
- The arity-matched-null methodology reduces false positives in audit findings, improving the evidentiary quality of fairness claims used in enforcement or consumer-protection contexts.
Economic welfare and distributional analysis:
- Metrics like MASD (graded score shifts) and Δtool (differential tool invocation) capture harms that translate into monetary or welfare losses (e.g., higher APR tiers, fewer interview invites, delayed care escalation).
- Understanding where disparities are conducted (policy vs. memory vs. tools vs. scaffold) suggests targeted interventions with different costs and economic tradeoffs (e.g., altering prompts/scaffold vs. retraining models vs. changing tool repertoires).
Research and market directions:
- Audits should be scaled (larger n, more attributes, real-world profiles) before strong market claims; the pilot’s null is limited to one model and modest sample sizes.
- Incorporate richer counterfactuals and causal attribution (beyond name proxies) to connect observed disparities to underlying structural causes and estimate economic impacts.
- Firms should instrument tool-invocation and scaffold behavior in production monitoring to detect emerging allocation disparities as agents evolve.
Caveats:
- The paper uses synthetic name-based perturbations (correspondence audit), not full SCM-style causal interventions; real-world perceptions and correlated covariates might produce different effects.
- Pilot scope is limited (one model, n=12 per cell); absence of detected disparity is not proof of absence across models or at larger scale.

Suggested uses by economists and policy analysts

- Use AgentFairBench as an ex-ante compliance test for automated hiring/credit/triage agents.
- Incorporate action-level audit results into cost-benefit and distributional-impact assessments of deploying agentic systems.
- Use scaffold experiments to evaluate trade-offs between capability gains and potential amplification of disparities.
- Employ the arity-matched-null approach in empirical studies to avoid overstating group disparities when comparing multi-group spreads to simple two-run baselines.

Overall, AgentFairBench supplies a practical, statistically principled instrument to bring action-level fairness measurement into empirical economic and regulatory analysis of LLM-driven agents.

Assessment

Paper Type: quasi_experimental
Evidence Strength: medium. The benchmark implements a rigorous counterfactual testing pipeline and shows the instrument detects planted bias, but the empirical evidence comes from a modest pilot (864 decisions plus replication) on synthetic profiles and a limited set of models/agent scaffolds, so external validity to deployed, real-world settings and varied populations is limited.
Methods Rigor: high. The study uses principled counterfactual matching, multiple disparity metrics (flip rate, MASD, action-rate and tool-invocation disparity), bootstrap confidence intervals, paired tests, false-discovery-rate control, an arity-matched null to avoid statistic-driven overstatement, and a planted-bias check. These methods are appropriate and careful for the stated measurement goals, though they do not eliminate external/ecological validity concerns.
Sample: Synthetic, demographic-neutral profiles across three regulator-anchored domains (hiring, lending, medical triage) evaluated in counterfactual matched sets varying only name-coded race and gender; the pilot dataset comprises 864 decisions plus a test–retest replication, evaluated under four agent scaffolds (direct, chain-of-thought, multi-agent deliberation, tool-augmented), and includes a held-out private split, contamination canary, and a live leaderboard for external models (pilot results reported for Claude Haiku 4.5).
Themes: governance, inequality
Identification: Counterfactual matched-profile design. Synthetic, demographic-neutral applicant/borrower/patient profiles are created and only a name-coded race×gender signal is varied (Bertrand–Mullainathan style) to isolate demographic effects on agent actions; tests contrast matched pairs/sets across four agent scaffolds and use an arity-matched null (to control statistic arity-driven noise), bootstrap CIs, paired tests, FDR correction, and a planted-bias positive control to validate sensitivity.
Generalizability:
- Synthetic profiles may not capture real-world resume/borrower/patient complexity or correlated signals beyond names.
- Name-only demographic signaling omits multimodal and contextual cues present in deployments.
- Pilot covers a limited set of LLMs/agent scaffolds and a modest sample size.
- Domains tested (hiring, lending, triage) are regulator-anchored but not exhaustive of all decision contexts.
- Bench judgments in a lab harness may differ from behavior in live, adaptive systems or integrated decision pipelines.
- Cultural, linguistic, and regional variation in names and demographic signals is not fully represented.

Claims (13)

Claim	Topic	Direction	Outcome	Confidence & Evidence	Details
We introduce AgentFairBench, a cheap, reproducible, multi-domain benchmark for demographic disparity in the actions of LLM agents.	AI Safety and Ethics	positive	availability of a benchmark for measuring demographic disparity in LLM agent actions	Reading fidelity: high; study strength: medium	0.48
AgentFairBench is grounded in a companion framework, the Bias Conduction Framework (BCF), restated here.	AI Safety and Ethics	positive	use of BCF as conceptual framework for instrument	Reading fidelity: high; study strength: low	0.24
AgentFairBench spans three regulator-anchored domains: hiring, lending, and medical triage.	AI Safety and Ethics	positive	coverage of domains in the benchmark	Reading fidelity: high; study strength: medium	0.48
Synthetic, demographic-neutral profiles are evaluated in counterfactual matched sets that vary only a name-coded race×gender signal (in the Bertrand–Mullainathan tradition).	AI Safety and Ethics	positive	use of counterfactual matched sets that vary only by name-coded race and gender	Reading fidelity: high; study strength: medium	0.48
The benchmark evaluates agents under four agent scaffolds of increasing agency: direct, chain-of-thought, multi-agent deliberation, and tool-augmented.	AI Safety and Ethics	positive	variation of agent scaffolds used in evaluation	Reading fidelity: high; study strength: medium	0.48
A NumPy-only harness computes counterfactual flip rate, mean absolute score difference (MASD), action-rate disparity, and tool-invocation disparity, with bootstrap confidence intervals, paired tests, and false-discovery-rate control, for single-digit dollars per model.	AI Safety and Ethics	positive	computed disparity metrics and per-model monetary cost to run the harness	Reading fidelity: medium; study strength: medium	single-digit dollars per model; 0.29
A live leaderboard with a held-out private split and a contamination canary admits external models by submission.	Adoption Rate	positive	availability of leaderboard and submission infrastructure	Reading fidelity: high; study strength: medium	0.48
Our pilot comprises 864 decisions plus a test-retest replication.	Other	neutral	number of decisions in pilot	Reading fidelity: high; study strength: high	n=864; 0.8
Comparing a six-group score spread against a two-run noise difference overstates disparity by approximately 2.4× through statistic arity alone.	AI Safety and Ethics	negative	degree of overstatement of disparity when using higher-arity statistic versus two-run noise baseline	Reading fidelity: high; study strength: medium	n=864; ~2.4×; 0.48
Against an arity-matched noise floor and an omnibus group test, Claude Haiku 4.5 shows no demographic effect above sampling noise (0 of 120 pairwise and 0 of 9 omnibus contrasts survive correction).	AI Safety and Ethics	null_result	demographic disparities in actions (pairwise contrasts and omnibus contrasts)	Reading fidelity: high; study strength: medium	n=864; 0 of 120 pairwise and 0 of 9 omnibus contrasts survive correction; 0.48
A planted-bias test confirms the instrument detects disparity when present.	AI Safety and Ethics	positive	ability of instrument to detect intentionally planted bias	Reading fidelity: high; study strength: medium	0.48
The paper contributes a sound, sensitive, adoption-ready instrument, the arity-matched null methodology, and open artifacts to scale it.	AI Safety and Ethics	positive	availability and readiness of instrument and methodology	Reading fidelity: high; study strength: low	0.24
Code, data, and harness are released under open licenses, with an anonymized review artifact.	Adoption Rate	positive	availability of open-source artifacts and anonymized review artifact	Reading fidelity: high; study strength: high	0.8

A low-cost benchmark shows no detectable race/gender disparity in Claude Haiku 4.5’s agent actions in a pilot across hiring, lending and triage, and introduces a sensitive, reproducible toolkit (AgentFairBench) for uncovering action-level bias when present.