An equity-focused RL intake router for NYC building complaints can prioritize inspections and narrow historical service gaps by learning to escalate based on recurrence and neighborhood signals rather than raw volume; the approach shows promise in offline evaluation but lacks field-randomized validation of real-world impact.

Scaling the Queue: Reinforcement Learning for Equitable Call Classification Capacity in NYC Municipal Complaint Systems

Irene Aldridge, Ellie Bae, Siddhesh Darak, Nicholas Donat, Akhil Fernando-Bell, Bella Ge, Nicholas Goguen-Compagnoni, Ishita Gupta, Ali Hasan, Pierce Hoenigman, Imran Isa-Dutse, Jiwon Jeong, Tishya Khanna, Neha Konduru, Yixuan Liu, Kai Maeda, Nolan McKenna, Karl Muller, Farzaan Naeem, Rishabh Patel, Zachary Sheldon, Ammar Syed, Nathan Tai, Michael Twersky, Haoying Wang, Zening Wang, Zexun Yao, Nadav Yochman · May 07, 2026

Source: arXiv · Paper type: other · Evidence strength: low · Relevance: 7/10

The paper develops an equity-aware reinforcement learning intake router for six NYC building-complaint domains that prioritizes throughput and reduces historical service gaps, and finds via SHAP that complaint recurrence and neighborhood-level features predict actionable violations more strongly than raw complaint volume.

Municipal 311 call centers and complaint intake systems face a structural mismatch between incoming volume and classification capacity. The staff and heuristics available to triage, route, and prioritize complaints cannot scale with demand. This bottleneck produces differential service quality that follows income and racial lines (Liu et al., 2024). We develop an equity-centered reinforcement learning (RL) framework that augments call classification capacity across six New York City Department of Buildings (DOB) operational domains: boiler safety, crane and derrick oversight, heat and hot water complaints, housing complaint triage, scaffold safety, and Special Natural Area District (SNAD) protection. Rather than replacing human classifiers, our agents act as intelligent intake routers that learn to assign each incoming complaint to an action category: escalate, batch, defer, or inspect now. The framework is designed to maximize throughput, minimize misclassification cost, and actively narrow historical equity gaps in service delivery. We formalize each domain as a Markov Decision Process (MDP) in which equitable classification coverage is a first-class reward objective. Post-hoc SHAP attribution reveals that complaint recurrence and neighborhood-level statistics are stronger predictors of actionable violations than raw complaint volume. This finding has direct implications for complaint routing given the demographic correlates of those features.

Summary

Main Finding

An equity-centered reinforcement learning (RL) intake router can meaningfully expand complaint classification capacity across six NYC Department of Buildings (DOB) domains while actively reducing historical classification disparities, provided equity is encoded in the reward. Learned policies raise throughput and lower misclassification rates relative to rule-based heuristics, and the incremental operational cost of adding an explicit equity objective is modest: speed penalties on the Pareto frontier are roughly 4–7%, a range reported as consistent with related SLA work. Importantly, post-hoc SHAP attribution shows that recurrence and neighborhood-level features are stronger predictors of actionable violations than raw complaint volume, which implies that routing on volume alone risks reproducing reporting biases.

Key Points

Problem: Municipal 311 intake is a capacity-constrained classification system whose outputs (route, escalate, defer, batch) systematically under-serve lower-income and racially minoritized neighborhoods. Complaint volume reflects who reports as well as where hazards occur, so reports and recorded outcomes are missing not at random (measurement injustice).
Approach: Formalize six DOB complaint domains (boiler safety, crane/derrick, heat & hot water, housing triage, scaffold safety, SNAD) as Markov Decision Processes (MDPs) and train RL agents to act as intelligent intake routers that assign action categories per complaint.
Equity as first-class objective: Design a multi-objective reward that explicitly includes an equitable coverage (coverage parity) term that normalizes escalation throughput by complaint volume and/or corrected reporting propensity.
Performance: RL agents achieve higher throughput and lower misclassification rates than heuristic intake policies across multiple domains while enabling control over the equity–efficiency trade-off via reward weighting. The cost of explicit equity objectives is small on the Pareto frontier.
Interpretability & audit: SHAP is used to audit the learned policies. Complaint recurrence and neighborhood-level statistics (e.g., complaint frequency, recurrence flags, ACS covariates) are more predictive of actionable violations than raw volume. These features also correlate with demographic composition and thus merit scrutiny as potential proxies.
Practical recommendations: Preprocess to correct for under-reporting (duplicate-report-based corrections), keep human-in-the-loop oversight, run ongoing demographic audits, and design rewards with participation from affected communities.
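The equity–efficiency trade-off described above can be illustrated with a small Pareto-frontier filter. This is a minimal sketch, not the paper's procedure: `PolicyResult`, `pareto_frontier`, the equity-weight sweep, and every number in `sweep` are hypothetical and made up for illustration. It assumes each trained policy is summarized by a throughput score (higher is better) and a coverage-parity gap (lower is better).

```python
from typing import NamedTuple


class PolicyResult(NamedTuple):
    equity_weight: float  # hypothetical reward weight on the parity term
    throughput: float     # higher is better
    parity_gap: float     # lower is better


def pareto_frontier(results):
    """Keep results that no other result beats on both objectives."""
    frontier = []
    for r in results:
        dominated = any(
            o.throughput >= r.throughput
            and o.parity_gap <= r.parity_gap
            and (o.throughput > r.throughput or o.parity_gap < r.parity_gap)
            for o in results
        )
        if not dominated:
            frontier.append(r)
    return sorted(frontier, key=lambda r: r.equity_weight)


# Made-up evaluation results for five equity weights (illustration only,
# not figures from the paper).
sweep = [
    PolicyResult(0.00, 100.0, 0.30),
    PolicyResult(0.25, 98.0, 0.20),
    PolicyResult(0.50, 95.0, 0.12),
    PolicyResult(0.75, 94.0, 0.15),  # worse than weight 0.50 on both axes
    PolicyResult(1.00, 90.0, 0.10),
]

frontier = pareto_frontier(sweep)
for r in frontier:
    print(r)
```

Only the non-dominated policies survive the filter, so a decision-maker choosing a reward weight is choosing a point on this curve rather than a single "best" policy.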

Data & Methods

Domains & datasets:
- Boiler Safety: NYC DOB Safety Boiler dataset, 834,338 records; 7 engineered features; binary actions (Defer / Inspect); 80/20 split (~667K training records).
- Crane & Derrick: state features include recurrence, neighborhood frequency, backlog, hazard level; actions include Do Nothing / Routine / Immediate / Stop-Work.
- Heat & Hot Water, Housing Triage, Scaffold Safety, SNAD, Elevator complaints: domain-specific feature engineering; temporal splits for training/validation/backtest where applicable (e.g., scaffold: 2020–2024 train, 2024–2025 validation, 2025–Mar 2026 backtest).
- ACS augmentation: census-tract median income quintiles, racial/ethnic composition, language access, educational attainment, renter share used to proxy reporting propensity and for demographic audits.
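As a rough illustration of how tract-level income quintiles might feed a demographic audit, the sketch below bins tracts by median income and disaggregates escalation decisions by quintile. The function names, the rank-based binning rule, and the input layout are assumptions made for this example; the summary does not specify how the authors construct quintiles.

```python
from collections import defaultdict


def income_quintiles(median_income_by_tract):
    """Map each tract to a quintile, 1 (lowest income) to 5 (highest), by rank."""
    ranked = sorted(median_income_by_tract, key=median_income_by_tract.get)
    n = len(ranked)
    return {tract: (i * 5) // n + 1 for i, tract in enumerate(ranked)}


def escalation_rate_by_quintile(decisions, quintile_of):
    """decisions: iterable of (tract_id, escalated) pairs, escalated is a bool."""
    counts = defaultdict(lambda: [0, 0])  # quintile -> [escalated, total]
    for tract, escalated in decisions:
        q = quintile_of[tract]
        counts[q][0] += int(escalated)
        counts[q][1] += 1
    return {q: esc / tot for q, (esc, tot) in sorted(counts.items())}


# Hypothetical tracts t0..t9 with incomes of 10k..100k.
incomes = {f"t{i}": 10_000 * (i + 1) for i in range(10)}
quintile_of = income_quintiles(incomes)
decisions = [("t0", True), ("t1", False), ("t8", True), ("t9", True)]
rates = escalation_rate_by_quintile(decisions, quintile_of)
print(rates)
```

A large spread between the lowest and highest quintile in such a table is the kind of signal an ongoing audit would flag for review.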
RL formalization:
- Each domain modeled as an MDP with state vectors capturing complaint attributes, recurrence, neighborhood-level statistics, and queue/backlog context.
- Action space corresponds to intake decisions (escalate/inspect now, batch, defer, ignore, etc.).
- Multi-objective scalarized reward: components include throughput (timely correct escalations), misclassification cost, operational cost, and an equitable coverage parity term (Equation 9 in paper) that normalizes escalation by complaint volume and/or corrected reporting propensity.
- Algorithms referenced: policy-gradient methods (REINFORCE family) and DQN-style value methods applied depending on domain; variance-reduction/baselines and multi-objective optimization techniques used to trace Pareto frontiers.
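The scalarized reward described in this list can be sketched as follows. Equation 9 is not reproduced in this summary, so the functional form here is an assumption: a linear combination of the four components, with the parity term taken as the gap between the best- and worst-covered groups' escalation rates after normalizing by complaint volume divided by an estimated reporting propensity. All names (`RewardWeights`, `coverage_rates`, `parity_gap`, `scalarized_reward`) and the default weights are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    throughput: float = 1.0
    misclassification: float = 1.0
    operational: float = 0.1
    equity: float = 0.5


def coverage_rates(escalations, complaints, reporting_propensity=None):
    """Escalations per estimated incident, for each group.

    Dividing complaints by a reporting propensity in (0, 1] scales up the
    estimated incident count for groups that under-report.
    """
    rates = {}
    for group, n_complaints in complaints.items():
        p = 1.0 if reporting_propensity is None else reporting_propensity[group]
        est_incidents = n_complaints / p
        rates[group] = (
            escalations.get(group, 0) / est_incidents if est_incidents > 0 else 0.0
        )
    return rates


def parity_gap(rates):
    """Assumed parity measure: best-covered minus worst-covered group."""
    return max(rates.values()) - min(rates.values())


def scalarized_reward(correct_escalations, misclass_cost, op_cost, rates,
                      w=RewardWeights()):
    """Linear scalarization; costs and the parity gap enter with negative sign."""
    return (
        w.throughput * correct_escalations
        - w.misclassification * misclass_cost
        - w.operational * op_cost
        - w.equity * parity_gap(rates)
    )


# Hypothetical groups A and B; B is assumed to report half as often.
escalations = {"A": 30, "B": 10}
complaints = {"A": 100, "B": 50}
raw = coverage_rates(escalations, complaints)
corrected = coverage_rates(escalations, complaints, {"A": 1.0, "B": 0.5})
reward = scalarized_reward(40, 5.0, 10.0, raw)
print(raw, corrected, reward)
```

In this toy case the propensity correction widens the measured gap (from 0.1 to 0.2), because group B's complaints stand in for twice as many underlying incidents. That is the reason the choice of normalizer matters for the equity term.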
Interpretability & audit:
- SHAP used post-hoc to attribute feature importance for escalations and actionable outcomes; demographic audits disaggregate policy outcomes by census-tract income quintile and racial composition.
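To make the attribution step concrete, the sketch below computes exact Shapley values for a toy scoring function by enumerating feature coalitions. It does not use the SHAP library, which approximates these quantities against a background dataset; here, features outside a coalition are simply set to a single baseline value. The model `toy_score`, its coefficients, and the feature names are hypothetical and are not the paper's model.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley attribution of model(x) - model(baseline).

    Features outside a coalition take their baseline value. The cost is
    exponential in the number of features, so this suits toy inputs only.
    """
    names = list(x)
    n = len(names)

    def value(coalition):
        z = {k: (x[k] if k in coalition else baseline[k]) for k in names}
        return model(z)

    phi = {}
    for i in names:
        others = [k for k in names if k != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value(s | {i}) - value(s))
        phi[i] = total
    return phi


def toy_score(z):
    # Hypothetical violation score with one interaction term.
    return (
        2.0 * z["recurrence"]
        + 1.0 * z["neighborhood_rate"]
        + 0.1 * z["raw_volume"]
        + 0.5 * z["recurrence"] * z["neighborhood_rate"]
    )


x = {"recurrence": 1.0, "neighborhood_rate": 1.0, "raw_volume": 1.0}
baseline = {"recurrence": 0.0, "neighborhood_rate": 0.0, "raw_volume": 0.0}
phi = shapley_values(toy_score, x, baseline)
print(phi)
```

The attributions sum to the difference between the score at `x` and at the baseline, and the interaction term is split evenly between the two features involved.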
Important methodological caveats:
- Label and outcome data are missing not at random (fewer recorded outcomes in underserved areas).
- Complaint volume is a biased signal; the paper recommends upstream correction (duplicate-report estimators [Liu et al.]) before reward normalization.
- Temporal gaps and reporting heterogeneity introduce state-estimation uncertainty.
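The intuition behind a duplicate-report correction can be sketched under a simple assumption: while an incident is open, reports arrive as a Poisson process, so the reports that follow the first one reveal the reporting rate. This is a simplified illustration in the spirit of duplicate-report estimators, not the estimator of Liu et al. or the paper's preprocessing step. The function names, the 10-day horizon, and the data are hypothetical.

```python
import math


def reporting_rate(incidents):
    """Estimate reports per day from duplicates.

    incidents: iterable of (n_duplicates, exposure_days), where duplicates are
    reports received after the first report and exposure_days is how long the
    incident stayed observable after that first report.
    """
    incidents = list(incidents)
    duplicates = sum(d for d, _ in incidents)
    exposure = sum(e for _, e in incidents)
    if exposure <= 0:
        raise ValueError("total exposure must be positive")
    return duplicates / exposure


def reporting_propensity(rate, horizon_days):
    """Probability of at least one report within the horizon (Poisson assumption)."""
    return 1.0 - math.exp(-rate * horizon_days)


def corrected_incident_count(observed_incidents, propensity):
    """Scale observed incidents up by the estimated chance of being reported."""
    return observed_incidents / propensity


# Made-up data for two neighborhoods (illustration only).
rate_a = reporting_rate([(4, 10), (6, 10)])  # 10 duplicates over 20 days
rate_b = reporting_rate([(1, 10), (1, 10)])  # 2 duplicates over 20 days
prop_a = reporting_propensity(rate_a, horizon_days=10)
prop_b = reporting_propensity(rate_b, horizon_days=10)
print(rate_a, rate_b, prop_a, prop_b)
print(corrected_incident_count(50, prop_b))
```

Under these assumptions, a neighborhood with fewer duplicates per open incident gets a lower propensity, so its observed complaints are scaled up more before any volume-based normalization.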

Implications for AI Economics

Measurement injustice matters for resource allocation: When data-generating processes are correlated with socioeconomic status (reporting propensity, language access, trust), naively optimized classifiers will allocate scarce public enforcement resources in ways that mirror pre-existing inequalities. Economically efficient policies that ignore measurement bias can still produce inequitable distributions of public goods (inspections, enforcement).
Small economic trade-offs can buy large equity gains: The paper reports a modest efficiency cost for equity (a Pareto-frontier speed penalty of roughly 4–7%). From a public-economics perspective, that small reduction in throughput may be a welfare-improving social choice once welfare gains to historically underserved populations and reduced externalities (health, displacement, worker safety) are counted. It is not a Pareto improvement in the strict sense, since some throughput is given up.
Feature choice and proxy risk: Neighborhood-level recurrence and complaint-frequency features improve predictive power but are correlated with demographics; these act as proxies and create potential disparate impacts. Economic models of algorithmic allocation should incorporate the cost of proxy-driven discrimination and the governance overhead of monitoring and correcting proxy use.
Feedback loops and dynamic externalities: Routing policies that prioritize neighborhoods differently will change future reporting behavior and incident realizations (inspectors find violations, enforcement changes hazard incidence, citizens’ trust shifts). Economic evaluation of RL deployment must therefore model dynamic feedback and the value of information (and disinformation) over time.
Governance and institutional design: Practical deployment requires aligning reward engineering with public values—participatory reward design, ongoing demographic audits, and human-in-the-loop mechanisms. From a public-finance viewpoint, investments in preprocessing (under-reporting correction) and auditing are complementary public goods that reduce the social cost of automated allocation.
Policy recommendations with economic rationale:
- Correct for heterogeneous reporting rates before normalizing equity rewards (reduces measurement bias and improves allocation targeting).
- Make equity an explicit objective in scalarized reward functions; the marginal cost is small and social returns (reduced harm, improved fairness) can be large.
- Use interpretability tools (SHAP) to identify features that act as demographic proxies and to quantify trade-offs—this reduces informational asymmetries in regulator oversight.
- Design monitoring institutions for feedback effects: continuous evaluation prevents the algorithm from entrenching adverse equilibria in complaint filing and enforcement.
Broader economic research directions:
- Quantify the welfare implications (distributional and aggregate) of RL-based intake versus status quo heuristics using counterfactual welfare analysis and dynamic models of reporting behavior.
- Study optimal investment trade-offs between classifier capacity, inspection capacity, and community outreach that affects reporting propensity.
- Explore mechanism design for participatory reward-setting where communities can express preferences over trade-offs between speed, cost, and equity.

Limitations noted by the authors: label missingness and non-random outcomes, the need for upstream reporting-rate corrections, sensitivity of results to normative choices in reward weighting, and the necessity of field trials and sustained governance before large-scale deployment.

Assessment

Paper Type: other
Evidence Strength: low. The paper proposes and evaluates an equity-centered RL routing system using historical DOB complaint data and model explainability (SHAP), but it does not present a causal identification strategy, randomized or quasi-experimental deployment, or field validation showing that the agent causally improves service equity or outcomes in practice; results are based on offline modeling and post-hoc attributions, which are vulnerable to selection bias and label bias.
Methods Rigor: medium. The authors formalize each operational domain as an MDP, train RL agents with an explicit equity-aware reward, and use SHAP to interpret predictors, which is methodologically solid for an applied ML paper; however, rigor is limited by lack of out-of-sample field deployment, no randomized or natural-experiment validation, potential sensitivity to reward specification and distributional shift, and likely dependence on administrative label quality.
Sample: Administrative complaint intake records from the New York City Department of Buildings, covering six operational domains (boiler safety, crane and derrick oversight, heat and hot water complaints, housing complaint triage, scaffold safety, and Natural Area District protection); features include complaint metadata, recurrence history, neighborhood-level statistics/demographics, and downstream labels indicating whether violations or actionable inspections occurred (timeframe and sample sizes not specified in the summary).
Themes: human_ai_collab, inequality, productivity, adoption
Generalizability:
- Single-city (NYC) administrative data may not generalize to other cities or regulatory contexts.
- Results are specific to DOB complaint types and workflows; other municipal services may have different label quality and operational constraints.
- Dependence on historical complaint labels and neighborhood statistics introduces risk of propagating past biases.
- Performance may degrade under distributional shift or when human operators change behavior in response to the agent.
- Success depends on human-in-the-loop adoption, which varies across agencies and unions.

Claims (7)

Claim	Category	Direction	Confidence	Outcome	Details
Municipal 311 call centers and complaint intake systems face a structural mismatch between incoming volume and classification capacity that produces a bottleneck and differential service quality that follows income and racial lines.	Inequality	negative	high	differential service quality by income and race	0.12
We develop an equity-centered reinforcement learning (RL) framework that augments call classification capacity across six New York City Department of Buildings operational domains (boiler safety, crane and derrick oversight, heat and hot water, housing complaint triage, scaffold safety, and Natural Area District protection).	Task Allocation	positive	high	call classification capacity / intake routing capability	0.12
Rather than replacing human classifiers, our agents act as intelligent intake routers that learn to assign incoming complaints to action categories: escalate, batch, defer, inspect now.	Task Allocation	positive	high	complaint routing action assignment	0.06
The proposed technique is designed to maximize throughput, minimize misclassification cost, and actively narrow historical equity gaps in service delivery.	Organizational Efficiency	positive	high	throughput; misclassification cost; historical equity gaps in service delivery	0.02
We formalize each domain as a Markov Decision Process (MDP) in which equitable classification coverage is a first-class reward objective.	Task Allocation	positive	high	equitable classification coverage (as a modeled reward)	0.12
Post-hoc SHAP attribution reveals that complaint recurrence and neighborhood-level statistics are stronger predictors of actionable violations than raw complaint volume.	Decision Quality	positive	high	predictive importance for actionable violations (feature importance)	0.12
The finding that recurrence and neighborhood statistics are stronger predictors than complaint volume has direct implications for complaint routing given the demographic correlates of those features.	Task Allocation	mixed	high	implications for complaint routing policy/practice	0.02