An equity-focused RL intake router for NYC building complaints can prioritize inspections and narrow historical service gaps by learning to escalate based on recurrence and neighborhood signals rather than raw volume; the approach shows promise in offline evaluation but lacks field-randomized validation of real-world impact.
Municipal 311 call centers and complaint intake systems face a structural mismatch between incoming volume and classification capacity. The staff and heuristics available to triage, route, and prioritize complaints cannot scale with demand. This bottleneck produces differential service quality that follows income and racial lines (\cite{liu2024sla}). We develop an equity-centered reinforcement learning (RL) framework that augments call classification capacity across six New York City Department of Buildings (DOB) operational domains: boiler safety, crane and derrick oversight, heat and hot water complaints, housing complaint triage, scaffold safety, and Special Natural Area District (SNAD) protection. Rather than replacing human classifiers, our agents act as intelligent intake routers that learn to assign each incoming complaint to one of four action categories: escalate, batch, defer, or inspect now. The proposed technique is designed to maximize throughput, minimize misclassification cost, and actively narrow historical equity gaps in service delivery. We formalize each domain as a Markov Decision Process (MDP) in which equitable classification coverage is a first-class reward objective. Post-hoc SHAP attribution reveals that complaint recurrence and neighborhood-level statistics are stronger predictors of actionable violations than raw complaint volume. This finding has direct implications for complaint routing given the demographic correlates of those features.
Summary
Main Finding
An equity-centered reinforcement learning (RL) intake router can meaningfully expand complaint classification capacity across six NYC Department of Buildings (DOB) domains while actively reducing historical classification disparities, provided equity is encoded in the reward. Learned policies raise throughput and lower misclassification rates relative to rule-based heuristics, and the incremental operational cost of adding an explicit equity objective is modest: speed penalties on the Pareto frontier are roughly 4–7%, consistent with related SLA work. Importantly, post-hoc SHAP attribution shows that recurrence and neighborhood-level features are stronger predictors of actionable violations than raw complaint volume, which implies that routing on volume alone risks reproducing reporting biases.
Key Points
- Problem: Municipal 311 intake is a capacity-constrained classification system whose outputs (route, escalate, defer, batch) systematically under-serve lower-income and racially minoritized neighborhoods because complaint volumes are heterogeneously generated (measurement injustice / missing-not-at-random).
- Approach: Formalize six DOB complaint domains (boiler safety, crane/derrick, heat & hot water, housing triage, scaffold safety, SNAD) as Markov Decision Processes (MDPs) and train RL agents to act as intelligent intake routers that assign action categories per complaint.
- Equity as first-class objective: Design a multi-objective reward that explicitly includes an equitable coverage (coverage parity) term that normalizes escalation throughput by complaint volume and/or corrected reporting propensity.
- Performance: RL agents achieve higher throughput and lower misclassification rates than heuristic intake policies across multiple domains while enabling control over the equity–efficiency trade-off via reward weighting. The cost of explicit equity objectives is small on the Pareto frontier.
- Interpretability & audit: Use SHAP to audit learned policies. Key finding: complaint recurrence and neighborhood-level statistics (e.g., complaint frequency, recurrence flags, ACS covariates) are more predictive of actionable violations than raw volume. These features also correlate with demographic composition and thus merit scrutiny as potential proxies.
- Practical recommendations: correct for under-reporting during preprocessing (duplicate-report-based corrections), keep human-in-the-loop oversight, run ongoing demographic audits, and design rewards together with affected communities.
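To make the "equity as first-class objective" point concrete, here is a minimal sketch of how a coverage-parity term might enter a scalarized reward. It assumes a simple linear combination of throughput, misclassification cost, operational cost, and a parity gap. The paper's actual reward equation is not reproduced in this summary, so every name below (`RewardWeights`, `coverage_rates`, `coverage_gap`, `scalarized_reward`), the default weights, the max-minus-min gap definition, and all numbers are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights; the paper's values are not given in this summary.
    throughput: float = 1.0
    misclassification: float = 1.0
    operational: float = 0.1
    equity: float = 0.5


def coverage_rates(escalations, complaints, reporting_propensity=None):
    """Escalations per complaint for each group.

    If reporting_propensity is given, complaint counts are divided by it
    first, so groups that under-report are treated as having more
    underlying complaints (an assumed form of upstream correction).
    """
    rates = {}
    for group, n_complaints in complaints.items():
        propensity = 1.0 if reporting_propensity is None else reporting_propensity[group]
        adjusted = n_complaints / propensity
        rates[group] = escalations.get(group, 0) / adjusted if adjusted > 0 else 0.0
    return rates


def coverage_gap(rates):
    """Assumed parity measure: spread between best- and worst-covered group."""
    return max(rates.values()) - min(rates.values())


def scalarized_reward(throughput, misclassification_cost, operational_cost,
                      gap, w=RewardWeights()):
    """Linear scalarization: reward throughput, penalize costs and the gap."""
    return (w.throughput * throughput
            - w.misclassification * misclassification_cost
            - w.operational * operational_cost
            - w.equity * gap)


# Made-up counts for two income quintiles.
escalations = {"Q1": 10, "Q5": 30}
complaints = {"Q1": 50, "Q5": 100}
propensity = {"Q1": 0.5, "Q5": 1.0}  # assume Q1 reports half as often

raw_gap = coverage_gap(coverage_rates(escalations, complaints))
corrected_gap = coverage_gap(coverage_rates(escalations, complaints, propensity))
reward = scalarized_reward(10.0, 2.0, 5.0, corrected_gap)
```

In this toy case the gap roughly doubles once under-reporting is corrected, which illustrates why the normalization step matters: a policy that looks nearly equitable against raw complaint counts can look much less so against corrected ones.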
Data & Methods
- Domains & datasets:
  - Boiler Safety: NYC DOB Safety Boiler dataset, ~834,338 records; 7 engineered features; binary actions (Defer / Inspect); 80/20 split (~667K train).
  - Crane & Derrick: state features include recurrence, neighborhood frequency, backlog, hazard level; actions include Do Nothing / Routine / Immediate / Stop-Work.
  - Heat & Hot Water, Housing Triage, Scaffold Safety, SNAD, Elevator complaints: domain-specific feature engineering; temporal splits for training/validation/backtest where applicable (e.g., scaffold: 2020–2024 train, 2024–2025 validation, 2025–Mar 2026 backtest).
  - ACS augmentation: census-tract median income quintiles, racial/ethnic composition, language access, educational attainment, renter share used to proxy reporting propensity and for demographic audits.
- RL formalization:
  - Each domain modeled as an MDP with state vectors capturing complaint attributes, recurrence, neighborhood-level statistics, and queue/backlog context.
  - Action space corresponds to intake decisions (escalate/inspect now, batch, defer, ignore, etc.).
  - Multi-objective scalarized reward: components include throughput (timely correct escalations), misclassification cost, operational cost, and an equitable coverage parity term (Equation 9 in paper) that normalizes escalation by complaint volume and/or corrected reporting propensity.
  - Algorithms referenced: policy-gradient methods (REINFORCE family) and DQN-style value methods applied depending on domain; variance-reduction/baselines and multi-objective optimization techniques used to trace Pareto frontiers.
- Interpretability & audit:
  - SHAP used post-hoc to attribute feature importance for escalations and actionable outcomes; demographic audits disaggregate policy outcomes by census-tract income quintile and racial composition.
- Important methodological caveats:
  - Label and outcome data are missing not at random (fewer recorded outcomes in underserved areas).
  - Complaint volume is a biased signal; the paper recommends upstream correction (duplicate-report estimators [Liu et al.]) before reward normalization.
  - Temporal gaps and reporting heterogeneity introduce state-estimation uncertainty.
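The demographic audit described under "Interpretability & audit" can be sketched as a simple disaggregation of routing decisions by group. This is only an illustration of the idea, not the authors' audit code: the function names, the `Q1`/`Q5` quintile labels, the min/max disparity ratio, and the toy decisions are all assumptions made here.

```python
from collections import defaultdict


def escalation_rate_by_group(decisions):
    """Share of complaints escalated within each group.

    decisions: iterable of (group_label, was_escalated) pairs, where the
    group label could be a census-tract income quintile.
    """
    totals = defaultdict(int)
    escalated = defaultdict(int)
    for group, was_escalated in decisions:
        totals[group] += 1
        escalated[group] += int(was_escalated)
    return {group: escalated[group] / totals[group] for group in totals}


def disparity_ratio(rates):
    """Lowest group rate divided by the highest; 1.0 means parity."""
    highest = max(rates.values())
    return min(rates.values()) / highest if highest > 0 else 1.0


# Toy decisions: lowest (Q1) and highest (Q5) income quintiles.
decisions = [
    ("Q1", True), ("Q1", False), ("Q1", False), ("Q1", False),
    ("Q5", True), ("Q5", True), ("Q5", False), ("Q5", False),
]
rates = escalation_rate_by_group(decisions)
ratio = disparity_ratio(rates)
```

Because recorded outcomes are missing not at random, an audit like this is best read alongside the reporting-rate corrections noted in the caveats: raw escalation rates per complaint can understate disparities in areas where fewer complaints are filed.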
Implications for AI Economics
- Measurement injustice matters for resource allocation: When data-generating processes are correlated with socioeconomic status (reporting propensity, language access, trust), naively optimized classifiers will allocate scarce public enforcement resources in ways that mirror pre-existing inequalities. Economically efficient policies that ignore measurement bias can still produce inequitable distributions of public goods (inspections, enforcement).
- Small economic trade-offs can buy large equity gains: The paper quantifies a modest efficiency cost for equity (Pareto frontier speed penalty ≈4–7%). From a public-economics perspective, that small reduction in throughput may be a welfare-improving social choice once welfare gains to historically underserved populations and reduced externalities (health, displacement, worker safety) are counted.
- Feature choice and proxy risk: Neighborhood-level recurrence and complaint-frequency features improve predictive power but are correlated with demographics; these act as proxies and create potential disparate impacts. Economic models of algorithmic allocation should incorporate the cost of proxy-driven discrimination and the governance overhead of monitoring and correcting proxy use.
- Feedback loops and dynamic externalities: Routing policies that prioritize neighborhoods differently will change future reporting behavior and incident realizations (inspectors find violations, enforcement changes hazard incidence, citizens’ trust shifts). Economic evaluation of RL deployment must therefore model dynamic feedback and the value of information (and disinformation) over time.
- Governance and institutional design: Practical deployment requires aligning reward engineering with public values—participatory reward design, ongoing demographic audits, and human-in-the-loop mechanisms. From a public-finance viewpoint, investments in preprocessing (under-reporting correction) and auditing are complementary public goods that reduce the social cost of automated allocation.
- Policy recommendations with economic rationale:
  - Correct for heterogeneous reporting rates before normalizing equity rewards (reduces measurement bias and improves allocation targeting).
  - Make equity an explicit objective in scalarized reward functions; the marginal cost is small and social returns (reduced harm, improved fairness) can be large.
  - Use interpretability tools (SHAP) to identify features that act as demographic proxies and to quantify trade-offs; this reduces informational asymmetries in regulator oversight.
  - Design monitoring institutions for feedback effects: continuous evaluation prevents the algorithm from entrenching adverse equilibria in complaint filing and enforcement.
- Broader economic research directions:
  - Quantify the welfare implications (distributional and aggregate) of RL-based intake versus status quo heuristics using counterfactual welfare analysis and dynamic models of reporting behavior.
  - Study optimal investment trade-offs between classifier capacity, inspection capacity, and community outreach that affects reporting propensity.
  - Explore mechanism design for participatory reward-setting where communities can express preferences over trade-offs between speed, cost, and equity.
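The equity–efficiency trade-off discussed above can be illustrated with a small Pareto-frontier calculation. This is a sketch under assumptions: `PolicyResult`, the equity weights, and the throughput and coverage-gap figures are invented for illustration (the throughput drop was chosen to fall inside the roughly 4–7% range quoted above) and are not results from the paper.

```python
from typing import NamedTuple


class PolicyResult(NamedTuple):
    equity_weight: float  # hypothetical weight on the equity term
    throughput: float     # higher is better
    coverage_gap: float   # lower is better


def pareto_frontier(results):
    """Keep results that no other result beats on both objectives."""
    frontier = []
    for r in results:
        dominated = any(
            o.throughput >= r.throughput
            and o.coverage_gap <= r.coverage_gap
            and (o.throughput > r.throughput or o.coverage_gap < r.coverage_gap)
            for o in results
        )
        if not dominated:
            frontier.append(r)
    return sorted(frontier, key=lambda r: r.equity_weight)


# Made-up evaluation results for four reward weightings.
results = [
    PolicyResult(0.00, 100.0, 0.30),
    PolicyResult(0.25, 97.0, 0.18),
    PolicyResult(0.50, 95.0, 0.10),
    PolicyResult(0.75, 94.0, 0.12),  # slower AND less equitable than 0.50
]
frontier = pareto_frontier(results)

# Relative speed penalty of the most equitable frontier policy
# versus the efficiency-only policy.
baseline, most_equitable = frontier[0], frontier[-1]
speed_penalty = 1.0 - most_equitable.throughput / baseline.throughput
```

With these toy numbers the policy at weight 0.75 drops off the frontier, and moving from the efficiency-only policy to the most equitable frontier policy costs about 5% of throughput while cutting the coverage gap from 0.30 to 0.10.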
Limitations noted by the authors: label missingness and non-random outcomes, the need for upstream reporting-rate corrections, sensitivity of results to normative choices in reward weighting, and the necessity of field trials and sustained governance before large-scale deployment.
Assessment
Claims (7)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| Municipal 311 call centers and complaint intake systems face a structural mismatch between incoming volume and classification capacity that produces a bottleneck and differential service quality that follows income and racial lines. (Inequality) | negative | high | differential service quality by income and race | 0.12 |
| We develop an equity-centered reinforcement learning (RL) framework that augments call classification capacity across six New York City Department of Buildings operational domains (boiler safety, crane and derrick oversight, heat and hot water, housing complaint triage, scaffold safety, and Natural Area District protection). (Task Allocation) | positive | high | call classification capacity / intake routing capability | 0.12 |
| Rather than replacing human classifiers, our agents act as intelligent intake routers that learn to assign incoming complaints to action categories: escalate, batch, defer, inspect now. (Task Allocation) | positive | high | complaint routing action assignment | 0.06 |
| The proposed technique is designed to maximize throughput, minimize misclassification cost, and actively narrow historical equity gaps in service delivery. (Organizational Efficiency) | positive | high | throughput; misclassification cost; historical equity gaps in service delivery | 0.02 |
| We formalize each domain as a Markov Decision Process (MDP) in which equitable classification coverage is a first-class reward objective. (Task Allocation) | positive | high | equitable classification coverage (as a modeled reward) | 0.12 |
| Post-hoc SHAP attribution reveals that complaint recurrence and neighborhood-level statistics are stronger predictors of actionable violations than raw complaint volume. (Decision Quality) | positive | high | predictive importance for actionable violations (feature importance) | 0.12 |
| The finding that recurrence and neighborhood statistics are stronger predictors than complaint volume has direct implications for complaint routing given the demographic correlates of those features. (Task Allocation) | mixed | high | implications for complaint routing policy/practice | 0.02 |