Semi-Markov Reinforcement Learning for City-Scale EV Ride-Hailing with Feasibility-Guaranteed Actions

We study city-scale control of electric-vehicle (EV) ride-hailing fleets where dispatch, repositioning, and charging decisions must respect charger and feeder limits under uncertain, spatially correlated demand and travel times. We formulate the problem as a hex-grid semi-Markov decision process (semi-MDP) with mixed actions -- discrete actions for serving, repositioning, and charging, together with continuous charging power -- and variable action durations. To guarantee physical feasibility during both training and deployment, the policy learns over high-level intentions produced by a masked, temperature-annealed actor. These intentions are projected at every decision step through a time-limited rolling mixed-integer linear program (MILP) that strictly enforces state-of-charge, port, and feeder constraints. To mitigate distributional shifts, we optimize a Soft Actor--Critic (SAC) agent against a Wasserstein-1 ambiguity set with a graph-aligned Mahalanobis ground metric that captures spatial correlations. The robust backup uses the Kantorovich--Rubinstein dual, a projected subgradient inner loop, and a primal--dual risk-budget update. Our architecture combines a two-layer Graph Convolutional Network (GCN) encoder, twin critics, and a value network that drives the adversary. Experiments on a large-scale EV fleet simulator built from NYC taxi data show that PD--RSAC achieves the highest net profit, reaching \$1.22M, compared with \$0.58M--\$0.70M for strong heuristic, single-agent RL, and multi-agent RL baselines, including Greedy, SAC, MAPPO, and MADDPG, while maintaining zero feeder-limit violations.

Summary

Main Finding

The paper develops PD–RSAC, a planning-aware, distributionally robust reinforcement learning (RL) framework for city-scale EV ride-hailing that (i) enforces hard physical constraints at training and deployment via a rolling MILP projection of policy “intentions” into feasible actions, and (ii) hedges against spatially correlated distributional shifts using a Wasserstein-1 ambiguity set with a graph-aligned Mahalanobis ground metric. In a large-scale NYC-taxi-derived simulator, PD–RSAC attains the highest net profit ($1.22M) and zero feeder-limit violations, substantially outperforming strong baselines (Greedy, SAC, MAPPO, MADDPG) which earn $0.58M–$0.70M.

Key Points

Problem formulation
- City region discretized into a hexagonal grid; fleet decisions: serve, reposition, charge.
- Semi-Markov decision process (semi-MDP) to model variable-duration actions (trips span multiple time steps; charging treated as 1-step actions for frequent re-evaluation).
- Aggregate hex-level state (idle/busy counts, avg SoC, demand inflow/outflow, infra utilization, feeder limits) to scale learning.
Feasibility-first action architecture
- Policy outputs “intentions” (mixed discrete mode + continuous charging power).
- Intentions are projected at every decision step through a time-limited rolling mixed-integer linear program (MILP) that enforces SoC safety, charger-port capacities, feeder power caps, assignment logic, minimum positive charging, and energy feasibility.
- If MILP times out with no incumbent, a deterministic greedy fallback builds a feasible action (priority to low-SoC vehicles). Hence actions executed are always physically admissible.
Robustness to distributional shift
- Optimize an SAC-based agent against a Wasserstein-1 ambiguity set around the empirical scenario distribution (origin–destination demand and travel times).
- Ground metric: graph-aligned Mahalanobis distance Q = diag(w) + β Q_graph (Laplacian-style), penalizing spatially disjoint perturbations more than local shifts—captures network topology and spatial correlations.
- Robust backup computed via Kantorovich–Rubinstein dual: inner minimization (adversary selecting worst-case scenario) solved approximately with projected (sub)gradient descent on scenarios; dual variable λ is adapted with a primal–dual risk-budget tracking rule to control realized adversarial radius.
Learning architecture & algorithms
- Shared two-layer GCN encoder for hex embeddings.
- Actor: Gumbel–Softmax for discrete mode selection (annealed temperature) + squashed Gaussian for continuous power; policy samples intentions.
- Twin critics + value network; value network used to guide adversary in inner loop.
- Duration-aware soft backup: critic targets incorporate γ^{Δ(action)} discount to account for multi-step action durations.
- The overall training loop stores feasible transitions (post-projection) in replay buffer and updates actor/critics using robust targets.
Empirical results
- Large-scale simulator built from NYC taxi data.
- PD–RSAC achieves net profit $1.22M and zero feeder-limit violations.
- Baselines (heuristic Greedy, standard SAC, MAPPO, MADDPG) earn $0.58M–$0.70M and/or violate feeder constraints.
- Architecture balances revenue, driving cost, electricity cost, and customer service (waiting/drops).

Data & Methods

Data / Simulator
- Synthetic/empirical scenarios derived from NYC taxi traces to produce OD demand intensity matrices and travel time fields over a hex grid; scenarios are stochastic and spatially correlated.
- Simulator explicitly tracks individual vehicles for state updates but uses aggregated hex-level state for learning.
Model specifics
- Semi-MDP with action durations: serve and reposition durations equal travel-time sums; charge actions modeled as 1-step with repeated steps representing longer sessions.
- SoC guard masks infeasible orders (ensures enough energy for pickup + trip + safety buffer).
- Reward combines trip revenue, driving/ops cost, electricity cost (dynamic station prices), waiting penalties and drop penalties.
MILP projection
- Objective: maximize revenue minus driving/electricity costs and a penalty for deviation from policy intention (L1 distance), subject to assignment, port capacities, feeder power cap, positive charging lower bound, and energy dynamics constraints.
- Solved with branch-and-bound under fixed time limit τ_max; incumbent returned if found, otherwise greedy fallback (Algorithm 1).
- Guarantees: projection always returns a feasible action (MILP incumbent or greedy fallback).
Robust optimization (WDRO)
- Ambiguity set: Wasserstein-1 ball of radius ρ under metric d_Q.
- Dual representation (Kantorovich–Rubinstein) yields a supremum over λ≥0 and an inner minimization over scenario perturbations.
- Inner loop approximated by projected (sub)gradient descent on scenario ξ: update uses ∇_ξ Vφ(s′(ξ)) (backprop through value net) and subgradient of d_Q; Δ(ξ) treated as piecewise constant (its gradient ignored).
- Primal–dual update: track realized radius ρ̂_t = d_Q(ξ*_t, ξ̂_t) and update λ_t toward a target ρ_target via diminishing stepsize to control adversary strength.
- Training architecture couples this adversary-driven robust target with SAC updates (entropy regularization, twin critics).
Network & optimization details
- Two-layer GCN encoder using normalized adjacency with self-loops; SiLU activations.
- Gumbel–Softmax for differentiable discrete sampling; temperature annealed during training.
- Twin Q-networks and a separate value network used as in SAC variants; value net used in constructing adversarial targets.

Implications for AI Economics

Operational value and cost savings
- Embedding constraint guarantees into learned policies enables RL deployment in infrastructure-sensitive domains (power feeders, charger capacity) where single violations can cause system-level failures and heavy penalties.
- The reported profit uplift (roughly 70–100% relative to strong baselines) suggests significant economic gains when combining feasible-action projections with robustness to distributional shifts—valuable for fleet operators optimizing revenue vs. energy/grid constraints.
Investment and infrastructure signals
- Feasibility-aware RL makes the marginal value of additional ports or feeder capacity explicit: the MILP/feasibility constraints and the empirical shortages encountered during projection/fallback can inform where infrastructure investments yield highest returns.
- Dynamic electricity pricing and the electricity cost term in reward mean learned policies implicitly perform cost-aware charging; such behavior can reduce peak feeder loads and lower energy bills, affecting utility pricing and grid planning.
Risk management & regulatory fit
- Wasserstein DRO with graph-aware metric produces policies robust to localized spatial demand shocks (e.g., events, transit delays). This lowers operational risk and reduces the probability of catastrophic feeder trips—important for regulators and grid operators.
- The primal–dual risk-budget mechanism provides an interpretable knob (ρ_target, λ) to trade robustness (worst-case protection) against average performance—this supports contractual or regulatory risk tolerances.
Deployment & scalability trade-offs
- Real-time requirement: rolling MILP solved at each decision step (even with time limits and fallback) introduces nontrivial computational cost. Operators must provision compute resources or relax time budgets, which trades off optimality vs. latency.
- The MILP fallback guarantees safety but may produce conservative or suboptimal actions when the solver cannot find incumbents; economic performance depends on solver reliability and τ_max tuning.
Policy design insights
- Separating intentions from feasible actions decouples learning from hard constraints—this modularity allows reuse of learned policies across different constraint regimes by swapping projection layers, useful economically when operating regions differ in infrastructure.
- Graph-aligned cost for ambiguity sets demonstrates that domain-aware metrics in DRO (vs. vanilla Euclidean) yield better robustness for networked economic systems (transport, energy) where perturbations are spatially structured.
Research and market implications
- Demonstrates a template for combining prescriptive optimization (MILP feasibility) with data-driven RL to obtain both high performance and hard safety guarantees—likely to accelerate RL adoption in regulated market segments (transportation, logistics, grid-interactive fleets).
- Opens avenues for market products: feasibility-projection-as-a-service, DRO-tuned fleet control suites, and decision-support tools that quantify the value of charging infrastructure upgrades.

Caveats & open questions (economic relevance) - Results are simulator-based (NYC taxi-derived); transfer to other cities, different demand patterns, or emerging mobility modes requires validation—economic gains may vary by market structure. - Computational cost of MILP projection and inner-loop adversary optimization matters for real-time operations; costs of additional compute should be weighed against profit gains. - The model treats battery SoC and charging costs but (appears to) omit battery degradation/economics over lifetime—incorporating degradation would affect charging scheduling and capex vs. opex tradeoffs. - Robustness choices (ρ, Q construction, λ dynamics) materially affect conservatism; operators need economic calibration of these hyperparameters to their risk preferences. - Equity and market impacts (e.g., service availability across neighborhoods) were not a primary focus—economic/regulatory assessments should consider fairness and externalities.

Summary: PD–RSAC blends constrained optimization and distributionally robust RL with graph-aware DRO to produce profitable, safety-guaranteed EV fleet control in a city-scale simulator. The technique has clear economic value for fleet operators and grid stakeholders but requires attention to computational costs, robustness calibration, and empirical transferability for real-world deployment.

Assessment

Paper Typedescriptive Evidence Strengthlow — Results are from a large-scale simulator built from NYC taxi data rather than real-world deployment or quasi-experimental variation; performance improvements are compelling in simulation but may not translate to actual fleets due to modeling assumptions, simulator fidelity, and unmodeled market/behavioral factors. Methods Rigorhigh — The paper combines a principled semi-MDP formulation, MILP-based hard constraint projection, a Wasserstein-robust SAC objective with a graph-aware ground metric, and a sophisticated GCN-based architecture, and it benchmarks against multiple strong baselines; experimental rigor is high within the simulated environment though some implementation and tuning details (sensitivity, ablations, and real-world validation) are not described here. SampleA large-scale EV fleet simulator constructed from NYC taxi trip data mapped onto a hex grid; the environment models spatially correlated demand and travel times, charger locations/ports and electrical feeder limits, state-of-charge dynamics, and a synthetic EV fleet whose dispatch, repositioning, and charging are controlled by the agent; numerical results compare PD--RSAC to heuristics and several RL baselines (Greedy, SAC, MAPPO, MADDPG). Themesproductivity innovation GeneralizabilitySimulator-based results may not hold in real-world operations due to simplifications and unmodeled phenomena (driver behavior, traffic dynamics, market interactions)., Built from NYC taxi data so demand patterns and infrastructure specifics may not generalize to other cities or contexts., Assumptions about charger power models, feeder configurations, and cost/revenue parameters may drive results and differ across operators., Robustness to adversarial or nonstationary real-world demand and rare events is tested only within the simulated ambiguity set., Scalability and real-time computational feasibility of MILP projection in live deployments may limit adoption.

Claims (9)

Claim	Direction	Confidence	Outcome	Details
We formulate the problem as a hex-grid semi-Markov decision process (semi-MDP) with mixed actions -- discrete actions for serving, repositioning, and charging, together with continuous charging power -- and variable action durations. Other	null_result	high	problem formulation (hex-grid semi-MDP with mixed and continuous actions and variable durations)	0.03
The policy learns over high-level intentions produced by a masked, temperature-annealed actor. Other	null_result	high	policy representation (high-level intentions from masked, temperature-annealed actor)	0.03
These intentions are projected at every decision step through a time-limited rolling mixed-integer linear program (MILP) that strictly enforces state-of-charge, port, and feeder constraints. Other	null_result	high	constraint compliance via MILP projection (state-of-charge, port, feeder constraints)	0.18
To mitigate distributional shifts, we optimize a Soft Actor--Critic (SAC) agent against a Wasserstein-1 ambiguity set with a graph-aligned Mahalanobis ground metric that captures spatial correlations. Other	null_result	high	robustness to distributional shift via Wasserstein-1 ambiguity set with graph-aligned Mahalanobis metric	0.18
The robust backup uses the Kantorovich--Rubinstein dual, a projected subgradient inner loop, and a primal--dual risk-budget update. Other	null_result	high	robust backup algorithm design and optimization procedure	0.03
Our architecture combines a two-layer Graph Convolutional Network (GCN) encoder, twin critics, and a value network that drives the adversary. Other	null_result	high	model architecture components (2-layer GCN encoder, twin critics, adversary-driven value network)	0.18
Experiments on a large-scale EV fleet simulator built from NYC taxi data show that PD--RSAC achieves the highest net profit, reaching $1.22M. Firm Revenue	positive	high	net profit	$1.22M 0.18
Strong heuristic, single-agent RL, and multi-agent RL baselines (including Greedy, SAC, MAPPO, and MADDPG) achieved net profit in the range $0.58M--$0.70M in the same experiments. Firm Revenue	negative	high	net profit of baseline methods (Greedy, SAC, MAPPO, MADDPG)	$0.58M--$0.70M 0.18
PD--RSAC maintained zero feeder-limit violations in the experiments. Regulatory Compliance	positive	high	number of feeder-limit violations	zero feeder-limit violations 0.18

A robust, MILP‑constrained RL controller more than doubles simulated net profit for an EV ride‑hailing fleet—$1.22M versus $0.58M–$0.70M for strong baselines—while strictly avoiding electrical feeder violations.