
A hierarchical reinforcement-learning controller improves simulated omnichannel supply-chain performance over common heuristics under demand shocks, though it still trails a perfect-information benchmark.

Omnichannel Supply Chains Amid Demand Shocks: A Centralized Hierarchical Reinforcement Learning Framework
Panagiotis G. Giannopoulos, Thomas K. Dasaklis · April 14, 2026 · Logistics
Source: OpenAlex · Paper type: descriptive · Evidence strength: medium · Relevance: 7/10
A centralized hierarchical RL controller with a capacity-aware state–action encoding improves simulated omnichannel supply-chain performance under demand shocks versus forecast-driven and greedy heuristics, though it remains below a perfect-information oracle.

Background: The rapid evolution of omnichannel retailing has reshaped retail supply chains (SCs) by coupling replenishment, fulfillment, and service decisions across multiple demand channels under inventory, lead-time, and capacity constraints. These interdependencies create coordination challenges, particularly when demand shocks interact with limited operational capacity.

Methods: To address these challenges, this study develops a centralized Hierarchical Reinforcement Learning (HRL) control framework that makes decision timing explicit: replenishment and allocation are optimized weekly, while fulfillment and lateral inventory rebalancing are controlled daily. Policies are learned using Proximal Policy Optimization (PPO) in an actor–critic architecture, with bounded stochastic policies for constrained action spaces. To mitigate the curse of dimensionality in HRL, we introduce a capacity-aware state–action encoding mechanism that compresses the control interface into structured summary signals. Demand shocks are modeled using two specifications: a mixed profile, where half the products follow a uniform demand process and the rest a Merton-type jump-diffusion process, and a fully shock-driven profile.

Results: The framework is evaluated against forecast-driven base-stock and greedy fulfillment heuristics, and a perfect-information oracle, with pairwise differences examined through Wilcoxon signed-rank tests.

Conclusions: Overall, the proposed framework improves learning efficiency and scalability, outperforming heuristic baselines while remaining below the oracle bound.

Summary

Main Finding

A centralized hierarchical reinforcement learning (HRL) controller that explicitly separates decision timing (weekly replenishment/allocation vs. daily fulfillment/rebalancing), combined with a capacity-aware state–action encoding and Proximal Policy Optimization (PPO), yields more efficient learning and better operational performance in constrained omnichannel retail supply chains than common heuristics (forecast-driven base-stock and greedy fulfillment). The learned policies approach but do not reach a perfect-information oracle.
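To make the two-timescale structure concrete, here is a minimal control-loop sketch in Python. The environment and policy objects (`env`, `upper_policy`, `lower_policy`) and their method names are illustrative assumptions, not the authors' implementation:

```python
DAYS_PER_WEEK = 7

def run_episode(env, upper_policy, lower_policy, weeks=52):
    """Roll out one episode of the hierarchical controller: a weekly
    replenishment/allocation decision wraps an inner loop of daily
    fulfillment/rebalancing decisions. All objects are placeholders."""
    state = env.reset()
    total_reward = 0.0
    for week in range(weeks):
        # Upper level (weekly): place replenishment orders and allocate
        # inbound inventory across locations/channels.
        replenish, allocate = upper_policy.act(state)
        state = env.apply_weekly(replenish, allocate)
        for day in range(DAYS_PER_WEEK):
            # Lower level (daily): fulfill channel demand and laterally
            # rebalance stock between locations.
            fulfill, rebalance = lower_policy.act(state)
            state, reward = env.step_daily(fulfill, rebalance)
            total_reward += reward
    return total_reward
```

The explicit weekly/daily nesting is what "making decision timing explicit" amounts to: the upper policy acts once per week, the lower policy seven times within it.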

Key Points

  • Problem context: Omnichannel retail couples replenishment, fulfillment, and service decisions across channels under inventory, lead-time, and capacity constraints; coordination is crucial when demand shocks hit constrained operations.
  • HRL design: A centralized hierarchical controller makes weekly decisions for replenishment and allocation and daily decisions for fulfillment and lateral inventory rebalancing, making decision timing explicit and aligned with operational rhythms.
  • RL algorithm: Policies are learned with PPO in an actor–critic architecture; action outputs are bounded stochastic policies to respect constrained action spaces.
  • Dimensionality mitigation: Introduces a capacity-aware state–action encoding that compresses the control interface into structured summary signals, reducing the effective dimensionality of the HRL problem.
  • Demand shock modeling: Two demand specifications:
    • Mixed profile: half the products follow a uniform demand process; half follow a Merton-type jump-diffusion process (to capture sudden jumps).
    • Fully shock-driven profile: demand dominated by shocks.
  • Benchmarks and evaluation: Compared against forecast-driven base-stock and greedy fulfillment heuristics and a perfect-information oracle; pairwise differences tested with Wilcoxon signed-rank tests.
  • Results: The HRL framework outperforms heuristic baselines (improved learning efficiency and scalability) but remains below the oracle’s performance, demonstrating a gap to perfect foresight.

Data & Methods

  • Environment: Simulated omnichannel retail supply chain with multiple demand channels, inventory and lead-time dynamics, operational capacity limits, and the option for lateral inventory rebalancing.
  • Hierarchical control structure:
    • Higher level (weekly): Replenishment decisions and allocation of incoming inventory to locations/channels.
    • Lower level (daily): Fulfillment decisions and lateral rebalancing between locations.
  • Reinforcement learning:
    • Algorithm: Proximal Policy Optimization (PPO).
    • Architecture: Actor–critic.
    • Action representation: Bounded stochastic policies to satisfy action constraints (e.g., nonnegativity, capacity); see the actor-head sketch after this list.
  • State–action compression: A capacity-aware encoding mechanism that maps high-dimensional local states and actions into structured summary signals to ease learning and mitigate the curse of dimensionality in HRL (an encoding sketch follows the list).
  • Demand processes:
    • Uniform demand for a subset of products.
    • Merton-type jump-diffusion to model infrequent, large demand shocks for the remainder (a simulation sketch follows the list).
    • Alternative scenario where demand is fully shock-driven.
  • Baselines and comparison:
    • Forecast-driven base-stock policy (an illustrative order-up-to rule is sketched after this list).
    • Greedy fulfillment heuristics.
    • Perfect-information oracle as an upper bound.
  • Statistical analysis: Pairwise performance differences assessed using Wilcoxon signed-rank tests to evaluate significance; a usage sketch follows.
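A common way to realize bounded stochastic policies in a PPO actor is a tanh-squashed Gaussian head; the PyTorch sketch below keeps actions in [0, capacity] by construction. The layer sizes, the squashing choice, and the scalar capacity bound are assumptions, not the paper's specification:

```python
import torch
import torch.nn as nn

class BoundedGaussianActor(nn.Module):
    """Illustrative PPO actor head: a Gaussian whose samples are squashed
    by tanh and rescaled into [0, capacity], so nonnegativity and capacity
    constraints hold by construction."""

    def __init__(self, state_dim, action_dim, capacity):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh())
        self.mu = nn.Linear(64, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))
        self.capacity = capacity  # per-dimension upper bound on actions

    def forward(self, state):
        h = self.body(state)
        dist = torch.distributions.Normal(self.mu(h), self.log_std.exp())
        raw = dist.rsample()                    # reparameterized sample
        squashed = torch.tanh(raw)
        action = 0.5 * (squashed + 1.0) * self.capacity  # -> [0, capacity]
        # Change-of-variables correction for the tanh-and-rescale transform,
        # so log_prob refers to the bounded action, not the raw sample.
        log_det = torch.log(0.5 * self.capacity * (1.0 - squashed.pow(2)) + 1e-8)
        log_prob = (dist.log_prob(raw) - log_det).sum(-1)
        return action, log_prob
```

During PPO updates, the returned log-probabilities feed the clipped surrogate ratio in the usual way.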
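The capacity-aware encoding itself is not fully specified in this summary; the sketch below illustrates only the general idea of compressing a location's high-dimensional state into a few capacity-normalized summary signals. The specific features are assumptions for illustration:

```python
import numpy as np

def encode_location_state(inventory, pipeline, forecast, capacity):
    """Illustrative capacity-aware compression: map per-SKU inventory,
    inbound pipeline, and demand forecast vectors to a small vector of
    capacity-normalized summary signals."""
    utilization = inventory.sum() / capacity        # stored units vs. capacity
    inbound_ratio = pipeline.sum() / capacity       # in-transit units vs. capacity
    cover = inventory / np.maximum(forecast, 1e-8)  # days of cover per SKU
    return np.array([utilization, inbound_ratio, cover.mean(), cover.min()])
```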
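For the shock process, a Merton-type jump-diffusion is a geometric diffusion multiplied by lognormal jumps arriving at Poisson times. The sketch below simulates one daily demand path; all parameter values are illustrative, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def merton_demand_path(d0=100.0, mu=0.0, sigma=0.1, lam=0.2,
                       jump_mu=-0.3, jump_sigma=0.2, days=364):
    """Simulate one daily demand path under a Merton-type jump-diffusion:
    a geometric Brownian step plus a compound-Poisson jump each day."""
    dt = 1.0
    path = np.empty(days)
    path[0] = d0
    for t in range(1, days):
        diffusion = (mu - 0.5 * sigma**2) * dt \
                    + sigma * np.sqrt(dt) * rng.standard_normal()
        n_jumps = rng.poisson(lam * dt)                      # shocks this day
        jump = rng.normal(jump_mu, jump_sigma, size=n_jumps).sum()
        path[t] = path[t - 1] * np.exp(diffusion + jump)
    return path
```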
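The forecast-driven base-stock baseline can be read as an order-up-to rule; here is a minimal sketch, assuming a single-echelon form with target level S equal to forecast lead-time demand plus safety stock (the paper's exact baseline may differ):

```python
def base_stock_order(inventory_position, lead_time_forecast, safety_stock):
    """Illustrative forecast-driven base-stock (order-up-to) rule: order
    the gap between the target S and the current inventory position,
    never a negative quantity."""
    order_up_to = lead_time_forecast + safety_stock  # target level S
    return max(0.0, order_up_to - inventory_position)
```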
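The pairwise statistical comparison maps directly onto scipy.stats.wilcoxon. The per-instance cost numbers below are made up for illustration:

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-instance total costs for two policies, paired by
# simulated instance.
hrl_cost = np.array([102.3, 98.7, 110.1, 95.4, 101.9, 99.2])
heuristic_cost = np.array([108.9, 101.2, 115.6, 99.8, 104.3, 103.7])

# Paired, nonparametric test of whether the cost differences are
# symmetric around zero, as in the paper's evaluation protocol.
stat, p_value = wilcoxon(hrl_cost, heuristic_cost)
print(f"Wilcoxon statistic = {stat:.1f}, p = {p_value:.4f}")
```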

Implications for AI Economics

  • Operational value of RL: Demonstrates that modern policy-gradient RL (PPO) combined with hierarchical structuring and tailored encodings can capture complex multi-timescale coordination problems in retail supply chains better than traditional heuristics, implying productivity gains from AI-driven operational control.
  • Investment justification: The approach suggests a potential ROI for retailers investing in RL-based orchestration—especially those facing frequent demand shocks and tight capacity—by improving service levels and resource utilization relative to simple heuristics.
  • Centralization vs decentralization: The study focuses on a centralized controller; economic implications include potential efficiency gains but also concentration of decision-making power and increased reliance on centralized data and compute infrastructure.
  • Robustness and risk: Modeling with jump processes highlights the importance of shock-aware policies. However, the gap to the oracle indicates remaining value in better demand forecasting, information-sharing, or hybrid approaches (combining predictive models with RL) to approach the upper bound.
  • Market structure and competition: Improved fulfillment efficiency can change competitive dynamics (e.g., lower stockouts, faster delivery), which may favor larger retailers able to deploy such AI systems unless similar tools become accessible to smaller firms.
  • Policy and labor considerations: Automation of complex coordination tasks may displace some operational roles but can also shift labor to exception handling and strategy; regulators and firms should anticipate transitional effects.
  • Research directions: Extend to decentralized/multi-agent architectures, integrate pricing or assortment decisions, test transfer/generalization across networks and real data, quantify welfare impacts, and evaluate costs of data, compute, and implementation for widespread adoption.

Assessment

  • Paper Type: descriptive
  • Evidence Strength: medium. The paper provides systematic simulation evidence that the HRL framework outperforms standard heuristics and quantifies differences with nonparametric tests; however, results are limited to simulated environments with specific demand-process assumptions, no real-world deployment or observational/experimental validation, and limited information on robustness to modeling choices.
  • Methods Rigor: medium. Methods use modern RL tools (hierarchical design, PPO actor–critic, bounded stochastic policies) and a capacity-aware encoding to address dimensionality, and performance is compared to reasonable baselines plus an oracle with statistical testing; but the paper (as described) lacks detail on hyperparameter tuning, comprehensive ablation/sensitivity analyses, real-data calibration, and scalability/compute cost reporting that would strengthen methodological credibility.
  • Sample: Simulated omnichannel retail supply-chain environments with hierarchical decision timing (weekly replenishment/allocation; daily fulfillment and lateral rebalancing); two demand-shock specifications: (1) mixed profile (half products follow a uniform demand process, half follow a Merton-type jump-diffusion) and (2) fully shock-driven profile; compared policies: centralized HRL (PPO-based) with capacity-aware state–action encoding, forecast-driven base-stock heuristics, greedy fulfillment heuristics, and a perfect-information oracle; evaluation uses Wilcoxon signed-rank pairwise tests across simulated instances.
  • Themes: productivity, innovation
  • Generalizability:
    • Results are based on simulated environments and may not transfer to real-world retail SCs without calibration to real demand, lead times, cost structures, and operational constraints.
    • Demand processes (uniform and Merton jump-diffusion) are specific choices; performance may differ under alternative demand dynamics (seasonality, substitution, nonstationarity).
    • Scale and topology of the simulated network (number of products, nodes, capacities) are not specified here and may limit applicability to larger or more complex supply chains.
    • Reward/cost formulations, service-level targets, and assumptions about information availability likely affect outcomes and may not match industry practices.
    • Computational/training resource requirements and implementation frictions (latency, reliability, human override) are not assessed, limiting practical deployability.
    • Comparisons use particular heuristic baselines; stronger industry benchmarks or hybrid human+AI policies might narrow observed gains.

Claims (6)

  1. The study develops a centralized Hierarchical Reinforcement Learning (HRL) control framework that makes decision timing explicit: replenishment and allocation are optimized weekly, while fulfillment and lateral inventory rebalancing are controlled daily.
     Outcome: Other · Direction: null_result · Confidence: high · 0.3
     Details: decision timing policy (weekly replenishment/allocation; daily fulfillment/rebalancing)
  2. Policies are learned using Proximal Policy Optimization (PPO) in an actor–critic architecture, with bounded stochastic policies to handle constrained action spaces.
     Outcome: Other · Direction: null_result · Confidence: high · 0.3
     Details: learning algorithm and policy parameterization (PPO actor–critic with bounded stochastic policies)
  3. To mitigate the curse of dimensionality in HRL, the paper introduces a capacity-aware state–action encoding mechanism that compresses the control interface into structured summary signals.
     Outcome: Training Effectiveness · Direction: positive · Confidence: high · 0.18
     Details: state–action dimensionality reduction and improved scalability/learning efficiency
  4. Demand shocks are modeled using two specifications: a mixed profile (half the products follow a uniform demand process and the rest follow a Merton-type jump-diffusion process) and a fully shock-driven profile.
     Outcome: Other · Direction: null_result · Confidence: high · 0.3
     Details: demand process specification used in experiments (mixed vs fully shock-driven)
  5. The framework is evaluated against forecast-driven base-stock and greedy fulfillment heuristics, and against a perfect-information oracle; pairwise differences are examined using Wilcoxon signed-rank tests.
     Outcome: Other · Direction: null_result · Confidence: high · 0.3
     Details: evaluation protocol (comparators and statistical test used)
  6. Overall, the proposed HRL framework improves learning efficiency and scalability, outperforming heuristic baselines while remaining below the perfect-information oracle bound.
     Outcome: Organizational Efficiency · Direction: mixed · Confidence: high · 0.18
     Details: policy performance (learning efficiency, scalability, and supply-chain control performance relative to heuristics and oracle)
