Inventory-theory regularization steadies deep-RL: grounding DRL policies in base-stock concepts speeds hyperparameter tuning and raises inventory performance, enabling a 100% deployment on Alibaba's Tmall; synthetic tests show regularization alters which DRL methods perform best.

DeepStock: Reinforcement Learning with Policy Regularizations for Inventory Management

Yaqi Xie, Xinru Hao, Jiaxi Liu, Will Ma, Linwei Xin, Lei Cao, Yidong Zhang · March 20, 2026

arXiv · descriptive · medium evidence · relevance 7/10

Policy regularizations based on classical inventory concepts (e.g., base-stock) make DRL inventory policies more robust and faster to tune, improving final performance in simulation and enabling a full-scale production rollout on Alibaba's Tmall.

Deep Reinforcement Learning (DRL) provides a general-purpose methodology for training inventory policies that can leverage big data and compute. However, off-the-shelf implementations of DRL have seen mixed success, often plagued by high sensitivity to the hyperparameters used during training. In this paper, we show that by imposing policy regularizations, grounded in classical inventory concepts such as "Base Stock", we can significantly accelerate hyperparameter tuning and improve the final performance of several DRL methods. We report details from a 100% deployment of DRL with policy regularizations on Alibaba's e-commerce platform, Tmall. We also include extensive synthetic experiments, which show that policy regularizations reshape the narrative on what is the best DRL method for inventory management.

Summary

Main Finding

Imposing simple, inventory-theory grounded policy regularizations (a Base-stock form and a linear coefficient form, or their combination) on deep RL policies sharply reduces hyperparameter sensitivity, speeds up training convergence, and improves final performance for inventory replenishment. These regularizations enabled Alibaba to deploy a single, unified DRL policy (DDPG + regularizations) at 100% scale on Tmall (covering >1 million SKU–warehouse pairs).

Key Points

Policy regularizations
- Base regularization: π(It,xt) = max{µBase(It,xt) − tot(It), 0}, where µBase is learned and tot(It) is total incoming/on‑hand inventory (classic base-stock target).
- Coeff regularization: π(It,xt) = µCoeff(It,xt)⊤feat(xt), where µCoeff outputs coefficients multiplied by m′ demand/forecast features (Alibaba used m′=5: four demand features + bias).
- Both: π(It,xt) = max{µBoth(It,xt)⊤feat(xt) − tot(It), 0}.
- These are action remappings (a structured parameterization) that bias learning toward sensible inventory behavior without strictly reducing expressiveness (the network can, in principle, undo the mapping).
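The three remappings above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the network outputs µBase, µCoeff and µBoth are passed in as plain numbers, and tot(It) is assumed to be the sum of on-hand and pipeline inventory.

```python
from typing import Sequence


def total_inventory(inventory: Sequence[float]) -> float:
    """tot(I_t): on-hand plus all pipeline inventory (assumed to be a plain sum)."""
    return float(sum(inventory))


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("coefficient and feature vectors must have equal length")
    return float(sum(x * y for x, y in zip(a, b)))


def base_action(mu_base: float, inventory: Sequence[float]) -> float:
    """Base: order up to the learned target mu_base, never a negative amount."""
    return max(mu_base - total_inventory(inventory), 0.0)


def coeff_action(mu_coeff: Sequence[float], feat: Sequence[float]) -> float:
    """Coeff: learned coefficients applied to demand/forecast features."""
    return dot(mu_coeff, feat)


def both_action(mu_coeff: Sequence[float], feat: Sequence[float],
                inventory: Sequence[float]) -> float:
    """Both: the linear combination acts as the base-stock target."""
    return max(dot(mu_coeff, feat) - total_inventory(inventory), 0.0)


# Illustrative numbers only (not taken from the paper).
inventory = [4.0, 3.0, 2.0]            # on-hand followed by pipeline
feat = [10.0, 12.0, 11.0, 9.0, 1.0]    # four demand features plus a bias term
mu_coeff = [0.5, 0.25, 0.0, 0.0, 2.0]  # stand-in for a network output
print(base_action(15.0, inventory))
print(coeff_action(mu_coeff, feat))
print(both_action(mu_coeff, feat, inventory))
```

In the Base and Both forms, a higher total inventory can only lower the order for a fixed network output, which is the kind of structure that guards against ordering heavily when stock is already high.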
Major empirical takeaways (Takeaways I–III)
- Regularizations improve DRL performance, especially when hyperparameter tuning is limited; they reduce obvious blunders (e.g., ordering large amounts despite high on‑hand inventory or violating monotonicity with demand forecasts) and accelerate Q-function convergence.
- Regularizations change the DRL-method comparison: they tilt the balance in favor of traditional DRL methods (DDPG, PPO) over Differentiable Simulator (DS), particularly when data (IID trajectories) or hyperparameter tuning is limited. DS can overfit when there are not enough parallel trajectories; with ample IID trajectories, DS matches traditional DRL.
- Regularizations enabled a unified, full-scale DRL deployment at Alibaba (DDPG + regularizations), removing the need to cluster SKUs and train per-group models.
Practical deployment details and scale
- Deployed at Alibaba (Tmall) as of Oct 2025: 100% coverage of products sold on the platform, covering over 1 million SKU–warehouse combinations.
- Alibaba observed DDPG (with regularizations) outperforming DS on its offline data (example: learning from 55,000 90-day SKU trajectories).
Metrics and trade-offs used in practice
- Synthetic experiments used standard underage/overage loss ℓ_bh.
- Alibaba evaluates using Stockout Rate ℓ_SR and Turnover Time ℓ_TT (weighted combination used during training/validation to obtain operational trade-offs).
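For reference, a standard underage/overage loss can be written as below. This is a sketch under assumptions: the summary does not spell out ℓ_bh, so the per-unit cost names `b` (underage) and `h` (overage) are inferred from the subscript, and the per-period averaging is one common convention rather than the paper's stated definition.

```python
from typing import Sequence


def underage_overage_loss(on_hand: float, demand: float, b: float, h: float) -> float:
    """One-period loss: b per unit of unmet demand, h per unit left over."""
    underage = max(demand - on_hand, 0.0)
    overage = max(on_hand - demand, 0.0)
    return b * underage + h * overage


def average_loss(on_hand_series: Sequence[float], demand_series: Sequence[float],
                 b: float, h: float) -> float:
    """Mean per-period loss over one trajectory (averaging is an assumption)."""
    if len(on_hand_series) != len(demand_series):
        raise ValueError("series must have equal length")
    losses = [underage_overage_loss(i, d, b, h)
              for i, d in zip(on_hand_series, demand_series)]
    return sum(losses) / len(losses)


# Illustrative costs: a stockout is nine times as costly as a leftover unit.
print(underage_overage_loss(10.0, 13.0, b=9.0, h=1.0))
print(underage_overage_loss(10.0, 6.0, b=9.0, h=1.0))
```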
Implementation details that mattered
- Normalize dynamic and inventory features per SKU; Coeff regularization multiplies by feat(xt) to denormalize actions, enabling meta-learning across SKUs with heterogeneous demand magnitudes.
- Policies trained with mainstream DRL algorithms: DDPG (off‑policy, Q-learning style), PPO (on‑policy policy-gradient), and DS (trajectory-level differentiable simulator).
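The normalization point can be illustrated with two SKUs whose demand differs by a factor of 100. The sketch below is an assumption-laden toy: the scale (mean historical demand), the helper names, and the fixed coefficient vector standing in for a network output are all hypothetical. It shows that the network can see identical normalized inputs while the Coeff remapping restores each SKU's own units.

```python
from typing import List, Sequence


def sku_scale(history: Sequence[float]) -> float:
    """Per-SKU scale; mean historical demand is an assumed choice."""
    mean = sum(history) / len(history)
    return mean if mean > 0 else 1.0


def normalize(history: Sequence[float], scale: float) -> List[float]:
    """Features as the network would see them, free of demand magnitude."""
    return [v / scale for v in history]


def coeff_order(coeffs: Sequence[float], raw_feat: Sequence[float]) -> float:
    """Coeff remapping: coefficients times raw features gives an order in SKU units."""
    return sum(c * f for c, f in zip(coeffs, raw_feat))


small = [2.0, 4.0, 6.0, 4.0]          # slow-moving SKU
large = [200.0, 400.0, 600.0, 400.0]  # same pattern, 100x the volume

net_in_small = normalize(small, sku_scale(small))
net_in_large = normalize(large, sku_scale(large))

# Identical normalized inputs would yield identical coefficients from one network.
coeffs = [0.25, 0.25, 0.25, 0.25, 0.0]  # stand-in for a network output

order_small = coeff_order(coeffs, small + [1.0])  # raw features plus bias
order_large = coeff_order(coeffs, large + [1.0])
print(net_in_small == net_in_large, order_small, order_large)
```

Because the coefficients are dimensionless, one set of network weights can serve SKUs of very different volume, which is consistent with the single unified policy described in the deployment.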
Limitations / caveats noted by authors
- Simplified inventory dynamics (deterministic lead times, periodic order epochs) and inferred demand during stockouts—practical but imperfect modeling.
- Regularizations are problem-specific inductive biases; require domain knowledge and feature design (feat(xt)).

Data & Methods

Inventory model
- Discrete-time days t = 1..T; orders placed every P days, arrive after deterministic lead time L.
- Inventory state I_t = (I_t^0, I_t^1, ..., I_t^{L-1}): on‑hand and pipeline inventories; update rules follow standard lead‑time dynamics.
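A minimal sketch of these dynamics follows. The summary does not give the exact update rule, so several choices here are assumptions: the first state entry is on-hand stock, unmet demand is lost rather than backlogged, the pipeline shifts by one slot per day, and orders are placed on days where t is a multiple of P. All function names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def step(inventory: Sequence[float], order: float,
         demand: float) -> Tuple[List[float], float]:
    """Advance one day; the state length L (the lead time) is preserved."""
    on_hand = inventory[0]
    sales = min(on_hand, demand)        # assumed lost sales: cannot sell missing stock
    leftover = on_hand - sales
    pipeline = list(inventory[1:]) + [order]  # the new order joins the back of the queue
    next_state = [leftover + pipeline[0]] + pipeline[1:]
    return next_state, sales


def simulate(inventory: Sequence[float], demands: Sequence[float],
             order_fn: Callable[[Sequence[float]], float],
             period: int) -> Tuple[List[float], float]:
    """Run a trajectory, placing an order only every `period` days."""
    state = list(inventory)
    total_sales = 0.0
    for t, demand in enumerate(demands):
        order = order_fn(state) if t % period == 0 else 0.0
        state, sales = step(state, order, demand)
        total_sales += sales
    return state, total_sales


def base_stock_12(state: Sequence[float]) -> float:
    """Illustrative fixed base-stock policy with a target of 12 units."""
    return max(12.0 - sum(state), 0.0)


final_state, sold = simulate([5.0, 3.0, 2.0], [6.0, 2.0, 4.0], base_stock_12, period=2)
print(final_state, sold)
```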
State and context
- Context xt ∈ R^m contains static (category, supplier, lead time, margins) and dynamic features (promotions, seasonality, forecasts).
- At Alibaba: feat(xt) was 5-dimensional (4 historical/forecast demand features + bias).
Policies and regularizations
- Unregularized: neural network outputs order quantity directly π(It,xt).
- Regularized forms as above (Base, Coeff, Both) map network outputs into inventory-meaningful actions (target stock levels or linear combos of demand features).
DRL training
- Methods: DDPG (off‑policy, learns Q-function), PPO (on‑policy policy gradients), DS (differentiable simulator computing gradients of the full trajectory loss).
- Datasets: D_train / D_validate / D_test of SKU trajectories (synthetic and Alibaba historical trajectories). Demand d_t was inferred for periods in which stockouts occurred.
- Objectives: synthetic experiments minimize ℓ_bh; Alibaba uses weighted ℓ_SR and ℓ_TT for training/evaluation.
- Hyperparameter studies: performance tracked as best-so-far across hyperparameter trials; regularization reduced sensitivity and improved best-so-far performance early in tuning.
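The best-so-far tracking used in the hyperparameter studies amounts to a running minimum over trial losses. The sketch below uses made-up validation losses purely for illustration (they are not results from the paper), and the helper names are hypothetical.

```python
from typing import List, Optional, Sequence


def best_so_far(losses: Sequence[float]) -> List[float]:
    """Running minimum of validation loss across hyperparameter trials."""
    best = float("inf")
    curve = []
    for loss in losses:
        best = min(best, loss)
        curve.append(best)
    return curve


def trials_to_reach(curve: Sequence[float], target: float) -> Optional[int]:
    """Number of trials needed before best-so-far drops to the target, if ever."""
    for index, value in enumerate(curve, start=1):
        if value <= target:
            return index
    return None


# Made-up losses for two hypothetical tuning runs.
unregularized = [9.0, 12.0, 7.5, 8.0, 5.0]
regularized = [5.5, 6.0, 5.0, 5.2, 4.8]

print(best_so_far(unregularized))
print(best_so_far(regularized))
print(trials_to_reach(best_so_far(unregularized), 5.0),
      trials_to_reach(best_so_far(regularized), 5.0))
```

A curve that drops earlier means fewer trials are needed to reach a given loss, which is the sense in which regularization is reported to speed up tuning.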
Empirical evidence
- Synthetic experiments: show faster Q-function convergence and better early hyperparameter robustness under regularizations.
- Offline Alibaba data: regularized DDPG outperformed alternatives in routine tests.
- Real-world deployment: 100% rollout on Tmall using DDPG + regularizations; reported operational improvements (paper emphasizes deployment scale and practical viability).

Implications for AI Economics

Lowering the operational cost of ML adoption
- Embedding domain structure (here, base-stock and monotonicity with demand forecasts) materially reduces hyperparameter search and training instability, lowering labor and compute costs needed to deploy DRL in operations.
Increased reliability and interpretability
- Structured action mappings correspond to classical inventory concepts, making learned policies easier to audit, interpret, and trust—important for managerial buy‑in and regulatory/operational oversight.
Greater scalability and unified policies
- Regularizations enabled meta-learning across heterogeneous SKUs without clustering, enabling a single policy governing >1M SKU–warehouse pairs. Economically, this reduces model maintenance costs and allows platform-level optimization.
Competitive and strategic effects
- Firms that can integrate domain priors into ML workflows can more quickly and reliably operationalize RL, conferring a practical competitive advantage (faster rollout, lower cost, improved service levels).
Methodological lesson for AI in operations
- Combining classical structural insight from operations research with modern RL yields better sample efficiency and generalization than black‑box DRL alone—an approach broadly applicable to other operational domains (routing, pricing, workforce scheduling).
Trade-offs and risks
- Domain-specific inductive biases improve efficiency but risk brittleness if underlying assumptions are violated (e.g., highly stochastic lead times, unmodeled supply disruptions). Economists and practitioners should weigh gains in sample efficiency and interpretability against risks from model misspecification and data inference (e.g., inferred demands during stockouts).
Research directions
- Formalizing sample-complexity gains from these inductive biases; quantifying welfare or cost-savings at scale; exploring similar regularizations in other operational decision problems; studying robustness under richer stochastic lead-time and demand censoring models.

Summary judgement

DeepStock demonstrates a practical and effective hybrid: encoding inventory-theory structure into DRL policy parameterizations combines classical interpretability and regularized generalization with the flexibility of neural policies, yielding large practical gains (reduced tuning burden, faster convergence, and a 100% real-world rollout at Alibaba). For AI economics, the key takeaway is that domain-informed inductive biases can materially lower the cost and risk of deploying RL in production operations.

Assessment

Paper Type: descriptive
Evidence Strength: medium. The paper reports a full production deployment on Alibaba's Tmall and extensive synthetic experiments, which together provide substantive empirical evidence of improved DRL performance; however, there is no randomized or quasi-experimental identification of causal impact on business outcomes and limited detail on counterfactuals, statistical uncertainty, and potential confounders in the production rollout.
Methods Rigor: medium. The work combines theoretically motivated regularization with thorough simulation experiments and a real-world deployment, indicating solid engineering and empirical work; however, the absence of a randomized evaluation, limited reported metrics on statistical significance and robustness across heterogeneous settings, and likely reliance on proprietary system details constrain methodological rigor.
Sample: Proprietary production data from Alibaba's Tmall e-commerce platform (100% deployment of the DRL policy-regularized system across the platform) together with extensive synthetic inventory simulation experiments exploring varying demand processes, lead times, cost parameters, and DRL hyperparameter settings.
Themes: productivity, adoption
Generalizability:
- Single-platform (Alibaba Tmall) production deployment; may not generalize to other retailers or supply-chain structures.
- Unknown which product categories, demand profiles, or geographic markets were dominant; results may depend on Tmall-specific demand dynamics.
- Proprietary implementation and engineering (system integration, feature pipelines, compute) may be hard to replicate in smaller firms.
- Synthetic experiments may not capture all real-world complexities (e.g., multi-echelon networks, nonstationary demand, promotions, supplier constraints).
- Outcomes reported focus on algorithmic performance/stability rather than broad firm-level or labor-market effects.

Claims (6)

Claim	Category	Direction	Confidence	Outcome	Details
Deep Reinforcement Learning (DRL) provides a general-purpose methodology for training inventory policies that can leverage big data and compute.	Organizational Efficiency	positive	high	ability to train inventory policies using large data and compute	0.03
Off-the-shelf implementations of DRL have seen mixed success, often plagued by high sensitivity to the hyperparameters used during training.	Output Quality	negative	high	sensitivity of DRL performance to hyperparameter choices (resulting in mixed success)	0.18
Imposing policy regularizations, grounded in classical inventory concepts such as 'Base Stock', can significantly accelerate hyperparameter tuning for DRL methods.	Task Completion Time	positive	high	speed/efficiency of hyperparameter tuning	0.18
Imposing policy regularizations improves the final performance of several DRL methods for inventory management.	Output Quality	positive	high	final performance (policy quality) of DRL inventory methods	0.18
The paper reports details from a 100% deployment of DRL with policy regularizations on Alibaba's e-commerce platform, Tmall.	Adoption Rate	positive	high	deployment/adoption of the DRL-with-regularization system	0.18
Extensive synthetic experiments show that policy regularizations reshape the narrative on what is the best DRL method for inventory management.	Output Quality	mixed	high	relative performance/ranking of DRL methods for inventory management	0.18