A behaviourally informed RL model outperforms common heuristics at predicting shopper movement in a convenience store, and produces layout changes that recover the same sales gains as those suggested by real customer paths; the method offers a low-data, practical alternative to costly trajectory collection.
Understanding customer movement within retail spaces is essential for optimizing store layouts. Real-world trajectory data can provide highly accurate insights, but collecting it is costly and often infeasible for many retailers. Heuristics such as Travelling Salesman Problem (TSP) and Probabilistic Nearest Neighbours (PNN) are commonly used as inexpensive approximations, but actual customer trajectories deviate by an average of 28% from shortest paths, highlighting a tradeoff between accuracy and practicality. We propose an agent-based modelling framework that casts customer trajectory prediction as a maximum entropy reinforcement learning (RL) problem, balancing reward maximization with stochasticity to better reflect customers with bounded rationality. Using real-world trajectory data from a convenience store, we show that RL-generated trajectories align more closely with customer behaviour than TSP and PNN, providing more accurate estimates of impulse purchase rates and shelf traffic densities. Furthermore, only RL-based predictions yield repositioning decisions for impulse products that align with those derived from actual trajectory data, resulting in comparable estimated profit gains. Our work demonstrates that RL provides a practical, behaviourally grounded alternative that bridges the gap between oversimplified heuristics and data-intensive approaches, making accurate layout optimization more accessible. To encourage further research, the source code is available on GitHub.
Summary
Main Finding
Maximum-entropy reinforcement learning (MaxEnt RL) produces simulated in-store customer trajectories that align substantially better with real-world movement than commonly used heuristics (Travelling Salesman Problem and Probabilistic Nearest Neighbour). This improved behavioural fidelity yields more accurate shelf-traffic and impulse-purchase estimates and—critically—leads to layout change recommendations (single-product repositioning) whose profit gains match those produced by ground-truth trajectory data. RL therefore offers a practical, behaviourally grounded middle ground between expensive trajectory collection and oversimplified heuristics.
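For reference, the maximum-entropy objective that gives the method its name is usually written as below. This is the textbook form, not an equation taken from the paper: the symbols $r$ (reward), $\alpha$ (entropy temperature), and $T$ (episode length) are generic, and no value of $\alpha$ is reported in this summary. The sum is written without discounting, which matches the $\gamma = 1.0$ setting listed under Data & Methods.

```latex
J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[ \sum_{t=0}^{T} \Big( r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big) \right],
\qquad
\mathcal{H}\big(\pi(\cdot \mid s)\big) = -\sum_{a} \pi(a \mid s) \log \pi(a \mid s)
```

Setting $\alpha = 0$ recovers a purely reward-maximizing agent, while larger $\alpha$ favours more varied action choices, which is the sense in which the objective trades reward against stochasticity.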
Key Points
- Problem: Real customer trajectories are costly to collect; heuristics (TSP, PNN) are cheap but unrealistic (customers deviate ~28% from shortest paths).
- Proposal: Model customers as conditional MaxEnt RL agents trained with PPO to balance reward-seeking and stochasticity (bounded rationality).
- Performance: RL trajectories match human data better than TSP and PNN according to Jensen–Shannon divergence (JSD) and Wasserstein distance (WD).
  - Example numbers (trajectory heatmaps): JSD: TSP 0.657, PNN 0.580, RL 0.415; WD: TSP 0.0140, PNN 0.0120, RL 0.00800.
- Downstream impact: RL-derived trajectories produce more accurate shelf-traffic density maps and impulse-rate estimates (lower JSD/WD to ground truth than TSP/PNN), and only RL-informed product repositioning produced layout choices and estimated profits comparable to using real trajectories.
- Practicality: The RL agent is conditioned on basket, checkout location, and optional timestep budget; a digital twin / gridworld (16×36, 50 cm cells) is used for training and rollouts. Source code and demo are publicly available.
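The two divergence measures named under Performance can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' evaluation code: the function names are hypothetical, and applying the Wasserstein distance to the flattened per-cell occupancy values is an assumption, since this summary does not say how the paper applies WD to a 2D heatmap.

```python
import math


def occupancy_heatmap(trajectories, rows=16, cols=36):
    """Count cell visits over all trajectories; return a flattened map summing to 1."""
    counts = [0.0] * (rows * cols)
    for path in trajectories:
        for r, c in path:
            counts[r * cols + c] += 1.0
    total = sum(counts)
    return [x / total for x in counts] if total > 0 else counts


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint support."""
    def kl(a, b):
        # Terms with a == 0 contribute nothing; b > 0 wherever a > 0 by construction.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    mixture = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)


def wasserstein_1d(u, v):
    """1-D Wasserstein-1 distance between two equal-sized, equally weighted samples."""
    if len(u) != len(v):
        raise ValueError("samples must have the same length")
    # Assumption: treats the inputs as two samples of values, ignoring cell position.
    return sum(abs(a - b) for a, b in zip(sorted(u), sorted(v))) / len(u)
```

Calling `js_divergence(occupancy_heatmap(human_paths), occupancy_heatmap(model_paths))` gives one number per method, where each path is a list of `(row, col)` cells. Note that `wasserstein_1d` as written compares only the sets of occupancy values, so two heatmaps with the same values in different cells score 0; a position-aware variant would be needed to capture where traffic falls.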
Data & Methods
- Real-world data:
  - Source: Overhead camera array at a convenience store; anonymized 3D joint coordinates at 5 Hz with associated checkout baskets and layout metadata.
  - Preprocessed dataset: 3,054 trajectories after alignment to store boundaries, discretization to 2D grid cells, trimming, and mapping pick-ups.
  - Focus products: Top 61 best-selling items grouped into 11 categories (covering ~51% of sales); one category per shelf in the model.
- Environment:
  - Gridworld discretization matching physical store: 16×36 grid, cells = 50×50 cm.
  - Two checkout locations, walls/shelves encoded, action space = {forward, turn left, turn right, pickup/checkout}.
- RL model:
  - MaxEnt objective via PPO; convolutional backbone; conditional inputs include basket, checkout, timestep budget, visit-mask, category map, step count, and agent pose.
  - Training features: curriculum over basket size (0–5), parallelized environments, per-channel normalization, γ = 1.0 (no trajectory-length discount), bonus for unique-state visits to encourage exploration.
  - Trajectory generation: rollouts conditioned on baskets; only trajectories above a reward threshold are retained.
- Baselines:
  - TSP: compute global shortest path visiting required items (checkout appended).
  - PNN: stochastic greedy choice of next product with probability inversely proportional to distance; checkout appended.
- Evaluation:
  - Sampled 10k trajectories per method (upsampling where necessary).
  - Compared aggregated occupancy heatmaps and shelf-visit densities to human data using Jensen–Shannon divergence (JSD) and Wasserstein distance (WD).
  - Case study: single impulse-product repositioning derived from each method and evaluated by simulating customer traffic on revised layouts to estimate profit gains.
- Representative quantitative results:
  - Trajectory heatmap divergence (lower = better): average JSD: TSP 0.777, PNN 0.676, RL 0.476; average WD: TSP 0.0176, PNN 0.0142, RL 0.00920.
  - Shelf-traffic divergence (proportional sampling): JSD: TSP 0.632, PNN 0.549, RL 0.430; WD: TSP 0.313, PNN 0.278, RL 0.217.
  - Uniform sampling (comparison against uniformly sampled human trajectories): RL remains closest to human data (JSD 0.347; WD 0.00676). Human trajectories compared against other human trajectories still show nonzero divergence from sampling variance alone (JSD 0.224; WD 0.00517).
- Reproducibility: authors provide code, supplementary material, and a playable digital twin demo.
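The two baselines listed above can be sketched as follows. This is an illustrative reading of the one-line descriptions, not the authors' implementation: all function names are hypothetical, Manhattan distance stands in for true walking distance (it ignores walls and shelves), the TSP variant is brute force and includes the final leg to checkout in the path length, and the PNN weights are taken as exactly 1/distance.

```python
import itertools
import random


def manhattan(a, b):
    """Grid distance ignoring walls and shelves (a simplifying assumption)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def tsp_order(start, items, checkout, dist=manhattan):
    """Brute-force shortest visiting order; practical only for small baskets."""
    def length(order):
        stops = [start, *order, checkout]
        return sum(dist(a, b) for a, b in zip(stops, stops[1:]))

    best = min(itertools.permutations(items), key=length)
    return [*best, checkout]


def pnn_order(start, items, checkout, dist=manhattan, rng=random):
    """Pick each next item with probability proportional to 1 / distance."""
    current, remaining, order = start, list(items), []
    while remaining:
        # Guard against division by zero when the agent already stands on an item.
        weights = [1.0 / max(dist(current, item), 1e-9) for item in remaining]
        chosen = rng.choices(remaining, weights=weights, k=1)[0]
        remaining.remove(chosen)
        order.append(chosen)
        current = chosen
    return [*order, checkout]
```

Both functions return an ordered list of target cells ending at the checkout; turning that order into a cell-by-cell path would still require a shortest-path routine over the store grid. Passing `rng=random.Random(seed)` to `pnn_order` makes the sampled order reproducible.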
Implications for AI Economics
- Better behavioural priors for economic models: MaxEnt RL offers a tractable, interpretable way to generate counterfactual consumer paths that capture bounded rationality and multimodality—improving demand exposure estimates and spatially dependent purchase probabilities used in retail economics.
- Cost–accuracy tradeoff: RL can reduce the need for costly sensor deployment while preserving much of the accuracy of ground-truth trajectory-based analysis. This changes the marginal cost calculus for small-to-medium retailers deciding between investing in data collection vs. simulation.
- Improved policy and pricing decisions: More accurate shelf-traffic and impulse-rate estimates support better product-placement, promotion placement, and space-allocation decisions, and thus more reliable profit/consumer surplus calculations in layout optimization.
- Use in counterfactual and welfare analysis: RL-generated realistic trajectories enable counterfactual experiments (e.g., layout changes, routing interventions) for estimation of impacts on sales and consumer travel costs without requiring new field deployments.
- Transferability and calibration caution: Results come from one convenience-store layout and the top 61 products (grouped into 11 categories); economic conclusions require careful calibration to new stores, store formats, cultural/customer-segmentation differences, and sensitivity to conditioning variables (e.g., basket distribution). The paper shows that sampling choices (proportional vs uniform) affect divergence metrics, which matters for any economic inference.
- Computational and operational considerations: Training conditional MaxEnt RL agents requires simulation infrastructure (digital twin), compute, and expertise—imposing fixed costs that must be weighed against sensor costs. However, once trained, policies can generate many counterfactuals cheaply.
- Privacy and regulatory aspects: Using simulated behavioural models may reduce privacy risks associated with pervasive in-store tracking, but RL models trained on sensitive trajectory data still require responsible data governance.
- Broader economic modelling potential: The approach can be extended to model other spatially dependent economic behaviours (tourism flows, urban retail dynamics, labor movement in facilities), enabling richer micro-founded models of location-dependent demand and externalities.
Limitations to note (relevant for economic application):
- Single-store empirical validation; generalization to larger supermarkets or different store formats is untested.
- Product grouping (categories vs SKUs) may smooth important heterogeneity in demand and visibility effects.
- RL requires careful reward and condition design; misspecification can bias simulated behaviour and downstream economic estimates.
Overall, MaxEnt RL appears to be a promising, practical tool for producing behaviourally realistic consumer trajectories that materially improve retail-layout economic analysis compared to common heuristics, potentially lowering barriers to applying spatial-demand methods in practice.
Assessment
Claims (8)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| Actual customer trajectories deviate by an average of 28% from shortest paths. (Output Quality) | negative | high | deviation from shortest paths (trajectory difference) | 28% average deviation; 0.18 |
| Reinforcement learning (maximum entropy RL) generated trajectories align more closely with customer behaviour than Travelling Salesman Problem (TSP) and Probabilistic Nearest Neighbours (PNN) heuristics. (Output Quality) | positive | high | alignment/fit between model-generated trajectories and observed customer trajectories | 0.18 |
| RL-based trajectories provide more accurate estimates of impulse purchase rates and shelf traffic densities than TSP and PNN. (Firm Revenue) | positive | medium | accuracy of estimated impulse purchase rates and shelf traffic densities | 0.11 |
| Only RL-based predictions yield product-repositioning decisions for impulse products that align with those derived from actual trajectory data, resulting in comparable estimated profit gains. (Firm Revenue) | positive | medium | alignment of repositioning decisions and estimated profit gains from repositioning | comparable estimated profit gains; 0.11 |
| Casting customer trajectory prediction as a maximum entropy RL problem balances reward maximization with stochasticity to better reflect customers with bounded rationality. (Output Quality) | positive | high | behavioral realism / stochasticity of modelled trajectories | 0.03 |
| Real-world trajectory data can provide highly accurate insights but collecting it is costly and often infeasible for many retailers. (Adoption Rate) | negative | high | feasibility/cost of collecting real-world trajectory data | 0.09 |
| Heuristics such as TSP and PNN are commonly used as inexpensive approximations for customer trajectories. (Adoption Rate) | null_result | high | use of heuristic methods (TSP, PNN) for trajectory approximation | 0.09 |
| The authors provide source code for their framework on GitHub to encourage further research. (Other) | null_result | high | availability of implementation/source code | 0.3 |