A behaviourally informed RL model outperforms common heuristics at predicting shopper movement in a convenience store, and produces layout changes that recover the same sales gains as those suggested by real customer paths; the method offers a low-data, practical alternative to costly trajectory collection.

Modelling Customer Trajectories with Reinforcement Learning for Practical Retail Insights

Ken Ming Lee, Paul Barde, Maxime C. Cohen, Derek Nowrouzezahrai · May 18, 2026

arXiv · descriptive · medium evidence · 7/10 relevance

A maximum-entropy reinforcement-learning agent-based model generates customer trajectories that more closely match observed paths than TSP and PNN heuristics and produces product-placement recommendations that replicate profit gains estimated from real trajectory data.

Understanding customer movement within retail spaces is essential for optimizing store layouts. Real-world trajectory data can provide highly accurate insights, but collecting it is costly and often infeasible for many retailers. Heuristics such as Travelling Salesman Problem (TSP) and Probabilistic Nearest Neighbours (PNN) are commonly used as inexpensive approximations, but actual customer trajectories deviate by an average of 28% from shortest paths, highlighting a tradeoff between accuracy and practicality. We propose an agent-based modelling framework that casts customer trajectory prediction as a maximum entropy reinforcement learning (RL) problem, balancing reward maximization with stochasticity to better reflect customers with bounded rationality. Using real-world trajectory data from a convenience store, we show that RL-generated trajectories align more closely with customer behaviour than TSP and PNN, providing more accurate estimates of impulse purchase rates and shelf traffic densities. Furthermore, only RL-based predictions yield repositioning decisions for impulse products that align with those derived from actual trajectory data, resulting in comparable estimated profit gains. Our work demonstrates that RL provides a practical, behaviourally grounded alternative that bridges the gap between oversimplified heuristics and data-intensive approaches, making accurate layout optimization more accessible. To encourage further research, the source code is available on GitHub.

Summary

Main Finding

Maximum-entropy reinforcement learning (MaxEnt RL) produces simulated in-store customer trajectories that align substantially better with real-world movement than commonly used heuristics (Travelling Salesman Problem and Probabilistic Nearest Neighbour). This improved behavioural fidelity yields more accurate shelf-traffic and impulse-purchase estimates and—critically—leads to layout change recommendations (single-product repositioning) whose profit gains match those produced by ground-truth trajectory data. RL therefore offers a practical, behaviourally grounded middle ground between expensive trajectory collection and oversimplified heuristics.

Key Points

Problem: Real customer trajectories are costly to collect; heuristics (TSP, PNN) are cheap but unrealistic (customers deviate ~28% from shortest paths).
Proposal: Model customers as conditional MaxEnt RL agents trained with PPO to balance reward-seeking and stochasticity (bounded rationality).
Performance: RL trajectories match human data better than TSP and PNN according to Jensen–Shannon divergence (JSD) and Wasserstein distance (WD).
- Example numbers (trajectory heatmaps): JSD — TSP 0.657, PNN 0.580, RL 0.415; WD — TSP 0.0140, PNN 0.0120, RL 0.00800.
Downstream impact: RL-derived trajectories produce more accurate shelf-traffic density maps and impulse-rate estimates (lower JSD/WD to ground truth than TSP/PNN), and only RL-informed product repositioning produced layout choices and estimated profits comparable to using real trajectories.
Practicality: The RL agent is conditioned on basket, checkout location, and optional timestep budget; a digital twin / gridworld (16×36, 50 cm cells) is used for training and rollouts. Source code and demo are publicly available.

Data & Methods

Real-world data:
- Source: Overhead camera array at a convenience store; anonymized 3D joint coordinates at 5 Hz with associated checkout baskets and layout metadata.
- Preprocessed dataset: 3,054 trajectories after alignment to store boundaries, discretization to 2D grid cells, trimming, and mapping pick-ups.
- Focus products: Top 61 best-selling items grouped into 11 categories (cover ~51% of sales); one category per shelf in the model.
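The discretization step above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the 50 cm cell size and 16×36 grid come from the summary, while the function names, the grid origin, the row/column orientation, and the choice to drop consecutive repeated cells are assumptions made here.

```python
import math

CELL_M = 0.5          # 50 cm cells, as stated in the summary
ROWS, COLS = 16, 36   # grid size from the summary; which axis is "rows" is assumed


def to_cell(x_m, y_m, origin=(0.0, 0.0)):
    """Map a floor position in metres to a (row, col) cell, or None if outside the grid."""
    col = math.floor((x_m - origin[0]) / CELL_M)
    row = math.floor((y_m - origin[1]) / CELL_M)
    if 0 <= row < ROWS and 0 <= col < COLS:
        return (row, col)
    return None


def discretize(points):
    """Turn (x, y) samples into a cell path, skipping out-of-bounds samples.

    Consecutive repeats of the same cell are collapsed (an assumption; keeping
    them would preserve dwell time at the 5 Hz sampling rate instead).
    """
    path = []
    for x, y in points:
        cell = to_cell(x, y)
        if cell is not None and (not path or path[-1] != cell):
            path.append(cell)
    return path
```

For example, `discretize([(0.1, 0.1), (0.2, 0.3), (0.6, 0.1)])` collapses the first two samples into one cell and then moves one column over.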
Environment:
- Gridworld discretization matching physical store: 16×36 grid, cells = 50×50 cm.
- Two checkout locations, walls/shelves encoded, action space = {forward, turn left, turn right, pickup/checkout}.
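A toy version of the movement part of this environment might look like the sketch below. The action names and grid size follow the summary; the heading encoding, the `step` function, the `blocked` set, and the rule that a blocked move leaves the agent in place are assumptions for illustration. The pickup/checkout action is left out because its semantics are not described here.

```python
# Headings as (d_row, d_col): north, east, south, west (encoding assumed).
HEADINGS = [(-1, 0), (0, 1), (1, 0), (0, -1)]


def step(state, action, blocked, rows=16, cols=36):
    """Apply one movement action to a (row, col, heading) state and return the new state."""
    row, col, h = state
    if action == "turn_left":
        return (row, col, (h - 1) % 4)
    if action == "turn_right":
        return (row, col, (h + 1) % 4)
    if action == "forward":
        d_row, d_col = HEADINGS[h]
        n_row, n_col = row + d_row, col + d_col
        inside = 0 <= n_row < rows and 0 <= n_col < cols
        if inside and (n_row, n_col) not in blocked:
            return (n_row, n_col, h)
        return state  # walls and shelves stop the agent (assumed behaviour)
    raise ValueError(f"unknown action: {action}")
```

Because the agent turns in place and only moves forward, its pose includes a heading, which is consistent with "agent pose" being one of the conditional inputs listed for the model.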
RL model:
- MaxEnt objective via PPO; convolutional neural backbone; conditional inputs include basket, checkout, timestep budget, visit-mask, category map, step count, and agent pose.
- Training features: curriculum over basket size (0–5), parallelized environments, per-channel normalization, γ = 1.0 (no trajectory-length discount), bonus for unique-state visits to encourage exploration.
- Trajectory generation: rollouts conditioned on baskets; only trajectories above a reward threshold retained.
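For orientation, the textbook form of a conditional maximum-entropy objective is written out below. The symbols are notation introduced here rather than taken from the paper: $c$ stands for the conditioning inputs (basket, checkout, timestep budget), $\alpha$ for the entropy weight, and $T$ for the episode length. The paper's exact reward terms and entropy weighting are not reproduced in this summary.

```latex
J(\pi) = \mathbb{E}_{\tau \sim \pi(\cdot \mid c)}
  \left[ \sum_{t=0}^{T} \gamma^{t}
    \Big( r(s_t, a_t, c) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t, c)\big) \Big)
  \right],
\qquad
\mathcal{H}\big(\pi(\cdot \mid s, c)\big) = -\sum_{a} \pi(a \mid s, c)\, \log \pi(a \mid s, c)
```

With $\gamma = 1.0$ as listed above, the factor $\gamma^{t}$ equals 1 at every step, so early and late rewards count equally. A larger $\alpha$ pushes the policy toward more random, exploratory paths, while $\alpha \to 0$ recovers a purely reward-maximizing agent.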
Baselines:
- TSP: compute global shortest path visiting required items (checkout appended).
- PNN: stochastic greedy choice of next product with probability inversely proportional to distance; checkout appended.
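The two baselines can be sketched in a few lines. This is an illustrative version under stated assumptions: distances are Manhattan distances that ignore shelves, the TSP order is found by brute force (practical only for small baskets), and the TSP path length includes the final leg to checkout. The function names are hypothetical.

```python
import itertools
import random


def manhattan(a, b):
    """Grid distance that ignores obstacles (an assumption for this sketch)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def tsp_order(start, items, checkout, dist=manhattan):
    """Brute-force the item order with the shortest total length; checkout is appended."""
    def length(order):
        stops = [start, *order, checkout]
        return sum(dist(p, q) for p, q in zip(stops, stops[1:]))

    best = min(itertools.permutations(items), key=length)
    return list(best) + [checkout]


def pnn_order(start, items, checkout, rng, dist=manhattan):
    """Pick each next item with probability proportional to 1/distance; checkout is appended."""
    here, remaining, order = start, list(items), []
    while remaining:
        # Guard against division by zero when an item sits on the current cell.
        weights = [1.0 / max(dist(here, p), 1e-9) for p in remaining]
        nxt = rng.choices(remaining, weights=weights, k=1)[0]
        remaining.remove(nxt)
        order.append(nxt)
        here = nxt
    return order + [checkout]
```

Calling `pnn_order((0, 0), items, checkout, random.Random(0))` repeatedly with different seeds gives different visiting orders, whereas `tsp_order` is deterministic. Both return only the order of stops; a full baseline would still need a path planner between consecutive stops.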
Evaluation:
- Sampled 10k trajectories per method (upsampling where necessary).
- Compared aggregated occupancy heatmaps and shelf-visit densities to human data using Jensen–Shannon divergence (JSD) and Wasserstein Distance (WD).
- Case study: single impulse-product repositioning derived from each method and evaluated by simulating customer traffic on revised layouts to estimate profit gains.
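The heatmap comparison above can be illustrated with a small pure-Python sketch. It aggregates cell visits into a normalised occupancy distribution and computes the Jensen–Shannon divergence between two such distributions. Base-2 logarithms are used so the value lies in [0, 1]; the summary does not state which base or normalisation the authors used, so this is an assumption. The Wasserstein distance is omitted because its ground metric is not specified here. Function names are hypothetical.

```python
import math
from collections import Counter


def occupancy(trajectories):
    """Aggregate cell visits over all trajectories into a normalised distribution."""
    counts = Counter(cell for path in trajectories for cell in path)
    total = sum(counts.values())
    return {cell: n / total for cell, n in counts.items()}


def jsd(p, q):
    """Jensen-Shannon divergence between two dict distributions (base-2 logs)."""
    keys = set(p) | set(q)
    # Mixture distribution M = (P + Q) / 2 over the union of supports.
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        # KL(a || m); terms with a[k] == 0 contribute nothing.
        return sum(a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)
```

Identical heatmaps give a divergence of 0 and heatmaps with no shared cells give 1, which provides a scale for reading the reported values.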
Representative quantitative results:
- Trajectory heatmap divergence (lower = better): average JSD — TSP 0.777, PNN 0.676, RL 0.476; average WD — TSP 0.0176, PNN 0.0142, RL 0.00920.
- Shelf-traffic divergence (proportional sampling): JSD — TSP 0.632, PNN 0.549, RL 0.430; WD — TSP 0.313, PNN 0.278, RL 0.217.
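- Uniform sampling (compared against uniformly sampled human trajectories): RL remains the method closest to human data (JSD 0.347; WD 0.00676). For reference, two samples of human trajectories compared with each other already differ (JSD 0.224; WD 0.00517), which indicates how much divergence is attributable to sampling variance alone.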
Reproducibility: authors provide code, supplementary material, and a playable digital twin demo.

Implications for AI Economics

Better behavioural priors for economic models: MaxEnt RL offers a tractable, interpretable way to generate counterfactual consumer paths that capture bounded rationality and multimodality—improving demand exposure estimates and spatially dependent purchase probabilities used in retail economics.
Cost–accuracy tradeoff: RL can reduce the need for costly sensor deployment while preserving much of the accuracy of ground-truth trajectory-based analysis. This changes the marginal cost calculus for small-to-medium retailers deciding between investing in data collection vs. simulation.
Improved policy and pricing decisions: More accurate shelf-traffic and impulse-rate estimates support better product-placement, promotion placement, and space-allocation decisions, and thus more reliable profit/consumer surplus calculations in layout optimization.
Use in counterfactual and welfare analysis: RL-generated realistic trajectories enable counterfactual experiments (e.g., layout changes, routing interventions) for estimation of impacts on sales and consumer travel costs without requiring new field deployments.
Transferability and calibration caution: Results are from one convenience-store layout and the top 61 best-selling products (grouped into 11 categories); economic conclusions require careful calibration to new stores, store formats, cultural/customer-segmentation differences, and sensitivity to conditioning variables (e.g., basket distribution). The paper shows that sampling choices (proportional vs uniform) affect divergence metrics, which matters for any economic inference.
Computational and operational considerations: Training conditional MaxEnt RL agents requires simulation infrastructure (digital twin), compute, and expertise—imposing fixed costs that must be weighed against sensor costs. However, once trained, policies can generate many counterfactuals cheaply.
Privacy and regulatory aspects: Using simulated behavioural models may reduce privacy risks associated with pervasive in-store tracking, but RL models trained on sensitive trajectory data still require responsible data governance.
Broader economic modeling potential: The approach can be extended to model other spatially-dependent economic behaviours (tourism flows, urban retail dynamics, labor movement in facilities), enabling richer micro-founded models of location-dependent demand and externalities.

Limitations to note (relevant for economic application):
- Single-store empirical validation; generalization to larger supermarkets or different store formats is untested.
- Product grouping (categories vs SKUs) may smooth important heterogeneity in demand and visibility effects.
- RL requires careful reward and condition design; misspecification can bias simulated behaviour and downstream economic estimates.

Overall, MaxEnt RL appears to be a promising, practical tool for producing behaviourally realistic consumer trajectories that materially improve retail-layout economic analysis compared to common heuristics, potentially lowering barriers to applying spatial-demand methods in practice.

Assessment

Paper Type: descriptive
Evidence Strength: medium. Uses real-world trajectory data and directly compares model outputs to observed behaviour and to simple heuristics, providing credible evidence that the RL approach improves predictive accuracy and yields similar layout recommendations; however, evidence is limited by reliance on a single convenience-store dataset, unclear sample size/time coverage, absence of randomized or quasi-experimental variation, and possible overfitting/tuning to the specific environment.
Methods Rigor: medium. The paper develops a principled maximum-entropy RL framework and evaluates it against standard heuristics using empirical trajectories, and shares code; but methodological concerns remain about parameter calibration, robustness checks, sensitivity to hyperparameters and environment specification, external validity across store types and scales, and limited detail (in the provided summary) on training/validation splits and statistical uncertainty of comparisons.
Sample: An anonymized real-world customer trajectory dataset collected from a single convenience store (position/time traces and purchase records), which was used to train and test the agent-based maximum-entropy RL model and to compute baseline metrics and layout-repositioning benchmarks; exact sample size, time span, and sensor setup are not specified in the summary.
Themes: org_design, productivity
Identification: No formal causal identification; evaluation is based on out-of-sample predictive performance (fit of generated trajectories to observed trajectories), comparison of downstream metrics (impulse purchase rates, shelf traffic densities) computed from simulated vs observed trajectories, and whether layout/repositioning recommendations derived from RL match those derived from actual trajectory data.
Generalizability:
- Single-store convenience-store dataset may not represent other retail formats (supermarkets, big-box, specialty stores).
- Local customer demographics, culture, and shopping habits may limit transferability to other regions or countries.
- Store layout, product assortment, and shelf sizes differ across retailers, potentially requiring re-training/recalibration.
- Sensor technologies and data quality vary; approach may depend on data frequency/accuracy available.
- Model performance may change with store footfall volume and shopper heterogeneity (e.g., planned vs impulse shopping).
- Scalability to larger or multi-floor environments and to interacting shoppers is untested.

Claims (8)

Claim	Category	Direction	Confidence	Outcome	Details	Score
Actual customer trajectories deviate by an average of 28% from shortest paths.	Output Quality	negative	high	deviation from shortest paths (trajectory difference)	28% average deviation	0.18
Reinforcement learning (maximum entropy RL) generated trajectories align more closely with customer behaviour than Travelling Salesman Problem (TSP) and Probabilistic Nearest Neighbours (PNN) heuristics.	Output Quality	positive	high	alignment/fit between model-generated trajectories and observed customer trajectories		0.18
RL-based trajectories provide more accurate estimates of impulse purchase rates and shelf traffic densities than TSP and PNN.	Firm Revenue	positive	medium	accuracy of estimated impulse purchase rates and shelf traffic densities		0.11
Only RL-based predictions yield product-repositioning decisions for impulse products that align with those derived from actual trajectory data, resulting in comparable estimated profit gains.	Firm Revenue	positive	medium	alignment of repositioning decisions and estimated profit gains from repositioning	comparable estimated profit gains	0.11
Casting customer trajectory prediction as a maximum entropy RL problem balances reward maximization with stochasticity to better reflect customers with bounded rationality.	Output Quality	positive	high	behavioral realism / stochasticity of modelled trajectories		0.03
Real-world trajectory data can provide highly accurate insights but collecting it is costly and often infeasible for many retailers.	Adoption Rate	negative	high	feasibility/cost of collecting real-world trajectory data		0.09
Heuristics such as TSP and PNN are commonly used as inexpensive approximations for customer trajectories.	Adoption Rate	null_result	high	use of heuristic methods (TSP, PNN) for trajectory approximation		0.09
The authors provide source code for their framework on GitHub to encourage further research.	Other	null_result	high	availability of implementation/source code		0.3