RetailBench: Evaluating Long-Horizon Autonomous Decision-Making and Strategy Stability of LLM Agents in Realistic Retail Environments

Large Language Model (LLM)-based agents have achieved notable success on short-horizon and highly structured tasks. However, their ability to maintain coherent decision-making over long horizons in realistic and dynamic environments remains an open challenge. We introduce RetailBench, a high-fidelity benchmark designed to evaluate long-horizon autonomous decision-making in realistic commercial scenarios, where agents must operate under stochastic demand and evolving external conditions. We further propose the Evolving Strategy & Execution framework, which separates high-level strategic reasoning from low-level action execution. This design enables adaptive and interpretable strategy evolution over time. It is particularly important for long-horizon tasks, where non-stationary environments and error accumulation require strategies to be revised at a different temporal scale than action execution. Experiments on eight state-of-the-art LLMs across progressively challenging environments show that our framework improves operational stability and efficiency compared to other baselines. However, performance degrades substantially as task complexity increases, revealing fundamental limitations in current LLMs for long-horizon, multi-factor decision-making.

Summary

Main Finding

The paper introduces RetailBench, a high-fidelity supermarket MDP benchmark for testing long-horizon autonomous decision-making by LLM-based agents, and proposes the Evolving Strategy & Execution (ESE) framework that separates day-level strategic updates from intra-day action execution. ESE improves operational stability and efficiency relative to reflection- and plan-based baselines, but all evaluated LLMs still perform substantially worse than a hand-crafted heuristic and degrade sharply as environment complexity rises—revealing fundamental limitations of current LLM agents for high-dimensional, long-horizon economic decision tasks.

Key Points

RetailBench: a realistic, multi-component supermarket environment (products, inventory, suppliers, demand signals, news, finances) modeled as an MDP with long horizons (episodes up to >1000 days; termination on five consecutive rent defaults).
Action decomposition: price adjustments, replenishment (supplier selection & quantities), information queries, memory ops, and end-of-day transitions.
Three environment difficulties:
- Easy: 5 categories, no dynamic news or supplier adaptation; budget 10,000, rent 250.
- Middle: 20 categories, no news; budget 50,000, rent 1,000.
- Hard: 20 categories, dynamic news (20 items/day) and time-varying supplier price–quality relationships.
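The three difficulty levels can be written down as a small configuration table. The sketch below is illustrative only: the class and field names are hypothetical, and any value the summary does not state (the Hard budget and rent, and whether Middle uses supplier adaptation) is left as `None` rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DifficultyConfig:
    """Hypothetical container for one RetailBench difficulty level."""
    num_categories: int
    dynamic_news: bool
    news_items_per_day: int
    # None means "not stated in the summary", not "disabled".
    adaptive_suppliers: Optional[bool]
    initial_budget: Optional[float]
    rent: Optional[float]


DIFFICULTIES = {
    # Easy: 5 categories, no dynamic news or supplier adaptation.
    "easy": DifficultyConfig(5, False, 0, False, 10_000, 250),
    # Middle: 20 categories, no news; supplier adaptation is not stated.
    "middle": DifficultyConfig(20, False, 0, None, 50_000, 1_000),
    # Hard: 20 categories, 20 news items/day, time-varying suppliers;
    # budget and rent are not stated in the summary.
    "hard": DifficultyConfig(20, True, 20, True, None, None),
}

for name, cfg in DIFFICULTIES.items():
    print(name, cfg)
```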
Evolving Strategy & Execution (ESE) framework:
- Two-stage loop: (1) Evolving Strategy — inspect history and possibly update a persistent macro strategy at day granularity; (2) Execution — keep strategy fixed and produce intra-day actions consistent with it.
- Hierarchical policy: Macro Strategy → Execution Strategy (machine-readable) → Daily Actions.
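The two-stage loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `ESEAgent`, `demo_evolve`, and `demo_execute` are hypothetical names, the two callables stand in for LLM calls, and the three-level hierarchy (Macro Strategy, Execution Strategy, Daily Actions) is collapsed into a single strategy string for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ESEAgent:
    # Hypothetical stand-ins for LLM calls.
    evolve: Callable[[str, List[dict]], str]   # (strategy, history) -> strategy
    execute: Callable[[str, dict], List[str]]  # (strategy, observation) -> actions
    strategy: str = "initial strategy"
    history: List[dict] = field(default_factory=list)

    def run_day(self, observation: dict) -> List[str]:
        # Stage 1 (Evolving Strategy): inspect history and possibly
        # update the persistent strategy, once per day.
        self.strategy = self.evolve(self.strategy, self.history)
        # Stage 2 (Execution): the strategy stays fixed while the
        # intra-day actions are produced.
        frozen = self.strategy
        actions = self.execute(frozen, observation)
        self.history.append(
            {"strategy": frozen, "observation": observation, "actions": actions}
        )
        return actions


def demo_evolve(strategy: str, history: List[dict]) -> str:
    # Toy rule: revise the strategy only if yesterday had expired stock.
    if history and history[-1]["observation"].get("expired_units", 0) > 0:
        return "reduce perishable orders"
    return strategy


def demo_execute(strategy: str, observation: dict) -> List[str]:
    # Toy executor: emit one action tagged with the strategy, then end the day.
    return [f"apply:{strategy}", "end_day"]


agent = ESEAgent(demo_evolve, demo_execute)
print(agent.run_day({"expired_units": 3}))
print(agent.run_day({"expired_units": 0}))
```

The point of the separation is that `evolve` runs once per day while `execute` never modifies the strategy, so any strategy change is visible in the history as a discrete, inspectable event.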
Baselines: step-level Reflection, day-level Reflection, Plan-and-Act, plus a hand-crafted heuristic as an upper bound.
Models evaluated: 8 contemporary LLMs (Qwen-235B, Kimi K2, GLM-4.6, DeepSeek-V3.2-Exp, Gemini-3-Flash-Preview, Grok-4.1 Fast, GPT-5-Mini, GPT-5.2), using lower-cost variants where applicable.
Metrics: operating Days, MaxDays, Avg Daily Sales, Avg Daily Income, Expiry Ratio, Return Ratio — averaged over three rollouts per configuration.
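Averaging over three rollouts per configuration could look like the sketch below. The metric keys and the rollout numbers are made up for illustration; the summary does not give per-rollout values or the exact definition of each metric, and MaxDays is omitted because its definition is not stated.

```python
from statistics import mean
from typing import Dict, List


def average_rollouts(rollouts: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each metric across rollouts of one (model, framework, difficulty)."""
    if not rollouts:
        raise ValueError("need at least one rollout")
    keys = set(rollouts[0])
    if any(set(r) != keys for r in rollouts):
        raise ValueError("all rollouts must report the same metrics")
    return {k: mean(r[k] for r in rollouts) for k in sorted(keys)}


# Hypothetical per-rollout metrics, NOT taken from the paper.
example_rollouts = [
    {"days": 50, "avg_daily_sales": 280.0, "expiry_ratio": 0.05},
    {"days": 60, "avg_daily_sales": 300.0, "expiry_ratio": 0.06},
    {"days": 70, "avg_daily_sales": 320.0, "expiry_ratio": 0.07},
]

averaged = average_rollouts(example_rollouts)
print(averaged)
```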

Representative empirical findings (Easy environment, three models summarized in Table 1):
- ESE (average of GLM-4.6, Kimi-K2, GPT-5.2): Avg Days ≈ 62.6; Avg Daily Sales ≈ 297.4; Avg Daily Income ≈ 217.2; Expiry Ratio ≈ 0.0557.
- Day-level Reflection (average): Avg Days ≈ 59.1; Sales ≈ 220.4; Income ≈ 190.2; Expiry ≈ 0.0977.
- Plan-and-Act (average): Avg Days ≈ 53.7; Sales ≈ 241.8; Income ≈ 137.1; Expiry ≈ 0.0117.
- Heuristic upper bound (hand-crafted): Days = 180; Sales ≈ 674; Income ≈ 729 (much higher than any LLM).
- Larger models (e.g., GPT-5.2) and models with larger context windows tend to perform better, but still fall far short of the heuristic.
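To make the gap to the heuristic concrete, the Easy-environment figures above can be expressed as a fraction of the upper bound. The snippet is a simple arithmetic sketch using only the numbers quoted in this summary; the dictionary and function names are illustrative.

```python
# Figures copied from the Easy-environment summary (three-model ESE average).
ese = {"days": 62.6, "sales": 297.4, "income": 217.2}
heuristic = {"days": 180.0, "sales": 674.0, "income": 729.0}


def fraction_of_bound(agent: dict, bound: dict) -> dict:
    """Return each agent metric as a fraction of the heuristic's value."""
    return {k: agent[k] / bound[k] for k in agent}


ratios = fraction_of_bound(ese, heuristic)
for name, value in ratios.items():
    print(f"{name}: {value:.1%}")
```

On these figures, the ESE average reaches roughly 35% of the heuristic's operating days, 44% of its daily sales, and 30% of its daily income. The heuristic uses privileged internal state, so the comparison is an approximate ceiling rather than a like-for-like baseline.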

Identified failure modes:
- Non-scalable decision-making: models do not proportionally expand coverage of SKUs/categories as environment size grows.
- Limited information coverage: agents over-rely on certain signals (supplier price, inventory, historical sales) and underuse others (recent reviews, return rates, current prices), hurting decisions.
- Hallucinations and economically irrational behaviors: these lead to strategy drift and environment collapse over long horizons.
- Performance degrades with environment complexity (Middle→Hard), though some impacts (e.g., news dynamics) manifest with delay.

Data & Methods

Environment construction:
- State components: product-level attributes & historical sales (Dominick’s dataset), inventory, supply-chain state (price–quality relations informed by Grewal et al. 2014), demand signals (customer traffic, reviews), exogenous news (synthesized), financial state (cash, net worth, depreciation).
- Transition pipeline per day: customer traffic sampling → sales realization → reviews & returns → inventory update → financial & exogenous update.
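The per-day transition order can be illustrated with a toy single-product simulator. Everything quantitative here is an assumption made for the sketch (uniform traffic, fixed purchase and return probabilities, rent charged every day, returned units refunded and not restocked); only the step order and the five-consecutive-defaults termination rule come from the summary.

```python
import random
from dataclasses import dataclass

MAX_CONSECUTIVE_DEFAULTS = 5  # termination rule stated in the summary


@dataclass
class StoreState:
    cash: float
    inventory: int
    price: float
    rent: float
    consecutive_defaults: int = 0
    day: int = 0


def step_day(state: StoreState, rng: random.Random, mean_traffic: int = 20,
             buy_prob: float = 0.3, return_prob: float = 0.05) -> bool:
    """Advance one day in place; return True if the episode terminates."""
    # 1. Customer traffic sampling (assumed uniform around the mean).
    traffic = rng.randint(mean_traffic // 2, mean_traffic * 3 // 2)
    # 2. Sales realization, capped by available stock.
    demand = sum(rng.random() < buy_prob for _ in range(traffic))
    sold = min(demand, state.inventory)
    # 3. Reviews & returns (only returns are modeled in this sketch).
    returned = sum(rng.random() < return_prob for _ in range(sold))
    # 4. Inventory update (returned units are assumed unsellable).
    state.inventory -= sold
    # 5. Financial update: revenue net of refunds, then rent.
    state.cash += (sold - returned) * state.price
    if state.cash >= state.rent:
        state.cash -= state.rent
        state.consecutive_defaults = 0
    else:
        state.consecutive_defaults += 1
    state.day += 1
    return state.consecutive_defaults >= MAX_CONSECUTIVE_DEFAULTS


# A store with no stock and no cash defaults every day and ends on day 5.
store = StoreState(cash=0.0, inventory=0, price=2.0, rent=250.0)
rng = random.Random(0)
while not step_day(store, rng):
    pass
print("terminated on day", store.day)
```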
Experimental protocol:
- Three difficulty levels (Easy/Middle/Hard), three independent rollouts per model per setting, metrics averaged.
- Comparisons across four agent frameworks (ESE, day- and step-level Reflection, Plan-and-Act).
- Hand-crafted heuristic with privileged internal state used as approximate upper bound.
- Prompt/specification details and policy representations provided in appendices (authors note token cost constraints required lower-cost variants for some closed-source models).
Analyses:
- Quantitative performance metrics, SKU/category coverage, frequency of querying different information sources per SKU, and manual inspection of trajectories to identify failure modes.
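The coverage and query-frequency analyses could be computed from an action log along these lines. The log schema (`type`, `sku`, `source` keys) and both function names are hypothetical; the paper's actual logging format is not described in this summary.

```python
from collections import Counter
from typing import Dict, List, Set


def sku_coverage(action_log: List[dict], all_skus: Set[str]) -> float:
    """Fraction of available SKUs touched by at least one logged action."""
    touched = {a["sku"] for a in action_log if a["sku"] in all_skus}
    return len(touched) / len(all_skus)


def query_frequency(action_log: List[dict]) -> Dict[str, float]:
    """Average number of queries to each information source per touched SKU."""
    n_skus = len({a["sku"] for a in action_log})
    counts = Counter(a["source"] for a in action_log if a["type"] == "query")
    return {src: c / n_skus for src, c in counts.items()}


# Hypothetical log for illustration only.
log = [
    {"type": "query", "sku": "A", "source": "supplier_price"},
    {"type": "query", "sku": "A", "source": "inventory"},
    {"type": "query", "sku": "B", "source": "supplier_price"},
    {"type": "price", "sku": "A", "source": None},
]

print(sku_coverage(log, {"A", "B", "C", "D"}))
print(query_frequency(log))
```

A coverage value that stays flat or falls as the catalog grows from 5 to 20 categories would correspond to the "non-scalable decision-making" failure mode, and a skewed query-frequency dictionary to "limited information coverage".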

Implications for AI Economics

Limits on autonomous economic agents: Despite their reasoning and tool-augmented capabilities, current LLM agents are not yet reliable enough to manage complex, long-horizon economic operations autonomously, as the retail setting shows. They struggle with scalable multi-product allocation, sustained strategy coherence, and full utilization of heterogeneous information.
Importance of strategic stability: ESE’s explicit separation of strategy evolution (lower-frequency updates) and fixed execution reduces oscillations and improves operational stability. For economic automation, architectural constraints that preserve persistent objectives and prevent short-term overfitting are crucial.
Role of model capacity and context window: Larger models and larger context windows improve performance, indicating that economic decision tasks with wide state histories demand more context capacity. This has cost and access implications for firms deploying such systems.
Information incompleteness and externalities: Agents’ tendency to ignore some signals (reviews, returns) can produce economically irrational behaviors, suggesting that system design must ensure access and incentives to use relevant micro- and macro-level signals to avoid negative externalities (stock waste, returns, market shocks).
Human oversight and policy: Given hallucinations and irrational strategy drift seen in long runs, human-in-the-loop approaches and regulatory safeguards remain important for real-world deployment in retail and other economic systems to mitigate systemic risks.
Benchmarking and evaluation: RetailBench provides a realistic benchmark for long-horizon economic autonomy, enabling standardized evaluation of agents’ strategic stability and multi-factor decision-making—useful for both academic progress and industry vetting.
Research directions with economic importance:
- Better hierarchical and modular agent designs that combine economic models (inventory theory, demand estimation) with LLM reasoning.
- Methods to scale decision coverage across high-dimensional action spaces (e.g., attention/selection mechanisms, learned priors over SKU importance).
- Mechanisms to ensure coverage of economically critical signals (structured querying, information incentives).
- Cost-aware deployment studies: trade-offs between computational/context capacity costs and operational gains in revenue/stability.
Practical caution: Organizations seeking to automate retail operations should temper expectations, validate agents against long-horizon benchmarks like RetailBench, and plan for human oversight, model updates, and instrumentation to detect strategy drift or hallucinations before full automation.


Assessment

Paper Type: descriptive
Evidence Strength: medium. The paper provides systematic simulation evidence across eight contemporary LLMs and multiple progressively harder environments showing consistent gains from the proposed ESE architecture, but all results come from synthetic, high-fidelity benchmarks rather than field or natural-experiment data, and numeric effect sizes/statistical uncertainty are not presented in the provided summary.
Methods Rigor: medium. Experimental design is sound in principle (multiple models, baseline comparisons, and stress-test environments that probe non-stationarity and long horizons), but key details are omitted (exact environment specifications, statistical tests/CI, ablations, and deployment constraints), and evaluation is limited to simulated settings without human-in-the-loop or real-world validation.
Sample: RetailBench, a suite of high-fidelity simulated commercial operations environments (demand-driven inventory/operations) with stochastic demand processes and evolving external conditions; progressively harder tasks to stress long-horizon adaptation and non-stationarity. Eight state-of-the-art LLMs evaluated as agents, compared to monolithic LLM agents and other baseline agent architectures; metrics include operational stability (failure frequency/variance), efficiency (cost/profit/fulfillment), and performance degradation as complexity/non-stationarity increases.
Themes: productivity, adoption, human_ai_collab, governance
Generalizability:
- Simulation-to-reality gap: synthetic environments may not capture the full complexity, noise, and unmodelled dependencies of live commercial operations.
- Limited model coverage: results reflect the specific LLMs tested and may not transfer to newer or substantially different models.
- Domain specificity: benchmark focuses on retail/supply-chain-style decision problems and may not generalize to other sectors or multi-actor market settings.
- No human-in-the-loop or organizational constraints: experiments omit real workflows, human oversight practices, and institutional constraints that affect deployment.
- Sensitivity to environment design and reward engineering: performance could depend on simulation parameterization and objective formulations not representative of all real businesses.

Claims (13)

Claim	Direction	Confidence	Outcome	Details
The paper introduces RetailBench, a high-fidelity long-horizon benchmark for realistic commercial decision-making under stochastic demand and evolving external conditions (non-stationarity). Other	null_result	high	benchmark realism and coverage of non-stationarity for long-horizon decision-making	Introduces RetailBench: high‑fidelity long‑horizon benchmark with stochastic demand and non‑stationarity 0.18
RetailBench environments are progressively challenging to stress-test adaptation and planning capabilities (i.e., environments increase in complexity, stochasticity, and non-stationarity). Other	null_result	high	environment difficulty gradient (complexity/stochasticity/non-stationarity levels)	RetailBench environments are progressively more challenging (increasing complexity, stochasticity, non‑stationarity) 0.18
The paper proposes Evolving Strategy & Execution (ESE), a two-tier architecture that separates high-level strategic reasoning (updated at a slower temporal scale) from low-level execution (short-term action selection). Organizational Efficiency	null_result	high	agent architectural modularity (temporal decomposition into strategy vs execution)	Proposes Evolving Strategy & Execution (ESE): separates high‑level strategic reasoning (slow) from low‑level execution (fast) 0.18
ESE enables interpretable and adaptive strategy updates intended to counteract error accumulation and environmental drift. Decision Quality	positive	medium	interpretability of strategy updates and reduction in error accumulation/strategy drift	ESE enables interpretable and adaptive strategy updates intended to reduce error accumulation and counteract environmental drift 0.11
ESE improves operational stability and efficiency relative to baselines that do not separate strategy from execution. Organizational Efficiency	positive	medium	operational stability (variance/frequency of catastrophic failures) and efficiency (cost/profit/fulfillment metrics)	ESE improves operational stability and efficiency relative to baselines (reported in experiments) 0.11
Agent performance degrades markedly as environment complexity, stochasticity, and non-stationarity increase, revealing core limitations of current LLM-based agents for long-horizon, multi-factor decision problems. Decision Quality	negative	high	overall agent performance across increasing environment complexity (e.g., fulfillment rates, costs, cumulative performance)	Agent performance degrades markedly as environment complexity, stochasticity, and non‑stationarity increase 0.18
Key observed failure modes include error accumulation over long horizons, inability to revise strategy adequately under evolving external conditions, and sensitivity to multi-factor interactions. Decision Quality	negative	medium	frequency and impact of specific failure modes (error accumulation, failed strategy revisions, sensitivity to multi-factor dependencies)	Observed failure modes: long‑horizon error accumulation, insufficient strategy revision, sensitivity to multi‑factor interactions 0.11
Eight state-of-the-art LLMs were evaluated in the study. Other	null_result	high	number of LLMs evaluated (n = 8)	n=8 Eight state‑of‑the‑art LLMs evaluated 0.18
Baselines used in comparisons include monolithic LLM agents and other existing agent architectures that do not implement explicit strategy/execution separation. Other	null_result	high	baseline agent architectures used for comparison	Baselines include monolithic LLM agents and architectures without explicit strategy/execution separation 0.18
Metrics used to evaluate agents include operational stability (e.g., variance or frequency of catastrophic failures), efficiency (e.g., cost/profit/fulfillment), and degradation across increasing task complexity. Organizational Efficiency	null_result	high	operational stability, efficiency, and robustness/degradation metrics	Metrics used: operational stability (variance/frequency of catastrophic failures), efficiency (cost/profit/fulfillment), degradation across complexity 0.18
Modular strategy/execution architectures (like ESE) can materially improve the stability and efficiency of LLM-driven operational decision systems, increasing their attractiveness for deployment in retail, logistics, and supply-chain contexts. Organizational Efficiency	positive	medium	operational stability and efficiency improvements as proxies for deployment attractiveness	Modular strategy/execution architectures (like ESE) can materially improve stability and efficiency, increasing deployment attractiveness in retail/logistics/supply‑chain 0.11
Despite improvements from ESE, current LLM-based agents are not robust enough for fully autonomous long-horizon management in complex, non-stationary commercial environments; human oversight and hybrid systems remain necessary. Decision Quality	negative	medium	robustness to long-horizon non-stationary environments (qualitative and performance-based)	0.11
Recommended research priorities include hierarchical/temporal-decomposition methods, continual learning, robust adaptation to non-stationarity, and causal/structured reasoning to handle multi-factor interactions. Research Productivity	null_result	speculative	suggested research directions to improve robustness (proposed, not empirically validated)	0.02