A modular agentic AI system, Flowr, cut manual coordination and tightened demand-supply alignment in a supermarket-chain pilot by automating decision-intensive supply-chain tasks while keeping managers in the loop; findings are promising but rest on a single-firm deployment without rigorous causal controls.

Flowr -- Scaling Up Retail Supply Chain Operations Through Agentic AI in Large Scale Supermarket Chains

Eranga Bandara, Ross Gore, Sachin Shetty, Piumi Siyambalapitiya, Sachini Rajapakse, Isurunima Kularathna, Pramoda Karunarathna, Ravi Mukkamala, Peter Foytik, Safdar H. Bouk, Abdul Rahman, Xueping Liang, Amin Hass, Tharaka Hewa, Ng Wee Keong, Kasun De Zoysa, Aruna Withanage, Nilaan Loganathan · April 07, 2026

arXiv · descriptive · low evidence · relevance 8/10

Flowr is an agentic AI framework that decomposes retail supply-chain work into specialized AI agents coordinated with human-in-the-loop oversight, and a corporate pilot reportedly reduced manual coordination, improved demand-supply alignment, and enabled proactive exception handling.

Retail supply chain operations in supermarket chains involve continuous, high-volume manual workflows spanning demand forecasting, procurement, supplier coordination, and inventory replenishment, processes that are repetitive, decision-intensive, and difficult to scale without significant human effort. Despite growing investment in data analytics, the decision-making and coordination layers of these workflows remain predominantly manual, reactive, and fragmented across outlets, distribution centers, and supplier networks. This paper introduces Flowr, a novel agentic AI framework for automating end-to-end retail supply chain workflows in large-scale supermarket operations. Flowr systematically decomposes manual supply chain operations into specialized AI agents, each responsible for a clearly defined cognitive role, enabling automation of processes previously dependent on continuous human coordination. To ensure task accuracy and adherence to responsible AI principles, the framework employs a consortium of fine-tuned, domain-specialized large language models coordinated by a central reasoning LLM. Central to the framework is a human-in-the-loop orchestration model in which supply chain managers supervise and intervene across workflow stages via a Model Context Protocol (MCP)-enabled interface, preserving accountability and organizational control. Evaluation demonstrates that Flowr significantly reduces manual coordination overhead, improves demand-supply alignment, and enables proactive exception handling at a scale unachievable through manual processes. The framework was validated in collaboration with a large-scale supermarket chain and is domain-independent, offering a generalizable blueprint for agentic AI-driven supply chain automation across large-scale enterprise settings.

Summary

Main Finding

Flowr is an agentic-AI framework that decomposes end-to-end retail replenishment workflows into a network of specialized, LLM-powered agents coordinated by a central reasoning LLM and a Model Context Protocol (MCP) orchestration layer. In a proof-of-concept with a large supermarket chain, the authors report that Flowr reduced manual coordination overhead, improved demand–supply alignment across outlets, and enabled proactive exception handling while preserving human oversight and explainability.

Key Points

Architecture: Multi-agent design where each agent has a clearly defined cognitive role (demand sensing/forecasting, inventory monitoring, procurement, supplier coordination, distribution-center replenishment, exception handling).
LLM Consortium: Domain-specialized, fine-tuned LLMs (via LoRA/QLoRA) produce candidate outputs; a central reasoning LLM (GPT-OSS in the paper) synthesizes and validates outputs to reduce model-specific bias and hallucination.
Human-in-the-loop: Human supervisors oversee checkpoints via an MCP-enabled interface; no high‑stakes action executes without explicit human approval.
Integration standard: Model Context Protocol (MCP) exposes agent workflows as modular servers, enabling secure, interoperable access to inventory systems, supplier APIs, and human orchestration through a unified UI (LM Studio).
Responsible & explainable AI: Structured outputs include reasoning traces and supplier/allocation rationales; interaction logs and approval gates enable auditability.
Deployment approach: Fine-tuning on curated retail datasets, parameter-efficient adapters (LoRA), 4-bit quantization (QLoRA), and local inference via Ollama to support air‑gapped or low-latency environments.
Claimed outcomes: Significant reductions in manual coordination effort, better alignment of supply to heterogeneous outlet demand, and scalable proactive exception handling. The paper describes a real-world proof-of-concept, but the excerpt summarized here contains no detailed numeric results.
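
The consortium pattern in the key points above can be illustrated with a small sketch. It is not the paper's implementation: the paper's central reasoner is an LLM, whereas this example substitutes a deterministic median rule so that the control flow (several specialist proposals, one synthesized output, outliers flagged, human approval always required) is easy to follow. The `Candidate` class, the `synthesize` function, the model names, and the 25% tolerance are all hypothetical.

```python
from dataclasses import dataclass
from statistics import median
from typing import Dict, List


@dataclass
class Candidate:
    """One specialist model's proposal (all field names are hypothetical)."""
    model: str       # which specialist produced the proposal
    order_qty: int   # proposed replenishment quantity
    rationale: str   # short reasoning trace kept for explainability


def synthesize(candidates: List[Candidate], tolerance: float = 0.25) -> Dict:
    """Stand-in for the central reasoning step.

    ASSUMPTION: the paper uses an LLM here; this sketch substitutes a
    simple median rule to show the shape of the synthesis step.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    consensus = median(c.order_qty for c in candidates)
    # Flag specialists that deviate from the consensus by more than `tolerance`.
    outliers = [
        c.model for c in candidates
        if consensus and abs(c.order_qty - consensus) / consensus > tolerance
    ]
    return {
        "order_qty": round(consensus),
        "outliers": outliers,
        "rationales": {c.model: c.rationale for c in candidates},
        "needs_human_approval": True,  # nothing executes without sign-off
    }


# Illustrative proposals from three hypothetical specialist models.
proposals = [
    Candidate("demand_model", 100, "trailing 4-week sales average"),
    Candidate("inventory_model", 110, "safety stock below threshold"),
    Candidate("promo_model", 200, "assumes an unconfirmed promotion"),
]
result = synthesize(proposals)
```

Here the outlying proposal is flagged rather than silently averaged in, which mirrors the stated goal of reducing model-specific bias while leaving the final decision to a human supervisor.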

Data & Methods

Data used for fine-tuning and evaluation:
- Historical sales/POS records (outlet-level, time series)
- Inventory records across outlets and distribution centers
- Procurement/order histories and supplier interaction logs
- External signals (seasonality, market disruptions) — as available
Model stack and training:
- Base LLMs considered: Llama-3, Mistral, Qwen (examples cited)
- Fine-tuning: supervised fine-tuning on domain dataset; LoRA adapters for parameter efficiency; QLoRA (4-bit) for memory reduction
- Ensemble/consortium: multiple fine-tuned domain models + central reasoning LLM (GPT-OSS) for synthesis/verification
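
To make the parameter-efficiency argument concrete, the following sketch works through the standard LoRA arithmetic for a single weight matrix: a rank-r adapter on a d_out × d_in matrix trains r × (d_in + d_out) parameters instead of d_in × d_out, and storing the frozen base weights in 4 bits rather than 16 cuts their memory by roughly a factor of four (ignoring quantization metadata). The 4096 × 4096 layer size and rank 16 are illustrative assumptions, not values from the paper.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters of a rank-`rank` LoRA adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


def weight_bytes(n_params: int, bits: int) -> int:
    """Storage for `n_params` weights at `bits` bits each (quantization metadata ignored)."""
    return n_params * bits // 8


# ASSUMPTION: one hypothetical 4096 x 4096 projection matrix, adapter rank 16.
d_in, d_out, rank = 4096, 4096, 16

full_params = d_in * d_out                              # 16,777,216 frozen weights
adapter_params = lora_trainable_params(d_in, d_out, rank)  # 131,072 trainable weights
trainable_fraction = adapter_params / full_params       # 1/128 of the full matrix

fp16_bytes = weight_bytes(full_params, 16)  # base weights at 16-bit precision
int4_bytes = weight_bytes(full_params, 4)   # same weights stored in 4 bits
```

For this hypothetical layer the adapter trains under 1% of the weights, which is the kind of saving that makes per-domain fine-tuning of several specialist models plausible on modest hardware.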
Deployment & integration:
- Models deployed locally (Ollama) for low latency and air‑gapped settings
- Each agent exposed via MCP server for modular access to enterprise systems/APIs
- Operators interact through LM Studio UI; agents call external systems (inventory DBs, supplier APIs) via MCP endpoints
Workflow & validation:
- Agents autonomously generate and iterate on actions (e.g., proposed purchase orders, supplier selections, DC allocations); outputs carry a rationale and are routed to human supervisors at defined gates
- Structured logging and reasoning traces recorded for audit and offline learning
Evaluation:
- Proof-of-concept implemented with a large supermarket chain; reported qualitative and operational improvements (reduced coordination overhead, improved alignment, proactive exception handling)
- The excerpt does not provide quantitative metrics (e.g., % reduction in manual hours, inventory turns, stockout rates, or cost savings). The paper references Section 6 for evaluation results.
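
The approval gates and audit logging described under "Workflow & validation" can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: `ProposedAction`, `ApprovalGate`, and every field name are hypothetical, and a real deployment would sit behind the MCP interface rather than an in-memory list.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class ProposedAction:
    """Structured agent output (hypothetical schema)."""
    agent: str
    action: str
    payload: dict
    rationale: str            # reasoning trace shown to the supervisor
    high_stakes: bool = True  # high-stakes actions need explicit approval


class ApprovalGate:
    """Blocks high-stakes actions until a named human approves them."""

    def __init__(self) -> None:
        self.audit_log: List[str] = []  # JSON lines; stand-in for durable storage

    def submit(self, proposal: ProposedAction,
               approver: Optional[str], approved: bool) -> bool:
        # Execute only if the action is low-stakes or a named human approved it.
        executed = bool(
            (not proposal.high_stakes) or (approved and approver is not None)
        )
        record = {**asdict(proposal), "approver": approver,
                  "approved": approved, "executed": executed}
        # Every decision is logged, including rejections, for later audit.
        self.audit_log.append(json.dumps(record, sort_keys=True))
        return executed


gate = ApprovalGate()
po = ProposedAction(
    agent="procurement",
    action="create_purchase_order",
    payload={"sku": "A1", "qty": 120},
    rationale="forecast demand exceeds on-hand stock",
)
blocked = gate.submit(po, approver=None, approved=False)
released = gate.submit(po, approver="manager_1", approved=True)
```

Logging rejected proposals alongside approved ones is a deliberate choice in this sketch: it keeps the record usable both for audit and for the offline learning the summary mentions.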

Implications for AI Economics

Labor and task reallocation:
- Substitution of routine coordination and repetitive decision tasks by agents; human roles shift toward oversight, exception handling, strategic planning, and governance.
- Potential short‑term displacement in coordination roles, offset by demand for higher-skill supervisors, data/AI governance staff, and integration engineers.
Productivity and cost structure:
- Lower transaction and coordination costs across distributed outlets; potential reductions in stockouts, overstocks, and perishable waste (improves inventory efficiency and reduces working-capital needs).
- One-off and ongoing costs: model fine-tuning, secure local deployment, MCP integration, and governance infrastructure—capital investments that may favor larger chains (scale economies).
Market structure and bargaining:
- Improved supplier coordination could strengthen retail bargaining and enable more efficient multi-echelon allocation; conversely, large adopters may gain a competitive edge, raising barriers to entry and potentially concentrating market power.
- Suppliers may face pressure to automate their side of negotiations and integrate APIs, shifting surplus.
Measurement and valuation:
- Traditional productivity metrics may undercount quality improvements (service-level gains, reduced spoilage). Economists should track inventory turns, fill rates, spoilage rates, labor hours by task, and procurement cycle times to value effects.
- Returns to scale: benefits likely increase with outlet count, supplier breadth, and historical data richness—favoring incumbents with larger datasets.
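
The operational metrics named above have conventional definitions that are simple to compute once the underlying data are available. The sketch below uses common textbook forms (turns as cost of goods sold over average inventory value, fill rate as units fulfilled over units ordered, spoilage as units spoiled over units received); the paper may define them differently, and all figures are invented for illustration.

```python
def inventory_turns(cogs: float, avg_inventory_value: float) -> float:
    """Cost of goods sold divided by average inventory value over the same period."""
    return cogs / avg_inventory_value


def fill_rate(units_fulfilled: int, units_ordered: int) -> float:
    """Share of ordered units that were actually delivered."""
    return units_fulfilled / units_ordered


def spoilage_rate(units_spoiled: int, units_received: int) -> float:
    """Share of received perishable units written off as spoiled."""
    return units_spoiled / units_received


# ASSUMPTION: invented figures for one hypothetical outlet and period.
turns = inventory_turns(cogs=1_200_000, avg_inventory_value=150_000)
fill = fill_rate(units_fulfilled=940, units_ordered=1000)
spoilage = spoilage_rate(units_spoiled=30, units_received=1000)
```

Tracking these per outlet before and after deployment would give the quantitative basis that the excerpt currently lacks.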
Risk, governance, and externalities:
- Model errors, data biases, and coordination failures carry operational and financial risk; the human-in-the-loop design mitigates but does not eliminate these risks.
- Investment in governance, audit trails, and explainability imposes ongoing costs but is necessary to manage regulatory, contractual, and reputational risks.
Generalizability and diffusion:
- Flowr’s domain-independent blueprint suggests broad applicability to other high-volume, distributed enterprise workflows (pharma distribution, manufacturing MRP, foodservice supply chains), implying potential for cross-industry productivity gains.
- Adoption depends on integration costs, data readiness, regulatory environment (privacy, procurement rules), and labor market adjustments.
Research and policy implications:
- Need for causal evaluation (RCTs, difference-in-differences) to quantify impacts on employment, supplier welfare, retail margins, consumer prices, and resilience to shocks.
- Policymakers should consider retraining programs, standards for auditability/explainability, and competition policy to monitor concentration effects.
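
As an illustration of the causal evaluation called for above, the following sketch computes a basic two-group, two-period difference-in-differences estimate: the change in treated outlets minus the change in comparison outlets. The numbers are fabricated for illustration (weekly manual coordination hours per outlet) and do not come from the paper; a credible study would also need parallel-trends checks and standard errors, which this sketch omits.

```python
from statistics import mean
from typing import Sequence


def did_estimate(treated_pre: Sequence[float], treated_post: Sequence[float],
                 control_pre: Sequence[float], control_post: Sequence[float]) -> float:
    """(treated post - treated pre) - (control post - control pre), using group means."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# ASSUMPTION: fabricated weekly coordination hours for three outlets per group.
treated_pre = [40, 42, 38]    # outlets that later adopt the system
treated_post = [28, 30, 26]
control_pre = [41, 43, 39]    # comparable outlets that never adopt
control_post = [38, 40, 36]

effect = did_estimate(treated_pre, treated_post, control_pre, control_post)
```

In this made-up example the raw before-after drop in treated outlets is 12 hours, but 3 hours of that also appears in the comparison group, so the estimate attributes only 9 hours to the intervention. That gap is exactly what a simple before-after pilot comparison cannot separate.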

Limitations and open questions (economically important):
- Granular quantitative results are missing from the provided excerpt; the causal magnitude of productivity gains and labor effects requires fuller empirical reporting.
- Dependence on high-quality, integrated data and on-premise compute may raise entry costs and create lock-in.
- Potential second-order effects (price pass-through to consumers, supplier margin compression, changes in upstream investment) remain to be measured.


Assessment

Paper Type: descriptive
Evidence Strength: low. Claims are based on an operational validation with a single corporate partner and descriptive performance improvements; the paper does not report a counterfactual, randomization, pre-registered outcomes, or robust controls for confounders, so causal attribution to Flowr is weak and vulnerable to selection, implementation, and concurrent-change biases.
Methods Rigor: low. The methodology is primarily systems design plus a case-study evaluation: sample size, deployment scope, time horizon, metrics, measurement protocols, and statistical tests are not clearly specified; there is no experimental or quasi-experimental identification strategy and limited detail on robustness checks or sensitivity analyses.
Sample: Validation conducted in collaboration with a single large-scale supermarket chain; the paper reports results from a production pilot across the firm's supply chain operations (outlets, distribution centers, supplier coordination), but specific counts of stores/DCs, time period, and quantitative sample sizes are not reported.
Themes: productivity, human_ai_collab
Identification: No formal causal identification reported; evaluation appears to rely on a pilot/case-study deployment with before-after and operational comparisons within a partner supermarket chain rather than randomized assignment, difference-in-differences, instrumental variables, or other quasi-experimental controls.
Generalizability:
- Single-firm case study limits external validity to other retailers or industries
- Proprietary systems, data quality, and engineering integration at the partner chain may not generalize
- Unreported selection and implementation choices (which sites/processes were chosen) create selection bias
- Short-term pilot effects may not persist long-run (learning, vendor lock-in, maintenance costs)
- Outcomes focus on operational metrics (coordination overhead, demand-supply alignment), not economy-wide labor or wage effects
- Context-specific supplier relationships, regulation, and labor practices may alter transferability

Claims (11)

Claim	Category	Direction	Confidence	Outcome	Details
Retail supply chain operations in supermarket chains involve continuous, high-volume manual workflows spanning demand forecasting, procurement, supplier coordination, and inventory replenishment.	Automation Exposure	negative	high	degree of manual operations / automation exposure	0.09
Despite growing investment in data analytics, the decision-making and coordination layers of these workflows remain predominantly manual, reactive, and fragmented across outlets, distribution centers, and supplier networks.	Organizational Efficiency	negative	high	degree of manual decision-making and coordination (fragmentation/reactivity)	0.09
This paper introduces Flowr, a novel agentic AI framework for automating end-to-end retail supply chain workflows in large-scale supermarket operations.	Task Allocation	positive	high	ability to automate end-to-end supply chain workflows (task allocation to AI)	0.03
Flowr systematically decomposes manual supply chain operations into specialized AI agents, each responsible for a clearly defined cognitive role, enabling automation of processes previously dependent on continuous human coordination.	Task Allocation	positive	high	task decomposition and automation of previously human-coordinated processes	0.03
To ensure task accuracy and adherence to responsible AI principles, the framework employs a consortium of fine-tuned, domain-specialized large language models coordinated by a central reasoning LLM.	Decision Quality	positive	high	task accuracy and adherence to responsible AI principles	0.03
Central to the framework is a human-in-the-loop orchestration model in which supply chain managers supervise and intervene across workflow stages via a Model Context Protocol (MCP)-enabled interface, preserving accountability and organizational control.	Organizational Efficiency	positive	high	preservation of accountability and organizational control during automation	0.03
Evaluation demonstrates that Flowr significantly reduces manual coordination overhead.	Organizational Efficiency	positive	high	manual coordination overhead (effort/time/coordination burden)	0.18
Evaluation shows Flowr improves demand–supply alignment.	Firm Productivity	positive	high	demand–supply alignment	0.18
Evaluation indicates Flowr enables proactive exception handling at a scale unachievable through manual processes.	Organizational Efficiency	positive	high	proactive exception handling capability and scale	0.18
The framework was validated in collaboration with a large-scale supermarket chain.	Adoption Rate	positive	high	field validation / real-world deployment	n=1 0.09
Flowr is domain-independent, offering a generalizable blueprint for agentic AI-driven supply chain automation across large-scale enterprise settings.	Adoption Rate	positive	high	generalizability / applicability across domains	0.09