The Commonplace

A probe-then-plan LLM system deployed at JD.com lifts relevant recall and conversion, producing measurable increases in gross merchandise value; lightweight retrieval probes let the planner ground decisions in live inventory without incurring prohibitive latency.

Probe-then-Plan: Environment-Aware Planning for Industrial E-commerce Search
Mengxiang Chen, Zhouwei Zhai, Jin Li · March 16, 2026
arXiv · RCT · medium evidence · 8/10 relevance
EASP, a probe-then-plan environment-aware LLM architecture, uses lightweight retrieval probes, teacher-synthesized training data, and SFT+RL alignment to improve search relevance and raises UCVR and GMV in JD.com's production A/B tests while meeting latency constraints.

Modern e-commerce search is evolving to resolve complex user intents. While Large Language Models (LLMs) offer strong reasoning, existing LLM-based paradigms face a fundamental blindness-latency dilemma: query rewriting is agnostic to retrieval capabilities and real-time inventory, yielding invalid plans; conversely, deep search agents rely on iterative tool calls and reflection, incurring seconds of latency incompatible with industrial sub-second budgets. To resolve this conflict, we propose Environment-Aware Search Planning (EASP), reformulating search planning as a dynamic reasoning process grounded in environmental reality. EASP introduces a Probe-then-Plan mechanism: a lightweight Retrieval Probe exposes the retrieval snapshot, enabling the Planner to diagnose execution gaps and generate grounded search plans. The methodology comprises three stages: (1) Offline Data Synthesis: A Teacher Agent synthesizes diverse, execution-validated plans by diagnosing the probed environment. (2) Planner Training and Alignment: The Planner is initialized via Supervised Fine-Tuning (SFT) to internalize diagnostic capabilities, then aligned with business outcomes (conversion rate) via Reinforcement Learning (RL). (3) Adaptive Online Serving: A complexity-aware routing mechanism selectively activates planning for complex queries, ensuring optimal resource allocation. Extensive offline evaluations and online A/B testing on JD.com demonstrate that EASP significantly improves relevant recall and achieves substantial lifts in UCVR and GMV. EASP has been successfully deployed in JD.com's AI-Search system.

Summary

Main Finding

Environment-Aware Search Planning (EASP) resolves the blindness-latency dilemma in LLM-based e-commerce search by grounding planning in a real-time retrieval snapshot. Using a lightweight "Probe-then-Plan" workflow, EASP produces execution-valid search plans with sub-second serving latency. Offline evaluations and online A/B testing at JD.com show substantial increases in relevant recall and meaningful lifts in user conversion (UCVR) and gross merchandise value (GMV). EASP has been deployed in JD.com's production AI-Search system.

Key Points

  • Problem: Existing LLM paradigms face a blindness-latency tradeoff
    • Query rewriting is blind to retrieval capabilities and live inventory, producing plans that cannot be executed.
    • Deep search agents that call retrieval tools iteratively add seconds of latency, which is incompatible with industrial sub-second budgets.
  • Solution: Environment-Aware Search Planning (EASP)
    • Reformulates search planning as dynamic reasoning grounded in an explicitly probed environment.
    • Probe-then-Plan mechanism:
      • Retrieval Probe: a lightweight call that returns a retrieval snapshot (what the current retrieval stack and inventory would yield).
      • Planner: consumes that snapshot, diagnoses execution gaps, and outputs grounded, executable search plans.
  • System lifecycle (three stages):
    • Offline Data Synthesis: A Teacher Agent diagnoses probed environments and synthesizes diverse, execution-validated plans to build training data.
    • Planner Training & Alignment: Planner is first trained by supervised fine-tuning (SFT) to learn diagnostic planning, then aligned to business objectives (conversion) via reinforcement learning.
    • Adaptive Online Serving: Complexity-aware routing selectively invokes the Planner only for complex queries to balance effectiveness and latency/cost.
  • Production impact: Improves relevant recall and drives higher conversion and GMV in A/B tests; successfully integrated into JD.com’s AI-Search pipeline.
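
The Probe-then-Plan flow described above can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the in-memory INVENTORY, the term-matching probe, and the rule-based plan_from_snapshot stand in for JD.com's retrieval stack and the trained LLM Planner, neither of which is specified in this summary. The point is only the control flow: one cheap probe, then a plan derived from what the probe exposed, with no further tool calls.

```python
from dataclasses import dataclass

# Hypothetical in-memory inventory standing in for a live index.
INVENTORY = [
    "waterproof hiking boots",
    "trail running shoes",
    "insulated winter boots",
]


@dataclass
class Snapshot:
    """What a single probe exposes about the retrieval environment."""
    query: str
    titles: list[str]          # items that match every query term
    term_hits: dict[str, int]  # per-term count of matching items


@dataclass
class Plan:
    """A grounded plan: the queries to execute and the diagnosis behind them."""
    queries: list[str]
    diagnosis: str


def retrieval_probe(query: str) -> Snapshot:
    """Toy probe: exact term matching against the hypothetical inventory."""
    terms = query.lower().split()
    tokenized = [title.split() for title in INVENTORY]
    term_hits = {t: sum(t in words for words in tokenized) for t in terms}
    titles = [
        title
        for title, words in zip(INVENTORY, tokenized)
        if all(t in words for t in terms)
    ]
    return Snapshot(query, titles, term_hits)


def plan_from_snapshot(snapshot: Snapshot) -> Plan:
    """Toy planner: diagnose the gap using only the snapshot (no extra probes)."""
    if snapshot.titles:
        return Plan([snapshot.query], "no execution gap")
    missing = [t for t, n in snapshot.term_hits.items() if n == 0]
    kept = [t for t, n in snapshot.term_hits.items() if n > 0]
    if missing:
        # Some terms have no inventory at all: drop them and keep the rest.
        queries = [" ".join(kept)] if kept else []
        return Plan(queries, "no inventory for: " + ", ".join(missing))
    # Every term exists somewhere, but never together: search each separately.
    return Plan(kept, "terms never co-occur in one item")


snapshot = retrieval_probe("waterproof hiking sandals")
plan = plan_from_snapshot(snapshot)
print(plan.queries, "|", plan.diagnosis)
```

A blind query rewriter would have no way to know that "sandals" is unserviceable; here the plan is constrained by the snapshot, which is the property the summary calls execution validity.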

Data & Methods

  • Inputs and signals:
    • Lightweight retrieval probe snapshots capturing what the retrieval system would return given the live index and inventory constraints.
    • Query logs and contextual signals used to detect complexity and routing decisions.
  • Offline data synthesis:
    • A Teacher Agent inspects probe snapshots and generates multiple candidate, execution-validated search plans. These plans reflect real retrieval constraints to avoid infeasible instructions.
  • Model training:
    • Supervised Fine-Tuning (SFT) on teacher-generated, execution-validated plans to teach diagnostic reasoning and grounded plan generation.
    • Reinforcement Learning (RL) to align planner outputs with business metrics (explicitly using conversion/UCVR as the objective signal).
  • Serving architecture:
    • Probe is implemented as a lightweight retrieval call to minimize added latency.
    • Complexity-aware routing classifier decides which queries warrant full planning; lower-complexity queries use cheaper heuristics to preserve sub-second response times.
  • Evaluation:
    • Offline metrics: relevant recall and execution validity of generated plans.
    • Online metrics: UCVR (user conversion rate) and GMV measured via A/B testing on the JD.com platform.
  • Deployment constraints:
    • Must respect stringent latency budgets; selective invocation and a minimal probe footprint enable production viability.
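
The complexity-aware routing step can be sketched as a small gate in front of the Planner. The summary does not say how the paper's router scores complexity, so the heuristic below (query length plus a hand-picked set of constraint words, with a hypothetical threshold of 0.4) is purely an assumption for illustration; a production router would more plausibly be a trained classifier.

```python
# Hypothetical cue words suggesting a constrained, multi-part intent.
CONSTRAINT_WORDS = {"under", "for", "without", "with", "like", "similar"}


def complexity_score(query: str) -> float:
    """Return a score in [0, 1]; higher means the query looks more complex."""
    tokens = query.lower().split()
    length_part = min(len(tokens) / 8.0, 1.0)
    constraint_hits = sum(t in CONSTRAINT_WORDS for t in tokens)
    constraint_part = min(constraint_hits / 2.0, 1.0)
    return 0.5 * length_part + 0.5 * constraint_part


def route(query: str, threshold: float = 0.4) -> str:
    """Send complex queries to the planner, everything else to the cheap path."""
    return "planner" if complexity_score(query) >= threshold else "fast_path"


for q in ["iphone 15", "gift for dad who likes fishing under 200 yuan"]:
    print(f"{q!r} -> {route(q)}")
```

The threshold is the knob that trades coverage of hard queries against added latency and compute: lowering it routes more traffic through the probe and Planner.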

Implications for AI Economics

  • Platform-level revenue and efficiency:
    • Environment-aware planning improves matching quality, increasing conversion and GMV—directly affecting platform revenue and marketplace liquidity.
    • Reducing invalid or infeasible plans decreases wasted impressions and improves the return on ranking/retrieval computation.
  • Cost–benefit and resource allocation:
    • Complexity-aware routing demonstrates that selective use of expensive AI planning can yield high marginal returns while keeping operational costs and latency low.
    • The Probe-then-Plan paradigm pays a small upfront retrieval cost to prevent larger downstream inefficiencies.
  • Incentives and market dynamics:
    • Better execution-valid recommendations change seller exposure and demand patterns; this can shift competitive dynamics and advertising auction equilibria.
    • Aligning planners with conversion metrics via RL optimizes for platform objectives but raises questions about externalities (e.g., promotion bias toward higher-margin items).
  • Generalizability and policy considerations:
    • The EASP approach is applicable to other marketplaces and real-time decision systems that require grounding plans in live environments (inventory, bids, constraints).
    • Economic evaluation should include second-order effects (seller strategy, fairness, search neutrality) and measure consumer surplus, not just platform GMV.
  • Research takeaway:
    • Incorporating environment probes into reasoning pipelines is a high-leverage design pattern: it reduces model blind spots at low latency cost and enables principled alignment of model behavior with economic objectives.
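
The cost argument for selective routing comes down to simple arithmetic, sketched below. The model assumes that the probe and Planner run sequentially and only for routed queries, and all millisecond figures are hypothetical placeholders rather than numbers reported for EASP.

```python
def expected_latency_ms(
    base_ms: float,
    probe_ms: float,
    planner_ms: float,
    planned_fraction: float,
) -> float:
    """Mean latency when only a fraction of queries pays for probe + planner."""
    if not 0.0 <= planned_fraction <= 1.0:
        raise ValueError("planned_fraction must be between 0 and 1")
    return base_ms + planned_fraction * (probe_ms + planner_ms)


# Hypothetical figures: 100 ms baseline search, 20 ms probe, 300 ms planner.
for fraction in (0.0, 0.1, 1.0):
    mean = expected_latency_ms(100.0, 20.0, 300.0, fraction)
    print(f"planned fraction {fraction:.0%}: mean latency {mean:.0f} ms")
```

Under these made-up numbers, routing 10% of traffic to the Planner adds about 32 ms to the mean, while planning for every query would add 320 ms. A mean hides the tail, so a real latency budget would also need to check the routed queries' worst case.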

Assessment

Paper Type: rct

Evidence Strength: medium. The study includes production A/B tests measuring platform-level outcomes (UCVR, GMV), which is the strongest lever for causal claims, but the paper as summarized lacks important experimental details (randomization protocol, sample sizes, duration, balance checks, significance reporting, and spillover controls). Offline synthesized data and RL alignment add supportive evidence but introduce risks of overfitting or reward specification bias.

Methods Rigor: medium. The engineering and methodological pipeline is sophisticated (probe-then-plan architecture, teacher-synthesized execution-validated training data, SFT plus RL fine-tuning, and complexity-aware routing), but the summary does not report key methodological diagnostics (robustness checks, sensitivity analyses, ablations isolating probe vs planner vs routing effects, or metrics for latency/throughput tradeoffs), limiting reproducibility assessment.

Sample: Production JD.com search traffic and logs (queries, retrieval snapshots, inventory state); offline datasets synthesized by a Teacher Agent that validates execution of plans against retrieval snapshots; online A/B test populations of JD.com users/requests (scale and period not specified); optimization uses conversion (UCVR) as reward signal for RL alignment.

Themes: productivity, adoption

Identification: Causal impact on business outcomes is identified via online randomized A/B testing in production (users/requests routed between EASP and baseline); supporting evidence comes from offline evaluation on synthesized, execution-validated datasets produced by a Teacher Agent and counterfactual-style comparisons of retrieval/recall metrics. Planner performance is also improved via supervised fine-tuning and reinforcement learning aligned to conversion rate.

Generalizability:
  • Single-platform evidence (JD.com): results may not transfer to other e-commerce platforms with different inventory dynamics or user behavior.
  • Language and locale specificity (Chinese marketplace) may limit transfer to other languages/markets.
  • Dependent on JD.com's retrieval stack, latency budget, and routing infrastructure: other systems may have different cost/benefit tradeoffs.
  • RL alignment to platform-specific conversion metrics may overfit to JD's business objectives and incentives.
  • Reported improvements are short-to-medium term; long-run effects (user learning, seller responses, fairness) are not addressed.
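
The missing significance reporting noted above can be made concrete. For a conversion metric such as UCVR, one standard check is a pooled two-proportion z-test on the control and treatment arms; the sketch below uses only the Python standard library. The counts are invented for illustration and are not JD.com's results, and the test assumes independent users with no spillover between arms.

```python
import math


def conversion_lift_test(
    conv_control: int, n_control: int, conv_treat: int, n_treat: int
) -> tuple[float, float, float]:
    """Return (relative lift, z statistic, two-sided p-value)."""
    p_control = conv_control / n_control
    p_treat = conv_treat / n_treat
    pooled = (conv_control + conv_treat) / (n_control + n_treat)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_control + 1.0 / n_treat))
    if se == 0.0:
        raise ValueError("standard error is zero; the test is undefined")
    z = (p_treat - p_control) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    relative_lift = (p_treat - p_control) / p_control
    return relative_lift, z, p_value


# Hypothetical arms: 100,000 users each, 5.0% vs 5.2% conversion.
lift, z, p = conversion_lift_test(5000, 100_000, 5200, 100_000)
print(f"relative lift {lift:.1%}, z = {z:.2f}, p = {p:.3f}")
```

Even a 4% relative lift on 100,000 users per arm only just clears the conventional 0.05 threshold in this made-up example, which is why sample size and test duration matter for interpreting the reported UCVR and GMV gains.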

Claims (9)

  • Claim: Environment-Aware Search Planning (EASP) resolves the blindness-latency dilemma in LLM-based e-commerce search by grounding planning in the real retrieval environment via a Probe-then-Plan mechanism.
    • Category: Other · Direction: positive · Confidence: medium (0.36)
    • Outcome: ability to produce environment-grounded search plans that address execution gaps (qualitative/behavioral outcome)
  • Claim: The Probe-then-Plan mechanism uses a lightweight Retrieval Probe to expose the retrieval snapshot, enabling the Planner to diagnose execution gaps and generate grounded search plans.
    • Category: Other · Direction: null_result · Confidence: high (0.6)
    • Outcome: retrieval snapshot exposure and Planner diagnostic output (implementation/functional outcome)
  • Claim: EASP's Offline Data Synthesis stage: a Teacher Agent synthesizes diverse, execution-validated plans by diagnosing the probed environment.
    • Category: Other · Direction: null_result · Confidence: high (0.6)
    • Outcome: synthesized execution-validated search plans (data generation outcome)
  • Claim: The Planner is trained via Supervised Fine-Tuning (SFT) to internalize diagnostic capabilities and then aligned with business outcomes (conversion rate) via Reinforcement Learning (RL).
    • Category: Other · Direction: null_result · Confidence: high (0.6)
    • Outcome: Planner diagnostic behavior and policy alignment with conversion rate (model training outcome)
  • Claim: A complexity-aware routing mechanism selectively activates planning for complex queries, ensuring optimal resource allocation during online serving.
    • Category: Other · Direction: null_result · Confidence: medium (0.36)
    • Outcome: selective activation of planning (system routing/resource allocation outcome)
  • Claim: Extensive offline evaluations and online A/B testing on JD.com show that EASP significantly improves relevant recall.
    • Category: Output Quality · Direction: positive · Confidence: medium (0.36)
    • Outcome: relevant recall (retrieval effectiveness metric)
  • Claim: Online A/B testing on JD.com demonstrates that EASP achieves substantial lifts in UCVR (user conversion rate) and GMV (gross merchandise volume).
    • Category: Firm Revenue · Direction: positive · Confidence: medium (0.36)
    • Outcome: UCVR (user conversion rate) and GMV (gross merchandise volume)
  • Claim: EASP has been successfully deployed in JD.com's AI-Search system.
    • Category: Adoption Rate · Direction: positive · Confidence: medium (0.36)
    • Outcome: deployment status in production (operational outcome)
  • Claim: EASP offers a practical tradeoff between reasoning quality and latency by avoiding iterative LLM tool-calls at inference time while still producing grounded plans.
    • Category: Task Completion Time · Direction: positive · Confidence: medium (0.36)
    • Outcome: inference latency vs. reasoning/plan validity tradeoff (system performance outcome)

Notes