AIGQ’s generative query suggestions boost Taobao engagement: an interest-aware training and policy-optimization pipeline raises click-through and other business metrics in large-scale randomized homepage tests.

AIGQ: An End-to-End Hybrid Generative Architecture for E-commerce Query Recommendation

Jingcao Xu, Jianyun Zou, Renkai Yang, Zili Geng, Qiang Liu, Haihong Tang · March 20, 2026

arXiv · RCT · high evidence · 7/10 relevance

AIGQ, an end-to-end generative framework for pre-search query recommendations on Taobao, uses interest-aware list fine-tuning and a dual-reward policy optimizer to significantly improve click-through and key platform engagement metrics in large-scale randomized experiments.

Pre-search query recommendation, widely known as HintQ on Taobao's homepage, plays a vital role in intent capture and demand discovery, yet traditional methods suffer from shallow semantics, poor cold-start performance and low serendipity due to reliance on ID-based matching and co-click heuristics. To overcome these challenges, we propose AIGQ (AI-Generated Query architecture), the first end-to-end generative framework for the HintQ scenario. AIGQ is built upon three core innovations spanning training paradigm, policy optimization and deployment architecture. First, we propose Interest-Aware List Supervised Fine-Tuning (IL-SFT), a list-level supervised learning approach that constructs training samples through session-aware behavior aggregation and an interest-guided re-ranking strategy to faithfully model nuanced user intent. Accordingly, we design Interest-aware List Group Relative Policy Optimization (IL-GRPO), a novel policy gradient algorithm with a dual-component reward mechanism that jointly optimizes individual query relevance and global list properties, enhanced by a model-based reward from the online click-through rate (CTR) ranking model. To deploy under strict real-time and low-latency requirements, we further develop a hybrid offline-online architecture comprising AIGQ-Direct for nearline personalized user-to-query generation and AIGQ-Think, a reasoning-enhanced variant that produces trigger-to-query mappings to enrich interest diversity. Extensive offline evaluations and large-scale online A/B experiments on Taobao demonstrate that AIGQ consistently delivers substantial improvements in key business metrics across platform effectiveness and user engagement.

Summary

Main Finding

AIGQ introduces the first end-to-end generative architecture deployed for pre-search (HintQ) query recommendation in e-commerce. By co-designing list-level supervised fine-tuning, a list-aware RL policy optimization algorithm, and a hybrid offline–online deployment, AIGQ achieves materially better personalization, cold-start generalization and serendipity than ID/co-click based baselines while meeting strict production latency and cost constraints. Large-scale online A/B tests on Taobao report consistent, substantial lifts in platform effectiveness and user engagement and the highest PVR contribution among competing recalls.

Key Points

Problem & setting
- HintQ: personalized, pre-search query recommendation on Taobao’s homepage with no active user query; input = user profile + behavior sequence; output = ranked list of K hint queries.
- Challenges: balancing personalization accuracy against diversity/serendipity, generalizing to cold-start cases, and meeting tight online latency/compute budgets.
Core contributions
- Interest-Aware List Supervised Fine-Tuning (IL-SFT): treats target as an ordered list (not just next-token prediction) built from session-aware behavior aggregation and interest-guided re-ranking to better reflect true user interest strength.
- Interest-aware List Group Relative Policy Optimization (IL-GRPO): an RL policy-gradient adaptation that computes advantage at the individual-query-in-list granularity and uses a dual-component reward (local query-level + global sequence-level), augmented by a model-based reward from an online CTR ranker.
- Hybrid offline–online deployment: two LLM variants
  - AIGQ-Direct: lightweight nearline personalized user-to-query (u2q) generator for fast, per-user generation.
  - AIGQ-Think: reasoning-enhanced, offline trigger-to-query (x2q) mapping generator that expands interest diversity and provides triggers for refinement.
- Engineering optimizations: item-to-text generator (to map items to textual surrogates), prompt compression (special tokens for structured fields and instruction pruning), chain-of-thought-style reasoning distillation (Qwen3-32B teacher) for AIGQ-Think, and caching strategies (u2q / x2q) to meet latency budgets.
Empirical outcome
- Deployed at scale on Taobao; reported substantial improvements in user engagement and business metrics (CTR, PVR contribution, etc.). The paper emphasizes consistent online gains though exact numbers are not provided in the excerpt.

Data & Methods

Data
- Large-scale industrial logs from Taobao: time-ordered user interaction histories (searches, item clicks, exposures, hint-query clicks), user profile attributes.
- Training labels constructed by combining system priors (production ranking scores) and feedback priors (clicks), plus LLM-generated candidates for coverage expansion (with LLM filtering).
Sample construction
- Two-stage unified pipeline:
  - Session-aware behavior aggregation:
    - AIGQ-Direct: session = a single HintQ exposure event (kept if at least one of the three displayed queries was clicked).
    - AIGQ-Think: session = day-level window from first search entry to exit (aggregates cross-domain behaviors to discover diverse interests).
  - Interest-guided label re-ranking: assemble and order candidate queries according to interest strength (clicks highest, then global searches, production-ranked queries, LLM-generated candidates).
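The sample-construction steps above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's pipeline: the `Candidate` type, the `TIER_PRIORITY` values, the within-tier `score` field, and both function names are hypothetical. Only the tier ordering (clicks, then global searches, then production-ranked queries, then LLM-generated candidates) and the "at least one click" filter come from the description above.

```python
from dataclasses import dataclass

# Tier order follows the description above; the numeric values are
# illustrative assumptions, not taken from the paper.
TIER_PRIORITY = {"click": 0, "search": 1, "production": 2, "llm": 3}


@dataclass(frozen=True)
class Candidate:
    query: str
    source: str          # one of the TIER_PRIORITY keys
    score: float = 0.0   # hypothetical within-tier score


def keep_direct_session(clicked_flags):
    """Keep an AIGQ-Direct exposure only if at least one shown query was clicked."""
    return any(clicked_flags)


def rerank_labels(candidates, k):
    """Order candidates by interest tier, then by score, dropping duplicate queries."""
    ordered = sorted(candidates, key=lambda c: (TIER_PRIORITY[c.source], -c.score))
    seen, labels = set(), []
    for cand in ordered:
        if len(labels) >= k:
            break
        if cand.query not in seen:
            seen.add(cand.query)
            labels.append(cand.query)
    return labels
```

A query that appears in several tiers keeps only its strongest position, so a clicked query is never pushed down by its own production-ranked duplicate.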
Supervised training (IL-SFT)
- Treats the output as an ordered list z = [q_1, ..., q_T]; the objective is the list log-likelihood, Σ_t log P(z_t | z_<t, x; θ), which training maximizes (equivalently, the loss is its negative).
- AIGQ-Direct: flat top-K list supervised from production and click signals.
- AIGQ-Think: structured outputs (trigger → query lists) with explicit reasoning traces r_k generated by a teacher LLM; distillation dataset includes (context, rationale, structured output).
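As a rough illustration of the list-level objective, the sketch below computes the negative of the list log-likelihood from per-token probabilities. It assumes the model exposes the probability it assigned to each target token, conditioned on the input x and everything generated earlier in the list; the function name `list_nll` and the nested-list input layout are hypothetical.

```python
import math


def list_nll(token_probs_per_query):
    """Negative list log-likelihood of one ordered target list.

    token_probs_per_query[t] holds the probabilities the model assigned to the
    tokens of the t-th query, each conditioned on the input and on all earlier
    tokens in the list (an assumed input layout, for illustration only).
    """
    return -sum(
        math.log(p)
        for query_probs in token_probs_per_query
        for p in query_probs
    )
```

Because each query is conditioned on the queries before it, the order of the target list affects the loss, which is what distinguishes list-level supervision from scoring each query independently.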
Reinforcement learning (IL-GRPO)
- Extension of GRPO adapted for ordered list generation:
  - Fine-grained advantage estimation per query within a generated list.
  - Dual-component reward:
    - Local query-level reward: assesses individual query relevance/quality (e.g., ROUGE-L, length/format penalties, CTR proxy).
    - Global sequence-level reward: evaluates list coherence, coverage, diversity, repetition penalties.
  - Model-based reward augmentation: uses an online CTR ranking model to produce additional reward signal aligning generation with short-term engagement.
  - Practical RL techniques: entropy-adaptive clipping, dynamic entropy regularization, and rollout-based sequence evaluation to handle combinatorial list dependencies.
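The per-query advantage idea can be sketched as follows. Standard GRPO normalizes rewards within a group of sampled outputs using the group mean and standard deviation. The way the local and global rewards are combined here (a weighted sum with a hypothetical `global_weight`) and the choice to normalize over all queries in the group are assumptions made for illustration; the summary does not give the paper's exact formula.

```python
from statistics import mean, pstdev


def per_query_advantages(local_rewards, global_rewards, global_weight=0.5, eps=1e-6):
    """Group-relative advantage for every query in every sampled list.

    local_rewards[i][j]: query-level reward of query j in sampled list i.
    global_rewards[i]:   list-level reward of sampled list i.
    The weighted-sum combination is an assumption, not the paper's formula.
    """
    combined = [
        [r + global_weight * g for r in row]
        for row, g in zip(local_rewards, global_rewards)
    ]
    flat = [r for row in combined for r in row]
    mu, sigma = mean(flat), pstdev(flat)
    # Normalize against the whole group, as in group-relative baselines.
    return [[(r - mu) / (sigma + eps) for r in row] for row in combined]
```

With this shape, a strong query inside a weak list can still receive a positive advantage, while the shared global term shifts every query in a list up or down together.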
Architecture & inference
- Item-to-text generator: fine-tuned Qwen3-32B to produce compact textual surrogates for item metadata.
- Prompt compression: special tokens for short encodings (e.g., <1_day_ago>) and pruning redundant instructions.
- Hybrid deployment:
  - AIGQ-Direct runs nearline to produce u2q cache (personalized).
  - AIGQ-Think runs offline to produce trigger-to-query x2q mappings (diversification).
  - Online system composes/refines results from caches and other retrievals to meet strict latency budgets.
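One way to picture the online composition step is a cache lookup followed by a merge. The sketch below is an assumption-heavy illustration: the dictionary-based caches, the function name `compose_hints`, and the round-robin interleaving are all hypothetical, and a production system would presumably rank the merged candidates rather than interleave them. The default of three hints mirrors the three displayed queries mentioned earlier.

```python
from itertools import chain, zip_longest


def compose_hints(user_id, triggers, u2q_cache, x2q_cache, k=3):
    """Merge personalized (u2q) and trigger-based (x2q) candidates into k hints.

    u2q_cache: maps a user id to that user's precomputed query list.
    x2q_cache: maps a trigger to its precomputed query list.
    Round-robin interleaving is an illustrative choice, not the paper's method.
    """
    personalized = u2q_cache.get(user_id, [])
    diversified = [q for t in triggers for q in x2q_cache.get(t, [])]
    interleaved = chain.from_iterable(zip_longest(personalized, diversified))
    hints = []
    for query in interleaved:
        if len(hints) >= k:
            break
        # zip_longest pads the shorter list with None; skip padding and repeats.
        if query is not None and query not in hints:
            hints.append(query)
    return hints
```

Since both lookups are plain cache reads, no LLM call sits on the request path in this sketch, which is the property the hybrid design relies on to meet latency budgets. A user with no u2q entry still receives trigger-based hints.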

Implications for AI Economics

Increased monetization potential and platform value
- Better pre-search intent capture (improved CTR, PVR contribution) likely increases downstream conversions and GMV by surfacing queries that lead to more relevant search sessions and purchases.
- Higher serendipity/diversity can increase user engagement, session length, and lifetime value—important economic levers for marketplaces.
Cost–benefit and operational economics
- LLM adoption usually raises inference cost; AIGQ’s hybrid architecture (nearline u2q + offline x2q + caching) mitigates per-request compute, enabling practical economics for high-throughput platforms.
- Item-to-text generation and prompt compression reduce input/output token counts, lowering inference cost further.
- The model-based CTR reward aligns generation to short-term monetizable signals, improving returns on model training/deployment investment.
Market-level and strategic effects
- Improved hint-query quality can reshape user search behaviors and discovery patterns, potentially increasing competition among merchants for discoverability.
- Changes in what users are recommended may shift demand distribution across products and categories; platform needs to manage marketplace fairness and merchant incentives.
Risks, externalities, and governance
- Behavioral shaping: proactively generated queries can bias users toward certain categories/products—requiring careful monitoring to avoid over-concentration and to preserve long-term user trust.
- Feedback loops: using CTR-model rewards and production signals risks reinforcing short-term engagement biases; balancing exploration (serendipity) vs exploitation is economically important.
- Privacy and data governance: heavy reliance on personal behavior logs mandates robust privacy, consent and compliance safeguards—nontrivial economic and regulatory costs.
- Fairness and merchant impact: algorithmic changes can advantage certain sellers/categories; platforms may need compensation, transparency or marketplace governance mechanisms.
Research and product directions with economic relevance
- Quantify downstream GMV and long-term retention lift attributable to list-level generative recommendations (causal experiments).
- Measure cost-per-incremental-GMV for LLM-based recall vs traditional retrieval to validate ROI.
- Explore dynamic pricing of promoted exposure slots if generated queries change demand patterns.
- Develop safe-reward designs that trade off short-term monetization vs long-term user value, to prevent economically harmful optimization.

Summary: AIGQ shows how list-level generative modeling plus list-aware RL and a pragmatic hybrid deployment can make LLM-driven query recommendation economically viable and beneficial on a large e-commerce platform. For platforms considering LLMs, AIGQ illustrates key design levers (list supervision, model-based reward alignment, reasoning distillation, and offline/online caching) that jointly govern the trade-offs between user utility, monetization, and operational cost.

Assessment

Paper Type: rct
Evidence Strength: high. The paper reports large-scale, in-production randomized A/B tests on the Taobao platform, which provide direct causal evidence of AIGQ's impact on user engagement and business metrics; offline evaluations and ablation-style comparisons of modeling components further support the findings. Remaining threats are typical platform-experiment concerns (spillovers, short-run measurement, proprietary metric dependence) but do not invalidate the core randomized comparisons.
Methods Rigor: medium. The paper proposes novel training (IL-SFT) and policy-optimization (IL-GRPO) methods and describes a realistic deployment architecture; however, the description (as summarized) leaves open questions about hyperparameters, robustness checks, ablation completeness, statistical reporting (e.g., confidence intervals, pre-registration), and reproducibility given reliance on a proprietary CTR model and production infrastructure.
Sample: Proprietary Taobao data: aggregated session histories and click logs used for offline training and evaluation; a large population of Taobao homepage users and impressions randomized in online A/B experiments; model-based reward signals come from the platform's online CTR ranking model; exact sample sizes and demographic breakdowns are not specified in the summary.
Themes: adoption, innovation
Identification: Large-scale online randomized A/B experiments on Taobao homepage comparing the AIGQ system to baseline query-recommendation methods; causal inference relies on random assignment of users/impressions to treatment and control and measurement of downstream business metrics (CTR, engagement, platform effectiveness).
Generalizability:
- Results are platform-specific (Taobao) and may not generalize to other e-commerce sites or non-commerce contexts.
- User population is region/language-specific (Taobao users, primarily Chinese shoppers), limiting cross-cultural transfer.
- Relies on a platform's proprietary CTR/ranking model and large-scale infrastructure; small platforms or those without strong CTR models may see different gains.
- Optimization tailored to pre-search HintQ/homepage query use-case; other recommendation surfaces (search results, ads) may behave differently.
- Potential short-term experiment horizon; long-run effects on user behavior and retention are unclear.

Claims (7)

Claim	Category	Direction	Confidence	Outcome	Details
AIGQ is the first end-to-end generative framework for the HintQ (pre-search query recommendation) scenario.	Innovation Output	positive	high	innovation_output	0.1
Interest-Aware List Supervised Fine-Tuning (IL-SFT) is a list-level supervised learning approach that constructs training samples through session-aware behavior aggregation and interest-guided re-ranking to faithfully model nuanced user intent.	Decision Quality	positive	high	modeling of user intent (nuanced intent capture)	0.6
Interest-aware List Group Relative Policy Optimization (IL-GRPO) is a novel policy gradient algorithm with a dual-component reward mechanism that jointly optimizes individual query relevance and global list properties.	Output Quality	positive	high	individual query relevance and global list properties	0.6
IL-GRPO is enhanced by a model-based reward from the online click-through rate (CTR) ranking model.	Output Quality	positive	high	optimization quality via CTR-informed reward	0.6
A hybrid offline-online deployment architecture composed of AIGQ-Direct (nearline personalized user-to-query generation) and AIGQ-Think (reasoning-enhanced trigger-to-query mappings) enables meeting strict real-time and low-latency requirements while enriching interest diversity.	Organizational Efficiency	positive	high	real-time/low-latency deployment and interest diversity	0.6
Extensive offline evaluations and large-scale online A/B experiments on Taobao demonstrate that AIGQ consistently delivers substantial improvements in key business metrics across platform effectiveness and user engagement.	Adoption Rate	positive	high	platform effectiveness and user engagement (key business metrics)	0.6
AIGQ overcomes limitations of traditional HintQ methods (shallow semantics, poor cold-start performance, and low serendipity) that arise from reliance on ID-based matching and co-click heuristics.	Output Quality	positive	high	cold-start performance, semantic richness, serendipity of recommended queries	0.6