AIGQ’s generative query suggestions boost Taobao engagement: an interest-aware training and policy-optimization pipeline raises click-through and other business metrics in large-scale randomized homepage tests.
Pre-search query recommendation, widely known as HintQ on Taobao's homepage, plays a vital role in intent capture and demand discovery, yet traditional methods suffer from shallow semantics, poor cold-start performance and low serendipity due to their reliance on ID-based matching and co-click heuristics. To overcome these challenges, we propose AIGQ (AI-Generated Query architecture), the first end-to-end generative framework for the HintQ scenario. AIGQ is built upon three core innovations spanning training paradigm, policy optimization and deployment architecture. First, we propose Interest-Aware List Supervised Fine-Tuning (IL-SFT), a list-level supervised learning approach that constructs training samples through session-aware behavior aggregation and an interest-guided re-ranking strategy to faithfully model nuanced user intent. Accordingly, we design Interest-aware List Group Relative Policy Optimization (IL-GRPO), a novel policy gradient algorithm with a dual-component reward mechanism that jointly optimizes individual query relevance and global list properties, enhanced by a model-based reward from the online click-through rate (CTR) ranking model. To deploy under strict real-time and low-latency requirements, we further develop a hybrid offline-online architecture comprising AIGQ-Direct for nearline personalized user-to-query generation and AIGQ-Think, a reasoning-enhanced variant that produces trigger-to-query mappings to enrich interest diversity. Extensive offline evaluations and large-scale online A/B experiments on Taobao demonstrate that AIGQ consistently delivers substantial improvements in key business metrics across platform effectiveness and user engagement.
Summary
Main Finding
AIGQ introduces the first end-to-end generative architecture deployed for pre-search (HintQ) query recommendation in e-commerce. By co-designing list-level supervised fine-tuning, a list-aware RL policy optimization algorithm, and a hybrid offline–online deployment, AIGQ achieves materially better personalization, cold-start generalization and serendipity than ID-based and co-click-based baselines while meeting strict production latency and cost constraints. Large-scale online A/B tests on Taobao report consistent, substantial lifts in platform effectiveness and user engagement, and the highest PVR contribution among competing recalls.
Key Points
- Problem & setting
  - HintQ: personalized, pre-search query recommendation on Taobao’s homepage with no active user query; input = user profile + behavior sequence; output = ranked list of K hint queries.
  - Challenges: balancing personalization accuracy against diversity/serendipity, generalizing to cold-start users, and meeting tight online latency/compute budgets.
- Core contributions
  - Interest-Aware List Supervised Fine-Tuning (IL-SFT): treats the target as an ordered list (not just next-token prediction) built from session-aware behavior aggregation and interest-guided re-ranking to better reflect true user interest strength.
  - Interest-aware List Group Relative Policy Optimization (IL-GRPO): an RL policy-gradient adaptation that computes the advantage at the granularity of individual queries within a list and uses a dual-component reward (local query-level + global sequence-level), augmented by a model-based reward from an online CTR ranker.
  - Hybrid offline–online deployment: two LLM variants
    - AIGQ-Direct: lightweight nearline personalized user-to-query (u2q) generator for fast, per-user generation.
    - AIGQ-Think: reasoning-enhanced, offline trigger-to-query (x2q) mapping generator that expands interest diversity and provides triggers for refinement.
  - Engineering optimizations: item-to-text generator (to map items to textual surrogates), prompt compression (special tokens for structured fields and instruction pruning), chain-of-thought style reasoning distillation (Qwen3-32B teacher) for AIGQ-Think, and caching strategies (u2q / x2q) to meet latency budgets.
- Empirical outcome
  - Deployed at scale on Taobao; reported substantial improvements in user engagement and business metrics (CTR, PVR contribution, etc.). The paper emphasizes consistent online gains, though exact numbers are not provided in the excerpt.
Data & Methods
- Data
  - Large-scale industrial logs from Taobao: time-ordered user interaction histories (searches, item clicks, exposures, hint-query clicks) and user profile attributes.
  - Training labels constructed by combining system priors (production ranking scores) and feedback priors (clicks), plus LLM-generated candidates for coverage expansion (with LLM filtering).
- Sample construction
  - Two-stage unified pipeline:
    - Session-aware behavior aggregation:
      - AIGQ-Direct: session = a single HintQ exposure event (kept if at least one of the three displayed queries was clicked).
      - AIGQ-Think: session = day-level window from first search entry to exit (aggregates cross-domain behaviors to discover diverse interests).
    - Interest-guided label re-ranking: assemble and order candidate queries according to interest strength (clicks highest, then global searches, production-ranked queries, LLM-generated candidates).
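The two stages above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `Candidate` type, the `score` field used to break ties within a source, and the function names are all hypothetical. Only the source priority order and the "at least one click" session filter come from the text.

```python
from dataclasses import dataclass

# Priority order taken from the text: clicked queries first, then global
# searches, then production-ranked queries, then LLM-generated candidates.
SOURCE_PRIORITY = {"click": 0, "search": 1, "production": 2, "llm": 3}


@dataclass(frozen=True)
class Candidate:
    query: str
    source: str         # one of SOURCE_PRIORITY's keys
    score: float = 0.0  # hypothetical within-source strength (assumption)


def keep_direct_session(displayed, clicked):
    """AIGQ-Direct filter: keep an exposure if any displayed query was clicked."""
    return any(q in clicked for q in displayed)


def rank_key(c):
    # Lower tuple sorts first: stronger source, then higher score.
    return (SOURCE_PRIORITY[c.source], -c.score)


def rerank_labels(candidates, k):
    """Deduplicate queries, keep each one's strongest signal, return the top k."""
    best = {}
    for c in candidates:
        key = c.query.strip().lower()
        if key not in best or rank_key(c) < rank_key(best[key]):
            best[key] = c
    ordered = sorted(best.values(), key=rank_key)
    return [c.query for c in ordered[:k]]


cands = [
    Candidate("running shoes", "llm"),
    Candidate("trail shoes", "production", 0.9),
    Candidate("running shoes", "click"),
    Candidate("rain jacket", "search"),
    Candidate("socks", "production", 0.4),
]
labels = rerank_labels(cands, k=4)
```

A query that appears under several sources is kept once, at the position of its strongest signal, so a clicked query is never pushed down by its own LLM-generated duplicate.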
- Supervised training (IL-SFT)
  - Treats the output as an ordered list z = [q_1, ..., q_T]; training maximizes the list log-likelihood, i.e. the sum over t of log P(z_t | z_{<t}, x; θ).
  - AIGQ-Direct: flat top-K list supervised from production and click signals.
  - AIGQ-Think: structured outputs (trigger → query lists) with explicit reasoning traces r_k generated by a teacher LLM; the distillation dataset consists of (context, rationale, structured output) triples.
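As a rough illustration of the list-level objective, the sketch below computes the negative list log-likelihood from per-step logits in plain Python. It assumes teacher forcing (the logits at step t are already conditioned on x and z_{<t}) and uses the common convention of minimizing the negative log-likelihood; the function names and the toy vocabulary of size 4 are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one step's logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def list_nll(step_logits, target_ids):
    """Negative list log-likelihood: -sum_t log P(z_t | z_<t, x)."""
    if len(step_logits) != len(target_ids):
        raise ValueError("one logit vector is required per target element")
    total = 0.0
    for logits, target in zip(step_logits, target_ids):
        total += log_softmax(logits)[target]
    return -total


# Toy example: 3 list positions, uniform logits over a 4-symbol vocabulary,
# so each step contributes log(4) to the loss.
uniform_loss = list_nll([[0.0, 0.0, 0.0, 0.0]] * 3, [0, 1, 2])
```

Because every position is conditioned on the positions before it, the order of the target list matters, which is what lets the interest-guided ordering of labels influence training.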
- Reinforcement learning (IL-GRPO)
  - Extension of GRPO adapted for ordered list generation:
    - Fine-grained advantage estimation per query within a generated list.
    - Dual-component reward:
      - Local query-level reward: assesses individual query relevance/quality (e.g., ROUGE-L, length/format penalties, CTR proxy).
      - Global sequence-level reward: evaluates list coherence, coverage, diversity, repetition penalties.
    - Model-based reward augmentation: uses an online CTR ranking model to produce an additional reward signal aligning generation with short-term engagement.
    - Practical RL techniques: entropy-adaptive clipping, dynamic entropy regularization, and rollout-based sequence evaluation to handle combinatorial list dependencies.
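The excerpt does not give the exact IL-GRPO formulas, so the following is only a sketch of one plausible reading of per-query, group-relative advantages with a dual-component reward. The distinct-query ratio used as the global reward, the mixing weight `lam`, and normalizing over all queries in the sampled group are assumptions made for illustration.

```python
from statistics import mean, pstdev


def global_reward(queries):
    """Hypothetical list-level reward: share of distinct queries in the list."""
    return len(set(queries)) / len(queries) if queries else 0.0


def per_query_advantages(group, local_rewards, lam=0.5, eps=1e-6):
    """Combine local and global rewards, then normalize across the whole group.

    group:         sampled lists for one prompt, each a list of query strings
    local_rewards: per-query rewards with the same shape as `group`
    """
    combined = []
    for queries, locals_ in zip(group, local_rewards):
        g = global_reward(queries)
        # Every query in a list shares that list's global reward.
        combined.append([r + lam * g for r in locals_])
    flat = [r for row in combined for r in row]
    mu, sigma = mean(flat), pstdev(flat)  # population standard deviation
    return [[(r - mu) / (sigma + eps) for r in row] for row in combined]


# Two sampled lists with identical local rewards; the second repeats a query.
group = [["a", "b"], ["a", "a"]]
local = [[1.0, 0.0], [1.0, 0.0]]
adv = per_query_advantages(group, local)
```

In this toy case the first query of the non-repeating list receives a larger advantage than the first query of the repeating list, even though their local rewards are equal, which is the kind of credit assignment a list-aware reward is meant to provide.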
- Architecture & inference
  - Item-to-text generator: fine-tuned Qwen3-32B to produce compact textual surrogates for item metadata.
  - Prompt compression: special tokens for short encodings (<1_day_ago>, etc.) and pruning of redundant instructions.
  - Hybrid deployment:
    - AIGQ-Direct runs nearline to produce the u2q cache (personalized).
    - AIGQ-Think runs offline to produce trigger-to-query x2q mappings (diversification).
    - The online system composes/refines results from the caches and other retrievals to meet strict latency budgets.
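A minimal sketch of the online composition step, assuming both caches can be read as plain dictionaries. The merge policy shown here (personalized u2q results first, then x2q expansions of the user's triggers, deduplicated and truncated to k) is a hypothetical choice; the excerpt does not describe how the production system blends the two sources.

```python
def compose_hints(user_id, triggers, u2q_cache, x2q_cache, k=3):
    """Merge cached u2q and x2q queries without any online LLM call."""
    personalized = u2q_cache.get(user_id, [])
    expanded = [q for t in triggers for q in x2q_cache.get(t, [])]
    hints, seen = [], set()
    for q in personalized + expanded:
        if len(hints) >= k:
            break
        if q not in seen:
            seen.add(q)
            hints.append(q)
    return hints


# Hypothetical cache contents for illustration only.
u2q = {"u1": ["tent", "sleeping bag"]}
x2q = {"camping": ["tent", "camp stove"], "hiking": ["trail shoes"]}

warm = compose_hints("u1", ["camping", "hiking"], u2q, x2q, k=3)
cold = compose_hints("u2", ["hiking"], u2q, x2q, k=3)
```

A user with no u2q entry still receives hints through trigger expansion, which is one way the offline x2q mappings can cover cold-start traffic while keeping request-time work to dictionary lookups.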
Implications for AI Economics
- Increased monetization potential and platform value
  - Better pre-search intent capture (improved CTR, PVR contribution) likely increases downstream conversions and GMV by surfacing queries that lead to more relevant search sessions and purchases.
  - Higher serendipity/diversity can increase user engagement, session length, and lifetime value, all of which are important economic levers for marketplaces.
- Cost–benefit and operational economics
  - LLM adoption usually raises inference cost; AIGQ’s hybrid architecture (nearline u2q + offline x2q + caching) mitigates per-request compute, enabling practical economics for high-throughput platforms.
  - Item-to-text generation and prompt compression reduce input/output token counts, lowering inference cost further.
  - The model-based CTR reward aligns generation to short-term monetizable signals, improving returns on model training/deployment investment.
- Market-level and strategic effects
  - Improved hint-query quality can reshape user search behaviors and discovery patterns, potentially increasing competition among merchants for discoverability.
  - Changes in what users are recommended may shift demand distribution across products and categories; the platform needs to manage marketplace fairness and merchant incentives.
- Risks, externalities, and governance
  - Behavioral shaping: proactively generated queries can bias users toward certain categories/products, which requires careful monitoring to avoid over-concentration and to preserve long-term user trust.
  - Feedback loops: using CTR-model rewards and production signals risks reinforcing short-term engagement biases; balancing exploration (serendipity) vs exploitation is economically important.
  - Privacy and data governance: heavy reliance on personal behavior logs mandates robust privacy, consent and compliance safeguards, which carry nontrivial economic and regulatory costs.
  - Fairness and merchant impact: algorithmic changes can advantage certain sellers/categories; platforms may need compensation, transparency or marketplace governance mechanisms.
- Research and product directions with economic relevance
  - Quantify downstream GMV and long-term retention lift attributable to list-level generative recommendations (causal experiments).
  - Measure cost-per-incremental-GMV for LLM-based recall vs traditional retrieval to validate ROI.
  - Explore dynamic pricing of promoted exposure slots if generated queries change demand patterns.
  - Develop safe-reward designs that trade off short-term monetization vs long-term user value, to prevent economically harmful optimization.
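The cost-per-incremental-GMV measure mentioned above could be computed along these lines. This is a hedged sketch: the function name, its arguments, and the figures in the example are hypothetical and are not taken from the paper, and a real analysis would need confidence intervals from the A/B test rather than point estimates.

```python
def cost_per_incremental_gmv(extra_cost, gmv_treatment, gmv_control):
    """Extra serving/training cost divided by the GMV lift of treatment over control.

    Returns None when there is no positive lift, since the ratio is then
    not meaningful as a return-on-investment figure.
    """
    lift = gmv_treatment - gmv_control
    if lift <= 0:
        return None
    return extra_cost / lift


# Hypothetical numbers in arbitrary currency units, for illustration only.
ratio = cost_per_incremental_gmv(extra_cost=50.0, gmv_treatment=1200.0, gmv_control=1000.0)
```

A ratio below the platform's take rate would suggest the LLM-based recall pays for itself under these assumptions; a missing value signals that the experiment showed no lift to attribute cost to.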
Summary
AIGQ shows how list-level generative modeling plus list-aware RL and a pragmatic hybrid deployment can make LLM-driven query recommendation economically viable and beneficial in a large e-commerce platform. For platforms considering LLMs, AIGQ illustrates key design levers (list supervision, model-based reward alignment, reasoning distillation, and offline/online caching) that jointly govern the trade-offs between user utility, monetization, and operational cost.
Assessment
Claims (7)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| AIGQ is the first end-to-end generative framework for the HintQ (pre-search query recommendation) scenario. [Innovation Output] | positive | high | innovation_output | 0.1 |
| Interest-Aware List Supervised Fine-Tuning (IL-SFT) is a list-level supervised learning approach that constructs training samples through session-aware behavior aggregation and interest-guided re-ranking to faithfully model nuanced user intent. [Decision Quality] | positive | high | modeling of user intent (nuanced intent capture) | 0.6 |
| Interest-aware List Group Relative Policy Optimization (IL-GRPO) is a novel policy gradient algorithm with a dual-component reward mechanism that jointly optimizes individual query relevance and global list properties. [Output Quality] | positive | high | individual query relevance and global list properties | 0.6 |
| IL-GRPO is enhanced by a model-based reward from the online click-through rate (CTR) ranking model. [Output Quality] | positive | high | optimization quality via CTR-informed reward | 0.6 |
| A hybrid offline-online deployment architecture composed of AIGQ-Direct (nearline personalized user-to-query generation) and AIGQ-Think (reasoning-enhanced trigger-to-query mappings) enables meeting strict real-time and low-latency requirements while enriching interest diversity. [Organizational Efficiency] | positive | high | real-time/low-latency deployment and interest diversity | 0.6 |
| Extensive offline evaluations and large-scale online A/B experiments on Taobao demonstrate that AIGQ consistently delivers substantial improvements in key business metrics across platform effectiveness and user engagement. [Adoption Rate] | positive | high | platform effectiveness and user engagement (key business metrics) | 0.6 |
| AIGQ overcomes limitations of traditional HintQ methods (shallow semantics, poor cold-start performance, and low serendipity) that arise from reliance on ID-based matching and co-click heuristics. [Output Quality] | positive | high | cold-start performance, semantic richness, serendipity of recommended queries | 0.6 |