A redesigned generative search engine, OneSearch-V2, boosted on-site engagement and sales in live experiments, raising item click-through rate by ~4%, buyer conversion by ~3%, and order volume by ~2%. It achieves these gains through latent-reasoning query understanding and self-distillation, without increasing latency.
Generative Retrieval (GR) has emerged as a promising paradigm for modern search systems. Compared to multi-stage cascaded architectures, it offers advantages such as end-to-end joint optimization and high computational efficiency. OneSearch, a representative generative search framework deployed at industrial scale, has brought significant commercial and operational benefits. However, its inadequate understanding of complex queries, inefficient exploitation of latent user intents, and overfitting to narrow historical preferences have limited further performance improvement. To address these challenges, we propose OneSearch-V2, a latent-reasoning-enhanced self-distillation generative search framework. It contains three key innovations: (1) a thought-augmented complex query understanding module, which enables deep query understanding and overcomes the shallow semantic matching limitations of direct inference; (2) a reasoning-internalized self-distillation training pipeline, which uncovers users' potential yet precise e-commerce intentions beyond log-fitting through implicit in-context learning; (3) a behavior preference alignment optimization system, which mitigates reward hacking arising from optimizing a single conversion metric and addresses personal preferences via direct user feedback. Extensive offline evaluations demonstrate OneSearch-V2's strong query recognition and user profiling capabilities. Online A/B tests further validate its business effectiveness, yielding +3.98% item CTR, +3.05% buyer conversion rate, and +2.11% order volume. Manual evaluation further confirms gains in search experience quality, with +1.65% in page good rate and +1.37% in query-item relevance. More importantly, OneSearch-V2 effectively mitigates common search system issues such as information bubbles and long-tail sparsity, without incurring additional inference costs or serving latency.
Summary
Main Finding
OneSearch-V2 is a generative e-commerce search framework that internalizes LLM reasoning via keyword-based chain-of-thought (CoT) and an information-asymmetric self-distillation pipeline, and replaces an external reward model with direct behavior-feedback preference alignment. The system improves retrieval and commercial performance at no extra inference cost and reduces long-tail sparsity and information-bubble effects. Online A/B tests on Kuaishou report substantial business gains (e.g., about +4% item CTR, about +3% buyer conversion rate, +3.45% GMV) while keeping deployment latency unchanged.
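The information-asymmetric self-distillation idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the bracketed input format, the `student_sees_user` flag, the temperature-scaled KL objective (the standard knowledge-distillation loss, swapped in here because the exact objective is not given), and all logits are hypothetical. It shows the core mechanism: the same model is run on a teacher view that includes the keyword CoT and on a student view that does not, and the student's output distribution is pulled toward the teacher's.

```python
import math
from typing import List, Sequence, Tuple


def build_inputs(
    query: str,
    cot_keywords: Sequence[str],
    user_context: str,
    student_sees_user: bool = True,
) -> Tuple[str, str]:
    """Hypothetical input format for information-asymmetric self-distillation.

    The teacher view carries the privileged keyword CoT; the student view never does.
    Whether the student also sees user context is an assumption (see the flag).
    """
    teacher = f"[query] {query} [user] {user_context} [cot] {' '.join(cot_keywords)}"
    student = f"[query] {query}" + (f" [user] {user_context}" if student_sees_user else "")
    return teacher, student


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def self_distillation_loss(
    teacher_logits: Sequence[float],
    student_logits: Sequence[float],
    temperature: float = 2.0,
) -> float:
    """Temperature-scaled KL(teacher || student), as in standard knowledge distillation.

    In a real trainer the teacher logits come from the *same* network run on the
    CoT-augmented view and are detached (no gradient) before this loss.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return temperature ** 2 * kl_divergence(p_teacher, p_student)


# Usage with made-up logits over three candidate next SID tokens.
teacher_in, student_in = build_inputs(
    "shoes for flat feet", ["stability", "arch support", "running"], "recent: sportswear"
)
loss = self_distillation_loss([2.0, 0.5, -1.0], [1.0, 0.8, -0.5])
```

Once training has pulled the student view toward the teacher view, only the student view is needed at serving time, which is consistent with the claim that the keyword reasoning adds no inference cost.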
Key Points
- Core innovations
- Thought-augmented query understanding: LLMs produce compact, high-information keyword-based CoTs for each ⟨query, user⟩ pair. These CoTs serve both as inference-time signals for ambiguous/long-tail queries and as privileged teacher-side context during training.
- Reasoning-internalized self-distillation: an information-asymmetric self-distillation where a CoT-augmented teacher guides a student that only sees the raw query; training uses R-Drop (prediction consistency) and FGM (adversarial robustness) in a unified forward pass to internalize reasoning into model weights (latent reasoning) without architectural additions or extra tokens.
- Behavior-feedback preference alignment: removes the separately trained reward model and instead optimizes directly on composite user-interaction signals (query-item relevance plus behavior); it adds the SID (semantic ID) overlap rate as an auxiliary reward and a token-position marginal advantage that respects hierarchical SID generation.
- Tokenization finding
- For e-commerce search, unimodal (text-centric) tokenization with hierarchical keyword quantization (KHQE) outperforms multimodal encodings at comparable model sizes due to cross-modal noise and redundancy. KHQE gave better Recall@10 / MRR@10 in experiments.
- Empirical results
- Offline: higher recall and ranking for complex intents, improvements on long-tail and ambiguous queries.
- Online (representative metrics reported): +3.98% item CTR, +1.17% PV CTR, +2.90% buyer conversion rate (the abstract reports +3.05%), +2.11% order volume, +3.45% GMV. Manual eval: +1.65% page good rate, +1.37% query–item relevance.
- Operational advantages
- No additional inference cost or serving latency in deployment.
- Supports streaming updates to adapt quickly to new queries/intents.
- Mitigates reward-hacking and distributional bias typical of log-fitted reward models.
- Open resources
- Code and dataset cases released: https://github.com/benchen4395/onesearch-family.
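The two regularizers named among the core innovations, R-Drop and FGM, are standard techniques and can be illustrated on a toy linear classifier. This sketch is an assumption-laden stand-in, not the OneSearch-V2 trainer: the weights, embedding, dropout rate, and epsilon are made up, and the two terms are computed as separate steps rather than fused into the unified forward pass the summary describes. R-Drop appears as the symmetric KL between two dropout passes on the same input; FGM appears as the perturbation epsilon * g / ||g||_2 added to the input embedding along the loss gradient g.

```python
import math
import random
from typing import List, Sequence


def softmax(z: Sequence[float]) -> List[float]:
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q); assumes q > 0 wherever p > 0 (true for softmax outputs here)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def r_drop_loss(p1: Sequence[float], p2: Sequence[float]) -> float:
    """R-Drop consistency term: symmetric KL between two stochastic (dropout) passes."""
    return 0.5 * (kl(p1, p2) + kl(p2, p1))


def dropout(x: Sequence[float], rate: float, rng: random.Random) -> List[float]:
    """Inverted dropout: zero each unit with probability `rate`, rescale survivors."""
    return [0.0 if rng.random() < rate else v / (1.0 - rate) for v in x]


def linear_logits(W: Sequence[Sequence[float]], x: Sequence[float]) -> List[float]:
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def cross_entropy(W, x, label: int) -> float:
    return -math.log(softmax(linear_logits(W, x))[label])


def grad_wrt_input(W, x, label: int) -> List[float]:
    """For logits = W x with cross-entropy, dL/dx = W^T (softmax(W x) - onehot(label))."""
    p = softmax(linear_logits(W, x))
    delta = [pk - (1.0 if k == label else 0.0) for k, pk in enumerate(p)]
    return [sum(W[k][j] * delta[k] for k in range(len(W))) for j in range(len(x))]


def fgm_perturbation(grad: Sequence[float], epsilon: float) -> List[float]:
    """FGM: r_adv = epsilon * g / ||g||_2, to be added to the input embedding."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0] * len(grad)
    return [epsilon * g / norm for g in grad]


# Toy setup (made-up numbers): a 3-dim "embedding" and a 2-class linear head.
rng = random.Random(0)
W = [[0.5, -0.2, 0.1], [-0.3, 0.4, 0.2]]
x = [1.0, 0.5, -0.5]
label = 0

# R-Drop: same input, two independent dropout masks, penalize disagreement.
p1 = softmax(linear_logits(W, dropout(x, 0.3, rng)))
p2 = softmax(linear_logits(W, dropout(x, 0.3, rng)))
consistency = r_drop_loss(p1, p2)

# FGM: perturb the embedding along the normalized loss gradient.
r_adv = fgm_perturbation(grad_wrt_input(W, x, label), epsilon=0.05)
x_adv = [v + dv for v, dv in zip(x, r_adv)]
adv_loss = cross_entropy(W, x_adv, label)
```

Because the cross-entropy of a linear model is convex in its input, stepping along the normalized gradient is guaranteed to increase the loss, so `adv_loss` exceeds the clean loss. That worst-case direction is exactly what FGM trains the model to withstand.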
Data & Methods
- Data
- Large-scale industrial dataset from Kuaishou Mall (e.g., ~5M online clicked ⟨query, item⟩ pairs in the tokenization experiments); the full paper also reports extensive platform-level A/B tests.
- Evaluation metrics: Recall@10 and MRR@10 (offline); item and page CTR, buyer conversion rate, order volume, and GMV (online); manual quality metrics (page good rate, query-item relevance).
- Methods / pipeline details
- Thought-augmented query understanding:
- Use LLMs to generate constrained CoTs and extract dense keyword sets (keyword-based CoT) capturing intent, category, attributes, negative constraints, substitutes, and personalization cues.
- Inject these keywords as auxiliary input at inference for hard/long-tail queries or use them as privileged teacher input in training.
- Self-distillation:
- Construct information asymmetry: the teacher sees {query + keyword CoT + user context}, while the student sees {query}, possibly plus user context. Align the student's predictions to the teacher's via logit-level self-distillation objectives.
- Regularize with R-Drop for consistency and FGM for input robustness; unified forward pass design to reduce compute during training.
- No extra modules or tokens are required: reasoning is encoded into the model weights (latent reasoning).
- Preference alignment:
- Replace separately trained reward model with composite real user-feedback signals as direct rewards (conversion signals, relevance, format validity).
- Introduce SID overlap rate as auxiliary reward to enforce valid hierarchical SID generation.
- Token-position marginal advantage: assign learning signal respecting hierarchical nature of SID generation (prefix correctness matters differently than suffix).
- Support streaming updates for reward composition and model adaptation.
- Tokenization comparison
- Compared unimodal text encoders (BGE, Qwen), multimodal encoders (Qwen-VL, CLIP variants), and KHQE (keyword hierarchical quantization + BGE). KHQE had the best recall and MRR while keeping model size and latency favorable.
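The summary does not give formulas for the SID-overlap reward or the token-position marginal advantage, so the following is only one plausible reading, and both definitions are hypothetical. Overlap is taken as the fraction of hierarchical SID levels matched as a contiguous prefix of the target. Each token's advantage is the marginal increase in that overlap it contributes, scaled by a sequence-level reward. The sketch illustrates why prefix correctness matters differently from suffix correctness in hierarchical SID generation.

```python
from typing import List, Sequence


def prefix_match_len(generated: Sequence[int], target: Sequence[int]) -> int:
    """Number of leading SID levels that agree with the target."""
    n = 0
    for g, t in zip(generated, target):
        if g != t:
            break
        n += 1
    return n


def sid_overlap_rate(generated: Sequence[int], target: Sequence[int]) -> float:
    """Hypothetical SID-overlap reward: share of target levels matched as a prefix."""
    return prefix_match_len(generated, target) / len(target)


def token_position_marginal_advantage(
    generated: Sequence[int], target: Sequence[int], sequence_reward: float
) -> List[float]:
    """Hypothetical credit assignment for hierarchical SIDs.

    Token k receives the marginal gain in prefix overlap it adds, scaled by the
    sequence-level (e.g. behavior-feedback) reward. A token after the first
    mismatch adds no overlap, so it receives zero credit.
    """
    advantages: List[float] = []
    previous = 0.0
    for k in range(1, len(generated) + 1):
        current = sid_overlap_rate(generated[:k], target)
        advantages.append(sequence_reward * (current - previous))
        previous = current
    return advantages


# Three-level SIDs (coarse -> fine); the codes are made up.
target_sid = (12, 7, 9)
adv_partial = token_position_marginal_advantage((12, 7, 3), target_sid, sequence_reward=1.0)
adv_wrong_prefix = token_position_marginal_advantage((5, 7, 9), target_sid, sequence_reward=1.0)
```

Under this reading, correct coarse-level tokens earn credit, while a fine-level token earns nothing once an earlier level is wrong, even if it matches the target at its own position. The per-token advantages telescope, so they sum to the sequence reward times the overlap rate.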
Implications for AI Economics
- Direct business value and monetization
- Measurable uplift in CTR, conversion, order volume, and GMV translates to higher short-term revenue and improved ARPU for the platform — a concrete ROI from investing in latent reasoning + self-distillation.
- Improvements on long-tail queries increase monetization of niche inventory and reduce concentration on head sellers, potentially boosting seller-side income diversification.
- Cost-efficiency and scalability
- The design internalizes expensive LLM reasoning into a smaller deployed model via offline teacher generation and self-distillation, meaning high-quality LLM reasoning can be amortized without raising inference costs. This reduces serving compute and cost per query at scale, increasing margin.
- Market structure and competition
- Better long-tail retrieval and mitigated information bubbles can enhance market access for long-tail merchants, changing competitive dynamics and reducing winner-take-most effects.
- Faster streaming updates to preference signals enable quicker adaptation to trends and seasonal demand, improving platform responsiveness and competitiveness.
- Incentives, fairness, and externalities
- Replacing a stand-alone learned reward model with direct behavior feedback lowers risk of reward-hacking and historical bias amplification, improving allocative efficiency of user attention. But direct use of behavior signals raises privacy and feedback-loop concerns (e.g., manipulation via synthetic behaviors, advertisers adjusting to new signals).
- Altering ranking and visibility could change seller incentives (e.g., optimizing listings for keyword-based CoTs), potentially producing second-order market effects that platforms should monitor.
- Transferability and cost trade-offs
- The approach (LLM-generated privileged supervision + student distillation) is economically attractive: one-time/periodic LLM computation cost (offline) versus large recurring serving costs. Platforms can trade off offline LLM expense against long-term serving savings and revenue uplift.
- Policy and measurement consequences
- Streaming reward composition and auxiliary metrics (e.g., SID overlap) allow multi-objective optimization (relevance vs conversion vs format validity), giving platforms tools to optimize policy objectives (user satisfaction, seller diversity, revenue) explicitly and measure their economic trade-offs.
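The amortization argument in the cost-efficiency and transferability points reduces to break-even arithmetic. The sketch assumes a counterfactual in which LLM reasoning would otherwise run online for every query. The function names and all cost and traffic figures are hypothetical, not taken from the paper. For periodic CoT regeneration, the same calculation applies per refresh cycle.

```python
import math


def break_even_queries(
    offline_cost: float,
    online_llm_cost_per_query: float,
    student_cost_per_query: float,
) -> int:
    """Queries needed before a one-off offline LLM spend is repaid by cheaper serving.

    Hypothetical counterfactual: the alternative is running LLM reasoning online
    for every query, versus serving only the distilled student model.
    """
    saving_per_query = online_llm_cost_per_query - student_cost_per_query
    if saving_per_query <= 0:
        raise ValueError("distillation only pays off if the student is cheaper per query")
    return math.ceil(offline_cost / saving_per_query)


def payback_days(
    offline_cost: float,
    daily_queries: int,
    online_llm_cost_per_query: float,
    student_cost_per_query: float,
) -> float:
    """Days of traffic needed to recover the offline cost at a fixed daily volume."""
    needed = break_even_queries(offline_cost, online_llm_cost_per_query, student_cost_per_query)
    return needed / daily_queries


# Illustrative, made-up figures: $50k of offline CoT generation,
# $0.002 vs $0.0002 serving cost per query, 10M queries per day.
queries_needed = break_even_queries(50_000.0, 0.002, 0.0002)
days_needed = payback_days(50_000.0, 10_000_000, 0.002, 0.0002)
```

With these illustrative figures the offline spend is recovered after roughly 28 million queries, under three days of traffic. The conclusion is sensitive to the per-query cost gap, and the function rejects inputs where the student is not cheaper.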
Overall, OneSearch-V2 demonstrates a pragmatic industry pattern: use powerful but costly LLMs offline to create high-value supervision (keyword CoTs), compress that reasoning into a deployable model through self-distillation, and replace brittle learned reward models with direct behavior-based alignment. The result is both performance and cost benefits, with meaningful economic impact for a large commerce platform.
Assessment
Claims (12)
| Claim | Category | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|---|
| Generative Retrieval (GR) offers advantages over multi-stage cascaded architectures such as end-to-end joint optimization and high computational efficiency. | Organizational Efficiency | positive | high | computational efficiency and ability to perform end-to-end joint optimization | 0.6 |
| OneSearch, as a representative industrial-scale deployed generative search framework, has brought significant commercial and operational benefits. | Firm Revenue | positive | high | commercial and operational benefits | 0.3 |
| OneSearch-V2 increases item CTR by +3.98% in online A/B tests. | Firm Revenue | positive | high | item CTR | +3.98% item CTR; 0.6 |
| OneSearch-V2 increases buyer conversion rate by +3.05% in online A/B tests. | Firm Revenue | positive | high | buyer conversion rate | +3.05% buyer conversion rate; 0.6 |
| OneSearch-V2 increases order volume by +2.11% in online A/B tests. | Firm Revenue | positive | high | order volume | +2.11% order volume; 0.6 |
| Manual evaluation confirms gains in search experience quality, with +1.65% in page good rate. | Output Quality | positive | high | page good rate | +1.65% in page good rate; 0.6 |
| Manual evaluation confirms gains in query-item relevance, with +1.37%. | Output Quality | positive | high | query-item relevance | +1.37% in query-item relevance; 0.6 |
| OneSearch-V2 effectively mitigates common search system issues such as information bubbles and long-tail sparsity, without incurring additional inference costs or serving latency. | Consumer Welfare | positive | high | information bubbles and long-tail sparsity (and inference/serving latency) | 0.3 |
| OneSearch-V2 includes a thought-augmented complex query understanding module that enables deep query understanding and overcomes the shallow semantic matching limitations of direct inference. | Output Quality | positive | high | query understanding capability (depth of understanding vs. shallow semantic matching) | 0.1 |
| OneSearch-V2 contains a reasoning-internalized self-distillation training pipeline that uncovers users' potential yet precise e-commerce intentions beyond log-fitting through implicit in-context learning. | Output Quality | positive | high | ability to infer latent user intent beyond behavior logs | 0.1 |
| OneSearch-V2 introduces a behavior preference alignment optimization system which mitigates reward hacking arising from the single conversion metric and addresses personal preference via direct user feedback. | Decision Quality | positive | high | mitigation of reward hacking from single-metric optimization and alignment with personal preferences | 0.1 |
| Extensive offline evaluations demonstrate OneSearch-V2's strong query recognition and user profiling capabilities. | Output Quality | positive | high | query recognition and user profiling performance | 0.6 |