A redesigned generative search engine, OneSearch-V2, boosted on-site engagement and sales in live experiments—raising item clicks by ~4%, buyer conversion by ~3%, and order volume by ~2%—achieving these gains through latent-reasoning query understanding and self-distillation without increasing latency.

OneSearch-V2: The Latent Reasoning Enhanced Self-distillation Generative Search Framework

Ben Chen, Siyuan Wang, Yufei Ma, Zihan Liang, Xuxin Zhang, Yue Lv, Ying Yang, Huangyu Dai, Lingtao Mao, Tong Zhao, Zhipeng Qian, Xinyu Sun, Zhixin Zhai, Yang Zhao, Bochao Liu, Jingshan Lv, Xiao Liang, Hui Kong, Jing Chen, Han Li, Chenyi Lei, Wenwu Ou, Kun Gai · March 25, 2026

Source: arXiv · Type: RCT · Evidence: medium · Relevance: 8/10

OneSearch-V2, a generative-search system enhanced with latent reasoning and self-distillation, produced measurable business gains in live A/B tests—+3.98% item CTR, +3.05% buyer conversion rate, and +2.11% order volume—while improving perceived search quality without added latency.

Generative Retrieval (GR) has emerged as a promising paradigm for modern search systems. Compared to multi-stage cascaded architecture, it offers advantages such as end-to-end joint optimization and high computational efficiency. OneSearch, as a representative industrial-scale deployed generative search framework, has brought significant commercial and operational benefits. However, its inadequate understanding of complex queries, inefficient exploitation of latent user intents, and overfitting to narrow historical preferences have limited its further performance improvement. To address these challenges, we propose OneSearch-V2, a latent reasoning enhanced self-distillation generative search framework. It contains three key innovations: (1) a thought-augmented complex query understanding module, which enables deep query understanding and overcomes the shallow semantic matching limitations of direct inference; (2) a reasoning-internalized self-distillation training pipeline, which uncovers users' potential yet precise e-commerce intentions beyond log-fitting through implicit in-context learning; (3) a behavior preference alignment optimization system, which mitigates reward hacking arising from the single conversion metric, and addresses personal preference via direct user feedback. Extensive offline evaluations demonstrate OneSearch-V2's strong query recognition and user profiling capabilities. Online A/B tests further validate its business effectiveness, yielding +3.98% item CTR, +3.05% buyer conversion rate, and +2.11% order volume. Manual evaluation further confirms gains in search experience quality, with +1.65% in page good rate and +1.37% in query-item relevance. More importantly, OneSearch-V2 effectively mitigates common search system issues such as information bubbles and long-tail sparsity, without incurring additional inference costs or serving latency.

Summary

Main Finding

OneSearch-V2 is a generative e-commerce search framework that internalizes LLM reasoning via keyword-based chain-of-thought (CoT) and an information-asymmetric self-distillation pipeline, and replaces an external reward model with direct behavior-feedback preference alignment. The system improves retrieval and commercial performance at no extra inference cost, and reduces long-tail sparsity and information-bubble effects. Online A/B tests on Kuaishou report substantial business gains (e.g., about +4% item CTR, about +3% buyer conversion, and +3.45% GMV) while keeping deployment latency unchanged.

Key Points

Core innovations
- Thought-augmented query understanding: LLMs produce compact, high-information keyword-based CoTs for each ⟨query, user⟩ pair. These CoTs serve both as inference-time signals for ambiguous/long-tail queries and as privileged teacher-side context during training.
- Reasoning-internalized self-distillation: an information-asymmetric self-distillation scheme in which a CoT-augmented teacher guides a student that sees only the raw query. Training combines R-Drop (prediction consistency) and FGM (adversarial robustness) in a unified forward pass, internalizing reasoning into the model weights (latent reasoning) without architectural additions or extra tokens.
- Behavior-feedback preference alignment: removes the separately trained reward model and optimizes directly on composite user-interaction signals (query-item relevance plus behavior). It adds the SID (semantic ID) overlap rate as an auxiliary reward and a token-position marginal advantage that respects hierarchical SID generation.
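The self-distillation idea above can be illustrated with a minimal sketch. It assumes the teacher and student produce logits over the same candidate set, uses the standard two-pass form of R-Drop (symmetric KL between two stochastic student passes), and the FGM step r = epsilon * g / ||g||. The function names and the weights `alpha` and `beta` are hypothetical, and the summary does not describe how the paper merges these terms into a single forward pass, so this is not the authors' implementation.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def cross_entropy(probs: List[float], target: int) -> float:
    """Negative log-likelihood of the target index."""
    return -math.log(probs[target])


def r_drop_term(p1: List[float], p2: List[float]) -> float:
    """Symmetric KL between two stochastic passes of the same model."""
    return 0.5 * (kl_divergence(p1, p2) + kl_divergence(p2, p1))


def fgm_perturbation(grad: List[float], epsilon: float) -> List[float]:
    """FGM step: scale the embedding gradient to L2 norm epsilon."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0] * len(grad)
    return [epsilon * g / norm for g in grad]


def student_loss(
    teacher_logits: List[float],    # teacher saw query + keyword CoT
    student_logits_a: List[float],  # student pass 1, raw query only
    student_logits_b: List[float],  # student pass 2, raw query only
    target: int,
    alpha: float = 1.0,  # hypothetical weight on the distillation term
    beta: float = 1.0,   # hypothetical weight on the R-Drop term
) -> float:
    """Cross-entropy + teacher-to-student KL + R-Drop consistency."""
    teacher = softmax(teacher_logits)
    s_a = softmax(student_logits_a)
    s_b = softmax(student_logits_b)
    ce = 0.5 * (cross_entropy(s_a, target) + cross_entropy(s_b, target))
    distill = 0.5 * (kl_divergence(teacher, s_a) + kl_divergence(teacher, s_b))
    consistency = r_drop_term(s_a, s_b)
    return ce + alpha * distill + beta * consistency
```

When the student already matches the teacher and both passes agree, the distillation and consistency terms vanish and only the cross-entropy remains, which is the sense in which the reasoning has been absorbed into the student's weights.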
Tokenization finding
- For e-commerce search, unimodal (text-centric) tokenization with hierarchical keyword quantization (KHQE) outperforms multimodal encodings at comparable model sizes due to cross-modal noise and redundancy. KHQE gave better Recall@10 / MRR@10 in experiments.
Empirical results
- Offline: higher recall and ranking for complex intents, improvements on long-tail and ambiguous queries.
- Online (representative metrics reported): +3.98% item CTR, +1.17% PV CTR, +2.90% buyer conversion rate (other reported variants: +3.05% buyer conversion), +2.11% order volume, +3.45% GMV. Manual eval: +1.65% page good rate, +1.37% query–item relevance.
Operational advantages
- No additional inference cost or serving latency in deployment.
- Supports streaming updates to adapt quickly to new queries/intents.
- Mitigates reward-hacking and distributional bias typical of log-fitted reward models.
Open resources
- Code and dataset cases released: https://github.com/benchen4395/onesearch-family.

Data & Methods

Data
- Large-scale industrial dataset from Kuaishou Mall (examples: ~5M online clicked ⟨query, item⟩ pairs used in tokenization experiments; full paper reports extensive platform-level A/B tests).
- Evaluations: Recall@10, MRR@10 (offline), CTR (item & page), buyer conversion rate, order volume, GMV, manual quality metrics (page good rate, query-item relevance).
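For readers unfamiliar with the offline metrics, here is a minimal sketch of Recall@K and MRR@K using their common textbook definitions. The paper's exact variants (for example, how multiple relevant items per query are handled) are not specified in this summary, so treat these definitions as an assumption; the function names are illustrative.

```python
from typing import Hashable, List, Sequence, Set


def recall_at_k(ranked: Sequence[Hashable], relevant: Set[Hashable], k: int = 10) -> float:
    """Share of the relevant items that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(ranked[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank_at_k(ranked: Sequence[Hashable], relevant: Set[Hashable], k: int = 10) -> float:
    """1 / rank of the first relevant item in the top-k, or 0 if none."""
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mean(values: List[float]) -> float:
    """Average a per-query metric over all queries."""
    return sum(values) / len(values) if values else 0.0
```

MRR@10 is then the mean of `reciprocal_rank_at_k` over all evaluation queries, and Recall@10 is the mean of `recall_at_k`.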
Methods / pipeline details
- Thought-augmented query understanding:
  - Use LLMs to generate constrained CoTs and extract dense keyword sets (keyword-based CoT) capturing intent, category, attributes, negative constraints, substitutes, and personalization cues.
  - Inject these keywords as auxiliary input at inference for hard/long-tail queries or use them as privileged teacher input in training.
- Self-distillation:
  - Construct information asymmetry: the teacher sees {query + keyword CoT + user context}, while the student sees only the raw query (whether the student also receives user context is not clear from this summary). Align student predictions to the teacher via logit-level self-distillation objectives.
  - Regularize with R-Drop for consistency and FGM for input robustness; unified forward pass design to reduce compute during training.
  - No extra modules/tokens required — reasoning encoded into model weights (latent reasoning).
- Preference alignment:
  - Replace separately trained reward model with composite real user-feedback signals as direct rewards (conversion signals, relevance, format validity).
  - Introduce SID overlap rate as auxiliary reward to enforce valid hierarchical SID generation.
  - Token-position marginal advantage: assign learning signal respecting hierarchical nature of SID generation (prefix correctness matters differently than suffix).
  - Support streaming updates for reward composition and model adaptation.
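The preference-alignment bullets above can be made concrete with a small sketch. The summary does not define the SID overlap rate or the token-position marginal advantage, so everything here is an assumption: overlap is taken to be the length of the shared prefix divided by the target SID length, the reward weights are hypothetical, and the per-position credit is simply the increase in prefix overlap contributed by each generated token.

```python
from typing import List, Sequence


def prefix_overlap_rate(generated: Sequence[int], target: Sequence[int]) -> float:
    """Assumed definition: shared-prefix length divided by target length."""
    if not target:
        return 0.0
    matched = 0
    for g, t in zip(generated, target):
        if g != t:
            break
        matched += 1
    return matched / len(target)


def composite_reward(
    relevance: float,
    behavior: float,
    overlap: float,
    w_rel: float = 0.4,  # hypothetical weights, not from the paper
    w_beh: float = 0.4,
    w_sid: float = 0.2,
) -> float:
    """Weighted sum of relevance, behavior feedback and SID overlap."""
    return w_rel * relevance + w_beh * behavior + w_sid * overlap


def marginal_prefix_credit(generated: Sequence[int], target: Sequence[int]) -> List[float]:
    """Per-position credit: how much each token extends the correct prefix."""
    credits: List[float] = []
    previous = 0.0
    for t in range(1, len(generated) + 1):
        current = prefix_overlap_rate(generated[:t], target)
        credits.append(current - previous)
        previous = current
    return credits
```

Under these assumptions, tokens generated after the first mismatch receive no credit, which reflects the point that an error in a coarse prefix token invalidates the finer tokens that follow it.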
Tokenization comparison
- Compared unimodal text encoders (BGE, Qwen), multimodal encoders (Qwen-VL, CLIP variants), and KHQE (keyword hierarchical quantization + BGE). KHQE had the best recall and MRR while keeping model size and latency favorable.

Implications for AI Economics

Direct business value and monetization
- Measurable uplift in CTR, conversion, order volume, and GMV translates to higher short-term revenue and improved ARPU for the platform — a concrete ROI from investing in latent reasoning + self-distillation.
- Improvements on long-tail queries increase monetization of niche inventory and reduce concentration on head sellers, potentially boosting seller-side income diversification.
Cost-efficiency and scalability
- The design internalizes expensive LLM reasoning into a smaller deployed model via offline teacher generation and self-distillation, so high-quality LLM reasoning can be amortized without raising inference costs. This avoids the serving compute and per-query cost that online LLM reasoning would otherwise add, which protects margin at scale.
Market structure and competition
- Better long-tail retrieval and mitigated information bubbles can enhance market access for long-tail merchants, changing competitive dynamics and reducing winner-take-most effects.
- Faster streaming updates to preference signals enable quicker adaptation to trends and seasonal demand, improving platform responsiveness and competitiveness.
Incentives, fairness, and externalities
- Replacing a stand-alone learned reward model with direct behavior feedback lowers risk of reward-hacking and historical bias amplification, improving allocative efficiency of user attention. But direct use of behavior signals raises privacy and feedback-loop concerns (e.g., manipulation via synthetic behaviors, advertisers adjusting to new signals).
- Altering ranking and visibility could change seller incentives (e.g., optimizing listings for keyword-based CoTs), potentially producing second-order market effects that platforms should monitor.
Transferability and cost trade-offs
- The approach (LLM-generated privileged supervision + student distillation) is economically attractive: one-time/periodic LLM computation cost (offline) versus large recurring serving costs. Platforms can trade off offline LLM expense against long-term serving savings and revenue uplift.
Policy and measurement consequences
- Streaming reward composition and auxiliary metrics (e.g., SID overlap) allow multi-objective optimization (relevance vs conversion vs format validity), giving platforms tools to optimize policy objectives (user satisfaction, seller diversity, revenue) explicitly and measure their economic trade-offs.

Overall, OneSearch-V2 demonstrates a pragmatic industry pattern: use powerful but costly LLMs offline to create high-value supervision (keyword CoTs), compress that reasoning into a deployable model through self-distillation, and replace brittle learned reward models with direct behavior-based alignment—yielding both performance and cost benefits with meaningful economic impact for a large commerce platform.

Assessment

Paper Type: rct

Evidence Strength: medium. Uses live production A/B tests with business outcomes (CTR, conversion, order volume), which provide credible causal evidence in principle, but the paper omits key details (randomization protocol, sample sizes, test duration, statistical significance, pre-registration, heterogeneity analyses), limiting confidence in external validity and robustness.

Methods Rigor: medium. The work combines system design, offline evaluation, manual labeling, and live experiments, an appropriate mixed-methods approach for an industrial deployment, but lacks transparent reporting of experimental design, inference statistics, baseline definitions, and robustness checks expected in rigorous empirical research.

Sample: Live traffic on a production industrial-scale e-commerce search system (OneSearch) used for online A/B testing; also includes proprietary offline datasets for evaluation and manual human judgments for quality assessment; exact user counts, time windows, and geographic/user-segmentation details are not reported.

Themes: productivity adoption

Identification: Online A/B test (treatment = OneSearch-V2 vs baseline) on production e-commerce traffic, supplemented by offline evaluations and manual annotation; causal claims rest on randomized experiment design implied by 'A/B tests'.

Generalizability
- Single proprietary e-commerce platform: results may not generalize to other platforms or industries.
- E-commerce search context (product discovery and purchase): may not apply to non-commerce search tasks.
- Unknown geographic and demographic coverage: effects may vary across user populations.
- Proprietary system implementation and hyperparameters: replication by other providers could yield different magnitudes.
- Short-term A/B test effects (possible novelty or short-term engagement boosts): long-term impact unclear.
- Limited reporting on heterogeneity: unclear how effects vary by query type, user segment, or product category.
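To show what the missing inference statistics would involve, here is a minimal sketch of a relative-lift calculation and a pooled two-proportion z-test for a click-through metric. This is a generic textbook test, not the procedure used in the paper, and any counts passed to it are hypothetical inputs supplied by the reader.

```python
import math
from typing import Tuple


def relative_lift(control_rate: float, treatment_rate: float) -> float:
    """Relative change of the treatment rate over the control rate."""
    return (treatment_rate - control_rate) / control_rate


def two_proportion_z_test(
    success_c: int, n_c: int, success_t: int, n_t: int
) -> Tuple[float, float]:
    """Pooled two-proportion z statistic and two-sided p-value."""
    p_c = success_c / n_c
    p_t = success_t / n_t
    pooled = (success_c + success_t) / (n_c + n_t)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_c + 1.0 / n_t))
    z = (p_t - p_c) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value
```

A reported lift such as +3.98% is only interpretable alongside the sample sizes behind it, because the same relative lift can be highly significant or indistinguishable from noise depending on traffic volume.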

Claims (12)

Claim	Category	Direction	Confidence	Outcome	Details
Generative Retrieval (GR) offers advantages over multi-stage cascaded architectures such as end-to-end joint optimization and high computational efficiency.	Organizational Efficiency	positive	high	computational efficiency and ability to perform end-to-end joint optimization	0.6
OneSearch, as a representative industrial-scale deployed generative search framework, has brought significant commercial and operational benefits.	Firm Revenue	positive	high	commercial and operational benefits	0.3
OneSearch-V2 increases item CTR by +3.98% in online A/B tests.	Firm Revenue	positive	high	item CTR	+3.98% item CTR 0.6
OneSearch-V2 increases buyer conversion rate by +3.05% in online A/B tests.	Firm Revenue	positive	high	buyer conversion rate	+3.05% buyer conversion rate 0.6
OneSearch-V2 increases order volume by +2.11% in online A/B tests.	Firm Revenue	positive	high	order volume	+2.11% order volume 0.6
Manual evaluation confirms gains in search experience quality, with +1.65% in page good rate.	Output Quality	positive	high	page good rate	+1.65% in page good rate 0.6
Manual evaluation confirms gains in query-item relevance, with +1.37%.	Output Quality	positive	high	query-item relevance	+1.37% in query-item relevance 0.6
OneSearch-V2 effectively mitigates common search system issues such as information bubbles and long-tail sparsity, without incurring additional inference costs or serving latency.	Consumer Welfare	positive	high	information bubbles and long-tail sparsity (and inference/serving latency)	0.3
OneSearch-V2 includes a thought-augmented complex query understanding module that enables deep query understanding and overcomes the shallow semantic matching limitations of direct inference.	Output Quality	positive	high	query understanding capability (depth of understanding vs. shallow semantic matching)	0.1
OneSearch-V2 contains a reasoning-internalized self-distillation training pipeline that uncovers users' potential yet precise e-commerce intentions beyond log-fitting through implicit in-context learning.	Output Quality	positive	high	ability to infer latent user intent beyond behavior logs	0.1
OneSearch-V2 introduces a behavior preference alignment optimization system which mitigates reward hacking arising from the single conversion metric and addresses personal preference via direct user feedback.	Decision Quality	positive	high	mitigation of reward hacking from single-metric optimization and alignment with personal preferences	0.1
Extensive offline evaluations demonstrate OneSearch-V2's strong query recognition and user profiling capabilities.	Output Quality	positive	high	query recognition and user profiling performance	0.6