Adding query-aware supervision to discrete semantic identifiers (SIDs) lifts e-commerce search relevance and conversions: a hybrid system that combines contrastive residual quantization with LLM-predicted SIDs raised offline AUC by 1.54 percentage points and produced measurable online gains (+0.13% UCTR, +0.25% UCTCVR) in Tmall production tests.
Despite rapid progress of continuous embeddings for e-commerce search relevance, a long-standing open problem is the difficulty in capturing fine-grained attribute distinctions. While discrete Semantic Identifiers (SIDs) have been widely adopted as a promising alternative, existing SID generation methods rely heavily on unsupervised quantization. In realistic scenarios, the lack of explicit supervision often makes it more difficult to dictate which items should share an SID, resulting in limited capability for query-dependent ranking. To address the issue of unsupervised SIDs, we propose to explicitly model discrete relevance features and develop a Discrete Semantic Identifier Relevance Model (DSIRM). Specifically, we present a query-bridged contrastive quantization approach on the item side, injecting query-item interaction supervision into Residual Quantization to actively learn relevance-aware semantic partitions. On the other hand, we explore generative LLMs on the query side to explicitly predict item SIDs from text, resolving tail queries and intent ambiguity. Hierarchical prefix matching between query and item SIDs yields discriminative features that perfectly complement dense signals. Extensive experimental results on Tmall's production data show that our proposed approach has achieved better results, improving offline AUC by +1.54%. Deployed via an efficient hybrid architecture, it achieves significant online lifts (+0.13% UCTR, +0.25% UCTCVR), proving its massive industrial value.
Summary
Main Finding
DSIRM introduces query-bridged discrete Semantic Identifiers (SIDs) as structured, relevance-aware features for e-commerce ranking. By injecting query-item interaction supervision into a hierarchical residual quantizer (RQ‑VAE) and pairing that with an LLM that generates likely query SIDs, DSIRM produces discrete features that complement dense embeddings and materially improve ranking quality. On Tmall production data, DSIRM raises offline AUC by 1.54 percentage points and delivers significant online lifts (+0.13% UCTR, +0.25% UCTCVR).
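To make the "hierarchical residual quantizer" concrete, the following is a minimal sketch of greedy residual quantization, in which each level encodes what the previous levels left unexplained and the SID is the tuple of chosen code indices. It is an illustration under stated assumptions, not the paper's implementation: the codebooks here are random stand-ins for learned ones, the latent size `dim` is hypothetical, and the category-aware first level and all training losses are omitted. Only the codebook sizes (216, 512, 512) come from the summary.

```python
import numpy as np


def residual_quantize(z, codebooks):
    """Greedy residual quantization: return one code index per level plus the final residual."""
    residual = np.asarray(z, dtype=np.float64)
    sid = []
    for codebook in codebooks:
        # Pick the nearest code vector (squared Euclidean distance) at this level.
        distances = np.sum((codebook - residual) ** 2, axis=1)
        index = int(np.argmin(distances))
        sid.append(index)
        # The next level quantizes whatever this level failed to explain.
        residual = residual - codebook[index]
    return tuple(sid), residual


rng = np.random.default_rng(0)
dim = 8  # hypothetical latent size, chosen only for illustration
sizes = (216, 512, 512)  # K1, K2, K3 as reported in the summary
# Random codebooks stand in for learned ones (assumption).
codebooks = [rng.normal(scale=1.0 / (level + 1), size=(k, dim))
             for level, k in enumerate(sizes)]

z = rng.normal(size=dim)  # stand-in for a projected item embedding
sid, residual = residual_quantize(z, codebooks)
print(sid)  # a 3-tuple of code indices, one per level
```

Because every level refines the previous one, two items that share a longer SID prefix are closer in the quantized space, which is what later makes prefix matching meaningful.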
Key Points
- Problem: continuous embeddings entangle multiple semantic facets and struggle with fine-grained, query-dependent distinctions (especially in e‑commerce where titles are homogeneous and queries are short/ambiguous).
- Repositioning SIDs: treat hierarchical discrete codes not as retrieval-only targets but as explicit, structured relevance features to augment DNN ranking.
- Item-side learning (Contrastive RQ‑VAE):
  - Dual-tower setup: frozen pre-trained query/item embeddings are projected into a shared latent space, then quantized by a shared hierarchical RQ‑VAE.
  - Query-bridged contrastive objective (InfoNCE) is applied to quantized representations so quantizer partitions reflect query-item co-occurrence / relevance, not just unsupervised geometry.
  - Loss = InfoNCE + commitment + reconstruction (weights: λ_InfoNCE=1.0, λ_commit=1.0, λ_recon=0.1).
  - Category-aware first-level codebook: first-level code is constrained/mapped to primary category to protect tail categories (K1 aligned to 216 top categories).
  - Final SID is the concatenation of code indices across L levels (example L=3).
- Query-side learning (Generative LLM):
  - Fine-tune an autoregressive LLM to predict item SIDs conditioned on a query using historical query↔item logs (supervised cross-entropy).
  - At inference, generate top-K SID sequences (beam search) to capture intent ambiguity.
- Relevance scoring:
  - Compute hierarchical prefix matching between the generated query SIDs and the item SID; deeper prefix matches yield a higher discrete SID score (example mapping from match depth {0, 1, 2, 3} to score {0.0, 0.25, 0.5, 1.0}).
  - Embed the discrete score and concatenate it with dense features (dm, mm_dm, ct, qs) as input to a 3‑layer MLP DNN for ordinal relevance prediction.
- Empirical results & ablations:
  - Large-scale production experiments (Tmall): SID learning used ~80M query-item pairs; relevance training used 1.6M labeled pairs (LLM-labeled with ~94% human consistency); test = 100k.
  - RQ‑VAE topology: K1=216, K2=512, K3=512. Training hyperparams: batch=256, τ=0.07.
  - LLM: Qwen3-0.6B fine-tuned (batch 128), beam K=5 at inference.
  - DSIRM (static pre-trained embeddings + query-bridged RQ‑VAE) achieved AUC 0.9356 vs baseline 0.9202 (+1.54 pp).
  - Ablations: removing the contrastive bridge or the category constraint degrades AUC (contrastive learning and category-aware allocation are important). Static pre-trained embeddings for RQ‑VAE outperformed dynamic encoding.
  - Online production lifts after deployment: +0.13% UCTR, +0.25% UCTCVR.
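The prefix-matching step described under "Relevance scoring" can be sketched as follows. This is a hedged illustration rather than the paper's code: the depth-to-score mapping is the example given above, while taking the maximum over the top-K generated query SIDs is an assumption (the summary does not say how beams are aggregated), and the function names and SID values are hypothetical.

```python
# Example mapping from matched prefix depth to discrete SID score (from the summary).
DEPTH_TO_SCORE = {0: 0.0, 1: 0.25, 2: 0.5, 3: 1.0}


def prefix_depth(query_sid, item_sid):
    """Count how many leading code indices the two SIDs share."""
    depth = 0
    for q_code, i_code in zip(query_sid, item_sid):
        if q_code != i_code:
            break
        depth += 1
    return depth


def sid_score(query_sids, item_sid):
    """Score an item against the top-K query SIDs.

    Assumption: the best (deepest) match across beams is used.
    """
    best = max((prefix_depth(q, item_sid) for q in query_sids), default=0)
    return DEPTH_TO_SCORE[best]


# Hypothetical SIDs: two beams for one query, scored against three items.
query_sids = [(12, 40, 7), (12, 99, 3)]
print(sid_score(query_sids, (12, 40, 300)))  # → 0.5 (first two levels match beam 1)
print(sid_score(query_sids, (12, 99, 3)))    # → 1.0 (full match with beam 2)
print(sid_score(query_sids, (5, 40, 7)))     # → 0.0 (first level differs)
```

Because the comparison stops at the first mismatch, a match at level 2 only counts when level 1 also matches, which mirrors the coarse-to-fine structure of the residual codes.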
Data & Methods
- Data
  - SID learning: ~80M query-item interaction pairs from Tmall logs.
  - Relevance scoring: 1.6M labeled pairs (labels generated by a Qwen3-30B model with careful prompts; validation shows ~94% agreement with human judgments); test 100k.
- Model components & hyperparameters
  - Pretrained dual-encoder relevance embeddings are frozen; learnable projection encoders map to R^d latent space.
  - RQ‑VAE: hierarchical residual quantization with L levels; example L=3 with codebook sizes K1=216 (category-aligned), K2=512, K3=512.
  - Joint training objective: L = λ_InfoNCE L_InfoNCE + λ_commit L_commit + λ_recon L_recon (used values: 1.0, 1.0, 0.1).
  - Contrastive objective: symmetric InfoNCE applied to quantized representations of queries and items to form the “query bridge.”
  - Category-aware codebook: first-level code fixed/mapped per primary category and updated by EMA to stabilize tail categories.
  - Query SID generator: autoregressive LLM (Qwen3-0.6B in experiments) fine-tuned to predict SID sequences; beam search with top-K SIDs used at inference to represent intent ambiguity.
  - Feature integration: hierarchical prefix match score (discrete ss) embedded and concatenated with dense features (dm, mm_dm, ct, qs) feeding a 3‑layer MLP for final relevance logits.
- Evaluation
  - Metrics: Precision, Recall, AUC; production metrics: UCTR, UCTCVR.
  - Baselines: a no-SID baseline and prior SID methods (DSI, TIGER), reused with the same query-side generator and downstream DNN to isolate the effect of item-SID quality.
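The contrastive "query bridge" and the weighted joint objective listed above can be sketched in NumPy as follows. This is a rough illustration, not the paper's implementation: the use of cosine similarity, in-batch negatives, and a simple average of the two directions are assumptions, and the function names are hypothetical. The temperature τ=0.07 and the loss weights (1.0, 1.0, 0.1) are the values reported in the summary.

```python
import numpy as np


def _log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(query_q, item_q, tau=0.07):
    """Symmetric InfoNCE over a batch where row i of each input is a positive pair.

    Assumptions: L2-normalized (cosine) similarity and in-batch negatives.
    """
    q = query_q / np.linalg.norm(query_q, axis=1, keepdims=True)
    v = item_q / np.linalg.norm(item_q, axis=1, keepdims=True)
    logits = q @ v.T / tau
    diag = np.arange(len(q))
    # Query-to-item: each query must pick its own item among all items in the batch.
    loss_q2i = -_log_softmax(logits, axis=1)[diag, diag].mean()
    # Item-to-query: each item must pick its own query among all queries in the batch.
    loss_i2q = -_log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_q2i + loss_i2q)


def joint_loss(l_info_nce, l_commit, l_recon, weights=(1.0, 1.0, 0.1)):
    """Weighted sum of the three terms, using the weights reported in the summary."""
    w_nce, w_commit, w_recon = weights
    return w_nce * l_info_nce + w_commit * l_commit + w_recon * l_recon


rng = np.random.default_rng(0)
items = rng.normal(size=(4, 16))                    # stand-in quantized item vectors
queries = items + 0.05 * rng.normal(size=(4, 16))   # queries close to their own items
print(symmetric_info_nce(queries, items))
```

In the described system the inputs would be the quantized query and item representations, so the gradient of this term shapes the codebook partitions themselves rather than only the projection layers.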
Implications for AI Economics
- Platform performance and revenue:
  - Even small percentage gains in CTR/CVR at scale (here +0.13% UCTR, +0.25% UCTCVR) can translate to substantial revenue increases on large marketplaces. Discrete SIDs enable more fine-grained relevance that better matches buyers to items.
- Improved matching and market efficiency:
  - Query-bridged SIDs reduce search frictions for ambiguous and fine-grained intents (e.g., attribute-level distinctions), likely increasing consumer surplus and conversion rates, and affecting seller competition (favoring sellers with better attribute alignments).
- Interpretability and product-level policy:
  - Discrete codes are more interpretable than dense vectors because they can be mapped to clusters/categories. This lets platform operators analyze, audit, and set policies (e.g., promotion targeting, category-level boosts) using explicit clusters rather than opaque embedding neighborhoods.
- Long-tail and fairness considerations:
  - The category-aware codebook is a practical mechanism to protect low-frequency categories; from an economic perspective this helps reduce winner-takes-all dynamics driven by head items. However, codebook design and mapping choices can embed biases (e.g., over/under-representation of certain sellers or item types), so monitoring is necessary.
- Operational and cost trade-offs:
  - Benefits come with added modeling complexity: training an RQ‑VAE with contrastive loss and maintaining a fine-tuned LLM (even a moderate 0.6B model) require compute and engineering resources. But discrete SIDs can be compact (indices) and efficient at inference and for feature storage, potentially reducing retrieval/feature-store costs and enabling cheap prefix-matching logic.
- Research and policy questions for economists:
  - How do discrete, interpretable clusters change seller strategies and pricing? Do sellers adjust listings to target favorable SIDs?
  - Welfare analysis: quantify consumer surplus and seller surplus changes from improved relevance, especially across segments (head vs tail).
  - Dynamics and externalities: does improved fine-grained matching concentrate demand (reinforcing winner-take-all dynamics) or help niche sellers by better surfacing long-tail items?
  - Robustness and drift: how often must codebooks and LLMs be retrained to reflect new inventory and shifting intents, and what are the economic costs and benefits of a given retraining cadence?
- Practical recommendations for platform economists/managers:
  - Use DSIRM-like discrete features to run targeted A/B tests measuring revenue, conversion, and distributional effects across sellers/categories.
  - Monitor distributional impacts (which sellers/categories gain/lose) and evaluate whether category-aware constraints sufficiently protect long-tail supply.
  - Consider combining SID-based feedback signals with price/advertising experiments to study second-order strategic responses by sellers.
Assessment
Claims (8)
| Claim | Category | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|---|
| We propose a Discrete Semantic Identifier Relevance Model (DSIRM) that explicitly models discrete relevance features for e-commerce search. | Other | positive | high | explicit modeling of discrete relevance features | 0.18 |
| We present a query-bridged contrastive quantization approach on the item side, injecting query-item interaction supervision into Residual Quantization to actively learn relevance-aware semantic partitions. | Other | positive | high | relevance-aware semantic partitions learned via supervised quantization | 0.18 |
| We explore generative LLMs on the query side to explicitly predict item SIDs from text, resolving tail queries and intent ambiguity. | Other | positive | high | resolution of tail queries and reduction of intent ambiguity | 0.09 |
| Hierarchical prefix matching between query and item SIDs yields discriminative features that perfectly complement dense signals. | Other | positive | high | discriminative features complementary to dense signals | 0.18 |
| Extensive experimental results on Tmall's production data show that our proposed approach has achieved better results, improving offline AUC by +1.54%. | Output Quality | positive | high | offline AUC | +1.54%; 0.18 |
| Deployed via an efficient hybrid architecture, it achieves significant online lifts (+0.13% UCTR, +0.25% UCTCVR). | Firm Revenue | positive | high | user click-through rate (UCTR) and user click-to-conversion rate (UCTCVR) | +0.13% UCTR, +0.25% UCTCVR; 0.18 |
| The deployment and online lifts demonstrate the approach's industrial value. | Firm Revenue | positive | high | industrial value as evidenced by online metric improvements | +0.13% UCTR, +0.25% UCTCVR; 0.18 |
| Existing SID generation methods rely heavily on unsupervised quantization, and in realistic scenarios the lack of explicit supervision makes it difficult to dictate which items should share an SID, resulting in limited capability for query-dependent ranking. | Other | negative | high | capability for query-dependent ranking (limitation) | 0.09 |