Adding query-aware supervision to discrete semantic identifiers lifts e-commerce search relevance and conversions: a hybrid system that combines contrastive residual quantization with LLM-predicted SIDs raised offline AUC by 1.54 percentage points and produced measurable online gains (+0.13% UCTR, +0.25% UCTCVR) in Tmall production tests.

DSIRM: Learning Query-Bridged Discrete Semantic Identifiers for E-commerce Relevance Modeling

Bokang Wang, Xing Fang, Mingmin Jin, Jing Wang, Zhentao Song, Guangxin Song, Jianbo Zhu · June 03, 2026

arXiv · Paper type: descriptive · Evidence: medium · Relevance: 7/10

A query-bridged contrastive quantization approach combined with LLM-predicted discrete semantic identifiers (SIDs) improves e-commerce search relevance, yielding a 1.54 percentage-point gain in offline AUC and production lifts of +0.13% UCTR and +0.25% UCTCVR on Tmall.

Despite rapid progress of continuous embeddings for e-commerce search relevance, a long-standing open problem is the difficulty in capturing fine-grained attribute distinctions. While discrete Semantic Identifiers (SIDs) have been widely adopted as a promising alternative, existing SID generation methods rely heavily on unsupervised quantization. In realistic scenarios, the lack of explicit supervision often makes it more difficult to dictate which items should share an SID, resulting in limited capability for query-dependent ranking. To address the issue of unsupervised SIDs, we propose to explicitly model discrete relevance features and develop a Discrete Semantic Identifier Relevance Model (DSIRM). Specifically, we present a query-bridged contrastive quantization approach on the item side, injecting query-item interaction supervision into Residual Quantization to actively learn relevance-aware semantic partitions. On the other hand, we explore generative LLMs on the query side to explicitly predict item SIDs from text, resolving tail queries and intent ambiguity. Hierarchical prefix matching between query and item SIDs yields discriminative features that perfectly complement dense signals. Extensive experimental results on Tmall's production data show that our proposed approach has achieved better results, improving offline AUC by +1.54%. Deployed via an efficient hybrid architecture, it achieves significant online lifts (+0.13% UCTR, +0.25% UCTCVR), proving its massive industrial value.

Summary

Main Finding

DSIRM introduces query-bridged discrete Semantic Identifiers (SIDs) as structured, relevance-aware features for e-commerce ranking. By injecting query-item interaction supervision into a hierarchical residual quantizer (RQ‑VAE) and pairing that with an LLM that generates likely query SIDs, DSIRM produces discrete features that complement dense embeddings and improve ranking quality. On Tmall production data, DSIRM improves offline AUC by 1.54 percentage points (0.9356 vs. 0.9202 for the baseline) and delivers online lifts of +0.13% UCTR and +0.25% UCTCVR.

Key Points

Problem: continuous embeddings entangle multiple semantic facets and struggle with fine-grained, query-dependent distinctions (especially in e‑commerce where titles are homogeneous and queries are short/ambiguous).
Repositioning SIDs: treat hierarchical discrete codes not as retrieval-only targets but as explicit, structured relevance features to augment DNN ranking.
Item-side learning (Contrastive RQ‑VAE):
- Dual-tower setup: frozen pre-trained query/item embeddings are projected into a shared latent space, then quantized by a shared hierarchical RQ‑VAE.
- Query-bridged contrastive objective (InfoNCE) is applied to quantized representations so quantizer partitions reflect query-item co-occurrence / relevance, not just unsupervised geometry.
- Loss = InfoNCE + commitment + reconstruction (weights: λ_InfoNCE=1.0, λ_commit=1.0, λ_recon=0.1).
- Category-aware first-level codebook: first-level code is constrained/mapped to primary category to protect tail categories (K1 aligned to 216 top categories).
- Final SID is the concatenation of code indices across L levels (example L=3).
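The way a hierarchical residual quantizer turns one latent vector into a multi-level SID can be illustrated with a minimal sketch. It covers only the greedy nearest-code assignment and the concatenation of indices; the toy codebooks, the function names, and the squared Euclidean distance are assumptions for illustration, and the category-aware first level and all training losses are left out.

```python
from typing import List, Sequence, Tuple


def nearest_code(vec: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Index of the codebook entry closest to vec (squared Euclidean, an assumption)."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))


def residual_quantize(
    vec: Sequence[float],
    codebooks: Sequence[Sequence[Sequence[float]]],
) -> Tuple[Tuple[int, ...], List[float]]:
    """Quantize vec level by level; each level encodes what the previous levels missed."""
    residual = list(vec)
    recon = [0.0] * len(vec)
    sid: List[int] = []
    for codebook in codebooks:
        idx = nearest_code(residual, codebook)
        code = codebook[idx]
        sid.append(idx)
        recon = [r + c for r, c in zip(recon, code)]           # running reconstruction
        residual = [r - c for r, c in zip(residual, code)]     # pass the remainder down
    return tuple(sid), recon


# Hypothetical 2-D codebooks for L=3 levels (real ones would be learned).
toy_codebooks = [
    [[0.0, 0.0], [4.0, 4.0]],
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.25]],
]
sid, recon = residual_quantize([5.2, 4.3], toy_codebooks)
print(sid, recon)  # (1, 1, 1) [5.25, 4.25]
```

Because each level quantizes the residual of the previous one, items that share a longer SID prefix are closer in the latent space, which is what makes prefix matching meaningful downstream.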
Query-side learning (Generative LLM):
- Fine-tune an autoregressive LLM to predict item SIDs conditioned on a query using historical query↔item logs (supervised cross-entropy).
- At inference generate top-K SID sequences (beam search) to capture intent ambiguity.
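Top-K SID generation by beam search can be sketched independently of any particular LLM. In the example below the model is replaced by a hypothetical callback `next_log_probs(prefix)` that returns log-probabilities for the next code index; the toy probability table and all names are assumptions, not the paper's implementation.

```python
import math
from typing import Callable, Dict, List, Tuple

Prefix = Tuple[int, ...]


def beam_search_sids(
    next_log_probs: Callable[[Prefix], Dict[int, float]],
    levels: int = 3,
    beam_width: int = 5,
) -> List[Tuple[Prefix, float]]:
    """Return up to beam_width SID sequences with their summed log-probabilities."""
    beams: List[Tuple[Prefix, float]] = [((), 0.0)]
    for _ in range(levels):
        candidates = []
        for prefix, score in beams:
            for code, log_p in next_log_probs(prefix).items():
                candidates.append((prefix + (code,), score + log_p))
        # Keep only the highest-scoring partial SIDs at each level.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams


# Hypothetical stand-in for a fine-tuned LLM: a fixed table of next-code probabilities.
TOY_TABLE = {
    (): {12: 0.7, 88: 0.3},
    (12,): {40: 0.6, 305: 0.4},
    (88,): {1: 1.0},
}


def toy_next_log_probs(prefix: Prefix) -> Dict[int, float]:
    dist = TOY_TABLE.get(prefix, {7: 0.6, 9: 0.4})  # same third-level distribution everywhere
    return {code: math.log(p) for code, p in dist.items()}


top = beam_search_sids(toy_next_log_probs, levels=3, beam_width=2)
print([sid for sid, _ in top])  # [(12, 40, 7), (88, 1, 7)]
```

Keeping several beams lets an ambiguous query map to SIDs in different first-level partitions, here codes 12 and 88, rather than committing to a single interpretation.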
Relevance scoring:
- Compute hierarchical prefix matching between generated query SIDs and item SID; deeper prefix matches yield higher discrete SID score (example mapping {level0,1,2,3}→{0.0,0.25,0.5,1.0}).
- Embed discrete score and concatenate with dense features (dm, mm_dm, ct, qs) into a 3‑layer MLP DNN for ordinal relevance prediction.
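The prefix-matching step can be sketched in a few lines. The level-to-score mapping is the example mapping quoted above; taking the maximum over the top-K query SIDs is an assumption, since the summary does not say how the K candidates are aggregated, and the function names are hypothetical.

```python
from typing import Sequence, Tuple

SID = Tuple[int, ...]

# Example mapping from matched prefix depth to discrete score, as quoted in the text.
LEVEL_SCORES = {0: 0.0, 1: 0.25, 2: 0.5, 3: 1.0}


def shared_prefix_len(a: SID, b: SID) -> int:
    """Number of leading levels on which two SIDs agree."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def sid_score(query_sids: Sequence[SID], item_sid: SID) -> float:
    """Best prefix-match score across the generated query SIDs (max is an assumption)."""
    if not query_sids:
        return 0.0
    return max(LEVEL_SCORES[shared_prefix_len(q, item_sid)] for q in query_sids)


query_sids = [(12, 40, 7), (12, 305, 9), (88, 1, 1)]
print(sid_score(query_sids, (12, 305, 200)))  # 0.5: two levels match the second candidate
print(sid_score(query_sids, (12, 40, 7)))     # 1.0: full match
print(sid_score(query_sids, (5, 40, 7)))      # 0.0: first level differs, so no prefix
```

The last call shows why this is prefix matching and not set overlap: agreement on deeper levels counts for nothing when the first level differs.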
Empirical results & ablations:
- Large-scale production experiments (Tmall): SID learning used ~80M query-item pairs; relevance training used 1.6M labeled pairs (LLM-labeled with ~94% human consistency); test = 100k.
- RQ‑VAE topology: K1=216, K2=512, K3=512. Training hyperparams: batch=256, τ=0.07.
- LLM: Qwen3-0.6B fine-tuned (batch 128), beam K=5 at inference.
- DSIRM (static pre-trained embeddings + query-bridged RQ‑VAE) achieved AUC 0.9356 vs baseline 0.9202 (+1.54 pp).
- Ablations: removing contrastive bridge or the category constraint degrades AUC (contrastive learning and category-aware allocation are important). Static pre-trained embeddings for RQ‑VAE outperformed dynamic encoding.
- Online production lifts after deployment: +0.13% UCTR, +0.25% UCTCVR.
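The reported AUC gain is an absolute difference, which a short check on the two reported figures makes explicit. The variable names are illustrative.

```python
baseline_auc = 0.9202
dsirm_auc = 0.9356

# Absolute gain in percentage points versus gain relative to the baseline.
absolute_gain_pp = (dsirm_auc - baseline_auc) * 100
relative_gain_pct = (dsirm_auc - baseline_auc) / baseline_auc * 100

print(f"{absolute_gain_pp:.2f} pp absolute, {relative_gain_pct:.2f}% relative")
# 1.54 pp absolute, 1.67% relative
```

The headline "+1.54%" therefore corresponds to the absolute difference between the two AUC values, not to a relative improvement over the baseline.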

Data & Methods

Data
- SID learning: ~80M query-item interaction pairs from Tmall logs.
- Relevance scoring: 1.6M labeled pairs (labels generated by a Qwen3-30B model with careful prompts; validation shows ~94% agreement with human judgments); test 100k.
Model components & hyperparameters
- Pretrained dual-encoder relevance embeddings are frozen; learnable projection encoders map to R^d latent space.
- RQ‑VAE: hierarchical residual quantization with L levels; example L=3 with codebook sizes K1=216 (category-aligned), K2=512, K3=512.
- Joint training objective: L = λ_InfoNCE L_InfoNCE + λ_commit L_commit + λ_recon L_recon (used values: 1.0, 1.0, 0.1).
- Contrastive objective: symmetric InfoNCE applied to quantized representations of queries and items to form the “query bridge.”
- Category-aware codebook: first-level code fixed/mapped per primary category and updated by EMA to stabilize tail categories.
- Query SID generator: autoregressive LLM (Qwen3-0.6B in experiments) fine-tuned to predict SID sequences; beam search with top-K SIDs used at inference to represent intent ambiguity.
- Feature integration: hierarchical prefix match score (discrete ss) embedded and concatenated with dense features (dm, mm_dm, ct, qs) feeding a 3‑layer MLP for final relevance logits.
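The joint objective and the symmetric contrastive term can be sketched as follows. The sketch assumes cosine similarity and in-batch negatives, neither of which is stated in the summary, and treats the commitment and reconstruction losses as precomputed numbers. The temperature and the weights follow the values reported above; the function names are hypothetical.

```python
import math
from typing import List, Sequence


def _normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def _diag_cross_entropy(logits: List[List[float]]) -> float:
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_info_nce(
    query_vecs: Sequence[Sequence[float]],
    item_vecs: Sequence[Sequence[float]],
    tau: float = 0.07,
) -> float:
    """Average of query-to-item and item-to-query InfoNCE; pair i is the positive."""
    q = [_normalize(v) for v in query_vecs]
    d = [_normalize(v) for v in item_vecs]
    sim = [[sum(a * b for a, b in zip(qi, dj)) / tau for dj in d] for qi in q]
    sim_t = [list(col) for col in zip(*sim)]
    return 0.5 * (_diag_cross_entropy(sim) + _diag_cross_entropy(sim_t))


def joint_loss(l_info_nce: float, l_commit: float, l_recon: float,
               weights=(1.0, 1.0, 0.1)) -> float:
    """Weighted sum L = w0 * L_InfoNCE + w1 * L_commit + w2 * L_recon."""
    return weights[0] * l_info_nce + weights[1] * l_commit + weights[2] * l_recon


aligned = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)  # True: matching pairs give the lower contrastive loss
```

In the described method the vectors passed to the contrastive term would be the quantized query and item representations, so the gradient of this term shapes the codebook partitions as well as the projections.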
Evaluation
- Metrics: Precision, Recall, AUC; production metrics: UCTR, UCTCVR.
- Baselines: no-SID baseline and prior SID methods (DSI, TIGER) re-used with same query-side generator and downstream DNN to isolate item-SID quality effects.

Implications for AI Economics

Platform performance and revenue:
- Even small percentage gains in CTR/CVR at scale (here +0.13% UCTR, +0.25% UCTCVR) can translate to substantial revenue increases on large marketplaces. Discrete SIDs enable more fine-grained relevance that better matches buyers to items.
Improved matching and market efficiency:
- Query-bridged SIDs reduce search frictions for ambiguous and fine-grained intents (e.g., attribute-level distinctions), likely increasing consumer surplus and conversion rates, and affecting seller competition (favoring sellers with better attribute alignments).
Interpretability and product-level policy:
- Discrete codes are more interpretable than dense vectors — they can be mapped to clusters/categories — enabling platform operators to analyze, audit, and set policies (e.g., promotion targeting, category-level boosts) using explicit clusters rather than opaque embedding neighborhoods.
Long-tail and fairness considerations:
- The category-aware codebook is a practical mechanism to protect low-frequency categories; from an economic perspective this helps reduce winner-takes-all dynamics driven by head items. However, codebook design and mapping choices can embed biases (e.g., over/under-representation of certain sellers or item types), so monitoring is necessary.
Operational and cost trade-offs:
- Benefits come with added modeling complexity: training an RQ‑VAE with contrastive loss and maintaining a fine-tuned LLM (even a moderate 0.6B model) require compute and engineering resources. But discrete SIDs can be compact (indices) and efficient at inference and for feature storage, potentially reducing retrieval/feature-store costs and enabling cheap prefix-matching logic.
Research and policy questions for economists:
- How do discrete, interpretable clusters change seller strategies and pricing? Do sellers adjust listings to target favorable SIDs?
- Welfare analysis: quantify consumer surplus and seller surplus changes from improved relevance, especially across segments (head vs tail).
- Dynamics and externalities: does improved fine-grained matching concentrate demand (reinforcing winner-take-all dynamics) or help niche sellers by better surfacing long-tail items?
- Robustness and drift: how often must codebooks and LLMs be retrained to reflect new inventory and shifting intents, and what are the economic costs/benefits of retraining cadence?
Practical recommendations for platform economists/managers:
- Use DSIRM-like discrete features to run targeted A/B tests measuring revenue, conversion, and distributional effects across sellers/categories.
- Monitor distributional impacts (which sellers/categories gain/lose) and evaluate whether category-aware constraints sufficiently protect long-tail supply.
- Consider combining SID-based feedback signals with price/advertising experiments to study second-order strategic responses by sellers.


Assessment

Paper Type: descriptive
Evidence Strength: medium. Evaluation uses large-scale production data and reports both offline (AUC) and online metrics (UCTR, UCTCVR), showing tangible business impact; however, the description lacks detail on experimental design (randomization, statistical significance, duration, holdout procedures) and is limited to a single platform, so causal claims about general effectiveness are plausible but not fully substantiated for broader settings.
Methods Rigor: medium. The paper proposes a clear technical contribution (query-bridged contrastive quantization with residual quantization and LLM-based SID prediction) and reports empirical gains, but the writeup as presented omits important methodological details (sample sizes, hyperparameters, ablation studies, significance tests, robustness checks), making it difficult to fully assess reproducibility and the sensitivity of results.
Sample: Production e-commerce logs from Tmall: item catalog and dense embeddings; query-item interaction data (clicks, conversions) used to inject supervision into residual quantization; offline evaluation (AUC) and online A/B/deployment metrics (UCTR and UCTCVR) reported; exact sample sizes, time windows, and data splits not specified in the summary.
Themes: innovation adoption
Generalizability:
- Single-platform study (Tmall/Alibaba): may not generalize to other marketplaces
- E-commerce product search domain: methods and gains may not transfer to other search/use-cases
- Likely language/region-specific (Chinese queries/items): LLM behavior may differ across languages
- Requires substantial query-item interaction logs and production infrastructure: may not suit smaller firms
- Relies on deployment of LLMs and hybrid architecture: computational cost and latency tradeoffs may limit adoption

Claims (8)

Claim	Category	Direction	Confidence	Outcome	Details
We propose a Discrete Semantic Identifier Relevance Model (DSIRM) that explicitly models discrete relevance features for e-commerce search.	Other	positive	high	explicit modeling of discrete relevance features	0.18
We present a query-bridged contrastive quantization approach on the item side, injecting query-item interaction supervision into Residual Quantization to actively learn relevance-aware semantic partitions.	Other	positive	high	relevance-aware semantic partitions learned via supervised quantization	0.18
We explore generative LLMs on the query side to explicitly predict item SIDs from text, resolving tail queries and intent ambiguity.	Other	positive	high	resolution of tail queries and reduction of intent ambiguity	0.09
Hierarchical prefix matching between query and item SIDs yields discriminative features that perfectly complement dense signals.	Other	positive	high	discriminative features complementary to dense signals	0.18
Extensive experimental results on Tmall's production data show that our proposed approach has achieved better results, improving offline AUC by +1.54%.	Output Quality	positive	high	offline AUC	+1.54% 0.18
Deployed via an efficient hybrid architecture, it achieves significant online lifts (+0.13% UCTR, +0.25% UCTCVR).	Firm Revenue	positive	high	user click-through rate (UCTR) and user click-to-conversion rate (UCTCVR)	+0.13% UCTR, +0.25% UCTCVR 0.18
The deployment and online lifts demonstrate the approach's industrial value.	Firm Revenue	positive	high	industrial value as evidenced by online metric improvements	+0.13% UCTR, +0.25% UCTCVR 0.18
Existing SID generation methods rely heavily on unsupervised quantization, and in realistic scenarios the lack of explicit supervision makes it difficult to dictate which items should share an SID, resulting in limited capability for query-dependent ranking.	Other	negative	high	capability for query-dependent ranking (limitation)	0.09