A generative recommender deployed on JD's app increased clicks by 9.5% and purchases by 8.7% versus the incumbent pipeline; gains are attributed to page-wise next-token training, asymmetric token compression, and RL-based preference alignment.
Generative Retrieval (GR) offers a promising paradigm for recommendation through next-token prediction (NTP). However, scaling it to large industrial systems introduces three challenges: (i) within a single request, identical model inputs may produce inconsistent outputs due to the pagination request mechanism; (ii) encoding long user behavior sequences with multi-token item representations based on semantic IDs is prohibitively costly; and (iii) the generative policy must be aligned with nuanced user preference signals. We present GenRec, a preference-oriented generative framework deployed on the JD App that addresses these challenges within a single decoder-only architecture. For the training objective, we propose a Page-wise NTP task, which supervises over an entire interaction page rather than each interacted item individually, providing a denser gradient signal and resolving the one-to-many ambiguity of point-wise training. On the prefilling side, an asymmetric linear Token Merger compresses multi-token Semantic IDs in the prompt while preserving full-resolution decoding, reducing input length by ~2X with negligible accuracy loss. To further align outputs with user satisfaction, we introduce GRPO-SR, a reinforcement learning method that pairs Group Relative Policy Optimization with NLL regularization for training stability, and employs Hybrid Rewards combining a dense reward model with a relevance gate to mitigate reward hacking. In month-long online A/B tests serving production traffic, GenRec achieves a 9.5% improvement in click count and an 8.7% improvement in transaction count over the existing pipeline.
Summary
Main Finding
GenRec is a production-deployed, decoder-only generative retrieval recommender that (1) trains with page-wise next-token prediction to resolve one-to-many ambiguity in paginated interactions, (2) compresses multi-token Semantic IDs at the prompt using an asymmetric linear Token Merger to halve input length with negligible accuracy loss, and (3) aligns outputs to user satisfaction with a reinforcement-learning stage (GRPO-SR) combining group-relative policy optimization, a hybrid reward (dense SIM-based score gated by relevance), and NLL regularization. On JD.com traffic, GenRec delivered +9.5% click count and +8.7% transaction count vs the existing pipeline in month-long A/B tests.
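The difference between point-wise and page-wise supervision can be illustrated with a minimal sketch. The token ids, the `SID` type alias, and both helper functions are hypothetical, and the ordering of items within a page is an assumption; the source does not specify the exact target layout.

```python
from typing import List, Tuple

SID = Tuple[int, int, int]  # hierarchical Semantic ID triplet (s1, s2, s3)

def pointwise_examples(history: List[int], page: List[SID]):
    """One training example per interacted item: the same prompt maps to several targets."""
    return [(history, list(sid)) for sid in page]

def pagewise_example(history: List[int], page: List[SID]):
    """One training example per page: the prompt maps to the whole page sequence."""
    target = [tok for sid in page for tok in sid]
    return history, target

history = [11, 12, 13]          # hypothetical prompt token ids
page = [(1, 5, 9), (2, 6, 7)]   # two items interacted with on the same page

pw = pointwise_examples(history, page)  # two examples sharing one prompt
pg = pagewise_example(history, page)    # one example covering both items
```

With point-wise targets, an identical prompt is paired with two different answers, which is the one-to-many ambiguity described above. The page-wise target gives the cross-entropy loss six supervised positions for a single, unambiguous sequence.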
Key Points
- Page-wise NTP (PW-NTP): supervise the model to predict the whole interaction page (ordered/clicked/exposed items) instead of point-wise single-item next-token prediction. Benefits: denser gradients, resolves input→multiple-valid-outputs ambiguity from pagination, faster convergence, and much lower hallucination rate.
- Token Merger (asymmetric): during prompt prefilling, concatenate the multi-token SID embeddings (hierarchical SID triplet) and project them to one vector via a linear layer to reduce input length ≈2×. Decoding still uses full-resolution SID tokens to keep retrieval granularity.
- GRPO-SR (RL alignment): a one-step, group-relative policy optimization that
  - uses a dense SIM-based reward model to score rollouts,
  - applies a relevance gate G to suppress reward-hacked but irrelevant SID sequences,
  - anchors real positive items in a rollout group to the group maximum reward,
  - and adds an NLL supervised-regularization term over observed positive trajectories to prevent over-optimization and stabilize training.
- Empirical gains (offline): GenRec (Qwen2.5-3B backbone) outperforms prior generative (TIGER, LC-Rec) and traditional sequential recommenders on HR@K and NDCG, with lower hallucination rate. Example: HR@1 0.1189 (GenRec) vs 0.0947 (LC-Rec).
- Model scaling: moving from 1.5B→3B yields notable gains; 3B→7B shows only marginal improvements, suggesting diminishing returns and sensitivity to depth/width trade-offs (a deeper, narrower 3B model performed well).
- RL impact: GRPO-SR increases reward metrics and HR (e.g., HR@50 from 0.7192 → 0.7438) and substantially improves top-1 reward; gating is crucial to avoid reward hacking.
- Production results: month-long A/B test on JD App (10% traffic cohorts) — base GenRec SFT gave large improvements vs baseline (+8.5% clicks, +7.3% transactions); adding GRPO-SR gave +9.5% clicks and +8.7% transactions and higher exposure of long-tail items.
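The Token Merger described in the key points can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: the codebook size, embedding width, one embedding table per SID level, the bias term, and the name `merge_prompt_items` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB, D = 256, 8   # hypothetical codebook size and embedding width
LEVELS = 3          # SID triplet (s1, s2, s3)

# Assumption: one embedding table per SID level.
embed = rng.normal(size=(LEVELS, VOCAB, D))
# Linear merger: maps the concatenated 3*D vector back to width D.
W = rng.normal(size=(LEVELS * D, D)) / np.sqrt(LEVELS * D)
b = np.zeros(D)

def merge_prompt_items(sids: np.ndarray) -> np.ndarray:
    """Map (n_items, 3) SID triplets to (n_items, D) merged prompt vectors."""
    parts = [embed[level, sids[:, level]] for level in range(LEVELS)]
    concat = np.concatenate(parts, axis=-1)   # shape (n_items, 3 * D)
    return concat @ W + b                     # shape (n_items, D)

history = np.array([[3, 17, 200], [3, 18, 5], [90, 1, 1], [7, 7, 7]])
merged = merge_prompt_items(history)

unmerged_len = history.size     # token positions if every SID token is kept
merged_len = merged.shape[0]    # token positions after merging
```

In this toy case the item positions shrink from 12 to 4. The sketch covers only the item-token portion of a prompt and does not try to reproduce the ≈2× overall reduction reported in the source. Decoding is left untouched, so generated items are still emitted one SID token at a time.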
Data & Methods
- Data: industrial dataset from JD.com with ~560 million user interaction sequences collected over one month; the last day is held out for testing.
- Item representation:
  - Multimodal encoder (e.g., Qwen2.5-VL) to get continuous item embeddings (visual + textual).
  - Iterative RQ K-means vector quantization produces hierarchical Semantic IDs (SID triplet per item): SID(v_i) = {s1, s2, s3}.
- Architecture:
  - Decoder-only transformer (Qwen2.5 variants: 1.5B, 3B, 7B).
  - Prompt-side Token Merger: Linear(Concat(e(s1), e(s2), e(s3))) → single vector h_{v_i}, applied during prefilling only.
  - Decoding uses atomic SID tokens (no merging) to generate valid item tokens.
- Training pipeline:
  - Stage 1: Supervised Fine-Tuning (SFT) with the Page-wise NTP loss (autoregressive cross-entropy over the whole page sequence Y_page).
  - Stage 2: RL alignment with GRPO-SR. Rollouts generate candidate items using point-wise beam search (matching serving); training then computes gated hybrid rewards and group-relative advantages, and applies an importance-sampled policy gradient plus an NLL regularizer over positive trajectories.
- Reward model and anti-hacking:
  - Dense preference model r_pref ∈ [0,1] (SIM-based).
  - Gate G_i = I(s_i > τ) to ensure semantic relevance (τ small).
  - Calibration: positive real interactions in D+ are assigned the group maximum reward to prevent under-weighting.
- Compute / training details:
  - Trained on 8 NVIDIA H100 GPUs using AdamW with linear warmup and a cosine LR schedule.
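The reward gating and calibration steps listed above can be sketched as follows. Several details are assumptions rather than facts from the source: the gate is applied multiplicatively to r_pref, the threshold `tau=0.1` is a placeholder, and the advantage uses the standard GRPO mean/std normalization. The importance-sampled policy gradient and the NLL regularizer are omitted.

```python
import numpy as np

def gated_rewards(r_pref, relevance, is_positive, tau=0.1):
    """Hybrid reward for one rollout group.

    r_pref:      dense preference scores in [0, 1], shape (G,)
    relevance:   relevance scores s_i, shape (G,)
    is_positive: True where the rollout is an observed interaction (in D+)
    tau:         gate threshold (hypothetical value)
    """
    r_pref = np.asarray(r_pref, dtype=float)
    gate = (np.asarray(relevance) > tau).astype(float)  # G_i = I(s_i > tau)
    r = gate * r_pref                                   # assumed multiplicative gating
    # Calibration: observed positives receive the group maximum reward.
    return np.where(np.asarray(is_positive), r.max(), r)

def group_relative_advantage(r, eps=1e-8):
    """Normalize rewards within the group (standard GRPO form, assumed here)."""
    return (r - r.mean()) / (r.std() + eps)

r = gated_rewards(
    r_pref=[0.9, 0.6, 0.2, 0.4],
    relevance=[0.01, 0.8, 0.5, 0.7],  # first rollout scores high but is irrelevant
    is_positive=[False, False, False, True],
)
adv = group_relative_advantage(r)
```

The first rollout has the highest raw preference score but fails the gate, so it ends up with zero reward and the lowest advantage in the group. The observed positive is lifted to the group maximum and is therefore never ranked below a sampled rollout.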
Implications for AI Economics
- Cost vs benefit of generative retrieval:
  - Business impact: the measured lift in clicks and conversions (+8.5% to +9.5% clicks, +7.3% to +8.7% transactions) implies material revenue upside from better candidate generation and alignment.
  - Compute costs: models use large transformer backbones (1.5–7B) and multimodal encoders. Token Merger mitigates inference costs by halving prompt length (lower latency and token-processing cost) while preserving decoding fidelity, an operationally important, low-complexity optimization for scaling LLM-based recommenders.
- Value of training/inference asymmetry:
  - PW-NTP training (list-wise) combined with point-wise inference reconciles richer supervision with production beam-search pipelines. This suggests platforms can gain accuracy without redesigning serving infrastructure, which lowers migration cost.
- Investment in reward modeling and safeguards:
  - Dense reward models (e.g., SIM) enable denser gradients for RL alignment but are vulnerable to reward hacking. The economic lesson: deploying RL-aligned recommenders requires investment in (a) robust reward models, (b) simple gating/calibration mechanisms, and (c) monitoring/penalty systems. These are recurring operational costs that must be budgeted.
- Diminishing returns on model scale:
  - Observed marginal gains past ~3B parameters imply that, for recommender applications, platform operators should evaluate depth/width choices and ROI for larger models rather than blindly scaling. Token-level and architectural improvements can deliver more cost-effective gains than pure parameter scaling.
- Marketplace effects & externalities:
  - Increased exposure of long-tail items (reported higher exposure and conversions) can shift platform-side supply/demand dynamics, influencing seller visibility, pricing, and long-term consumer welfare.
  - Careful measurement needed: higher clicks/transactions may change per-user satisfaction and downstream retention; reward-aligned models must be audited for filter bubbles, fairness, and diversity impacts.
- Practical deployment considerations:
  - Data requirements are substantial (hundreds of millions of sessions) and heavy compute is needed for training and reward-model maintenance. Smaller platforms must weigh benefits vs data/compute investments.
  - Integration with existing retrieval-and-rank pipelines is feasible due to point-wise inference compatibility, lowering switching friction.
- Research & policy directions:
  - Economists and platform managers should quantify marginal revenue per model improvement (e.g., per % click uplift) vs incremental compute + monitoring costs.
  - Ongoing monitoring of hallucination rates and reward model drift is essential; contractual SLAs and interpretability for RL-driven ranking should be considered.
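The uplift-versus-cost comparison suggested above can be written as simple arithmetic. The sketch assumes that a relative uplift in transactions translates proportionally into revenue, which the source does not establish. Every figure is a hypothetical placeholder rather than JD data, and both function names are illustrative.

```python
def net_monthly_gain(baseline_revenue, uplift, margin, extra_cost):
    """Incremental monthly profit from a relative uplift, net of added operating cost."""
    return baseline_revenue * uplift * margin - extra_cost

def break_even_uplift(baseline_revenue, margin, extra_cost):
    """Smallest relative uplift at which the added cost is recovered."""
    return extra_cost / (baseline_revenue * margin)

# Hypothetical placeholders: revenue, margin, and cost are not from the source.
gain = net_monthly_gain(
    baseline_revenue=1_000_000, uplift=0.087, margin=0.10, extra_cost=5_000
)
threshold = break_even_uplift(
    baseline_revenue=1_000_000, margin=0.10, extra_cost=5_000
)
```

Under these placeholder inputs the break-even uplift is 5%, so an 8.7% uplift would clear it. The point is the structure of the comparison, since real margins and serving costs vary widely across platforms.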
Limitations / open questions (for economic assessment)
- Reward model maintenance cost and reliability across domains is unclear; changes in content or user behavior may require frequent retraining.
- Generalizability: results are from a large e-commerce platform (JD); performance and ROI may differ on smaller or different-domain platforms.
- Potential unintended incentives (e.g., promoting items that maximize short-term engagement but reduce long-term retention) require lifecycle metrics beyond clicks/transactions.
Assessment
Claims (10)
| Claim | Category | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|---|
| GenRec is deployed on the JD App. | Adoption Rate | positive | high | deployment on JD App | 0.6 |
| In month-long online A/B tests serving production traffic, GenRec achieves 9.5% improvement in click count over the existing pipeline. | Firm Revenue | positive | high | click count | 9.5% improvement in click count; 0.6 |
| In month-long online A/B tests serving production traffic, GenRec achieves 8.7% improvement in transaction count over the existing pipeline. | Firm Revenue | positive | high | transaction count | 8.7% improvement in transaction count; 0.6 |
| Page-wise NTP (next-token prediction) task supervises over an entire interaction page rather than each interacted item individually, providing denser gradient signal and resolving the one-to-many ambiguity of point-wise training. | Other | positive | high | training signal density / ambiguity resolution | 0.6 |
| An asymmetric linear Token Merger compresses multi-token Semantic IDs in the prompt while preserving full-resolution decoding, reducing input length by ~2X with negligible accuracy loss. | Other | positive | high | input length (prompt length) and model accuracy | ~2X reduction in input length; 0.6 |
| GRPO-SR (Group Relative Policy Optimization with NLL regularization and Hybrid Rewards) aligns generative policy outputs with user satisfaction, provides training stability, and mitigates reward hacking via a dense reward model combined with a relevance gate. | Output Quality | positive | high | alignment with user satisfaction / training stability / mitigation of reward hacking | 0.6 |
| Within a single request, identical model inputs may produce inconsistent outputs due to the pagination request mechanism (a challenge for GR/NTP recommendation at industrial scale). | Other | negative | high | output consistency per request | 0.3 |
| Encoding long user behavior sequences with multi-token item representations based on semantic IDs is prohibitively costly (a scaling challenge). | Other | negative | high | encoding cost / input length | 0.3 |
| Aligning the generative policy with nuanced user preference signals is a challenge for generative recommendation. | Output Quality | negative | high | policy alignment with user preferences | 0.3 |
| GenRec addresses the three listed challenges within a single decoder-only architecture. | Other | positive | high | ability to address listed challenges | 0.6 |