GenRec: A Preference-Oriented Generative Framework for Large-Scale Recommendation

Generative Retrieval (GR) offers a promising paradigm for recommendation through next-token prediction (NTP). However, scaling it to large-scale industrial systems introduces three challenges: (i) within a single request, identical model inputs may produce inconsistent outputs due to the pagination request mechanism; (ii) encoding long user behavior sequences with multi-token item representations based on semantic IDs is prohibitively costly; and (iii) the generative policy must be aligned with nuanced user preference signals. We present GenRec, a preference-oriented generative framework deployed on the JD App that addresses these challenges within a single decoder-only architecture. For the training objective, we propose a Page-wise NTP task, which supervises over an entire interaction page rather than each interacted item individually, providing a denser gradient signal and resolving the one-to-many ambiguity of point-wise training. On the prefilling side, an asymmetric linear Token Merger compresses multi-token Semantic IDs in the prompt while preserving full-resolution decoding, reducing input length by ~2X with negligible accuracy loss. To further align outputs with user satisfaction, we introduce GRPO-SR, a reinforcement learning method that pairs Group Relative Policy Optimization with NLL regularization for training stability and employs Hybrid Rewards combining a dense reward model with a relevance gate to mitigate reward hacking. In month-long online A/B tests serving production traffic, GenRec achieves a 9.5% improvement in click count and 8.7% in transaction count over the existing pipeline.

Summary

Main Finding

GenRec is a production-deployed, decoder-only generative retrieval recommender that (1) trains with page-wise next-token prediction to resolve one-to-many ambiguity in paginated interactions, (2) compresses multi-token Semantic IDs at the prompt using an asymmetric linear Token Merger to halve input length with negligible accuracy loss, and (3) aligns outputs to user satisfaction with a reinforcement-learning stage (GRPO-SR) combining group-relative policy optimization, a hybrid reward (dense SIM-based score gated by relevance), and NLL regularization. On JD.com traffic, GenRec delivered +9.5% click count and +8.7% transaction count vs the existing pipeline in month-long A/B tests.

Key Points

Page-wise NTP (PW-NTP): supervise the model to predict the whole interaction page (ordered/clicked/exposed items) instead of point-wise single-item next-token prediction. Benefits: denser gradients, resolves input→multiple-valid-outputs ambiguity from pagination, faster convergence, and much lower hallucination rate.
Token Merger (asymmetric): during prompt prefilling, concatenate the multi-token SID embeddings (hierarchical SID triplet) and project them to one vector via a linear layer to reduce input length ≈2×. Decoding still uses full-resolution SID tokens to keep retrieval granularity.
GRPO-SR (RL alignment): a one-step, group-relative policy optimization that
- uses a dense SIM-based reward model to score rollouts,
- applies a relevance gate G to suppress reward-hacked but irrelevant SID sequences,
- anchors real positive items in a rollout group to the group maximum reward,
- and adds an NLL supervised-regularization term over observed positive trajectories to prevent over-optimization and stabilize training.
Empirical gains (offline): GenRec (Qwen2.5-3B backbone) outperforms prior generative (TIGER, LC-Rec) and traditional sequential recommenders on HR@K and NDCG, with lower hallucination rate. Example: HR@1 0.1189 (GenRec) vs 0.0947 (LC-Rec).
Model scaling: moving from 1.5B→3B yields notable gains, while 3B→7B shows only marginal improvements, suggesting diminishing returns and sensitivity to depth/width trade-offs (a deeper, narrower 3B configuration performed well).
RL impact: GRPO-SR increases reward metrics and HR (e.g., HR@50 from 0.7192 → 0.7438) and substantially improves top-1 reward; gating is crucial to avoid reward hacking.
Production results: month-long A/B test on JD App (10% traffic cohorts) — base GenRec SFT gave large improvements vs baseline (+8.5% clicks, +7.3% transactions); adding GRPO-SR gave +9.5% clicks and +8.7% transactions and higher exposure of long-tail items.
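
The offline metrics named above (HR@K, NDCG@K, hallucination rate) can be illustrated with a minimal sketch. It assumes a single held-out target item per test case, and it assumes that "hallucination" means a generated SID tuple that matches no catalogue item; neither definition is spelled out in this summary, and the function names are illustrative.

```python
import math

def hit_rate_at_k(ranked, target, k):
    """Return 1.0 if the held-out target appears in the top-k list, else 0.0."""
    return 1.0 if target in ranked[:k] else 0.0

def ndcg_at_k(ranked, target, k):
    """NDCG@K with one relevant item: 1/log2(rank + 1) on a hit, else 0.

    With a single relevant item the ideal DCG is 1, so no extra
    normalisation is needed.
    """
    for rank, item in enumerate(ranked[:k], start=1):
        if item == target:
            return 1.0 / math.log2(rank + 1)
    return 0.0

def hallucination_rate(generated_sids, valid_sids):
    """Share of generated SID tuples absent from the catalogue (assumed definition)."""
    if not generated_sids:
        return 0.0
    invalid = sum(1 for sid in generated_sids if sid not in valid_sids)
    return invalid / len(generated_sids)

# Toy data: the target "b" sits at rank 2.
ranked = ["a", "b", "c"]
hr1 = hit_rate_at_k(ranked, "b", 1)
hr2 = hit_rate_at_k(ranked, "b", 2)
ndcg3 = ndcg_at_k(ranked, "b", 3)
halluc = hallucination_rate([(1, 2, 3), (9, 9, 9)], {(1, 2, 3)})
```

Averaging these per-case values over the test set gives corpus-level figures of the kind quoted above.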

Data & Methods

Data: industrial dataset from JD.com with ~560 million user interaction sequences collected over one month; the last day is held out for testing.
Item representation:
- Multimodal encoder (e.g., Qwen2.5-VL) to get continuous item embeddings (visual + textual).
- Iterative RQ K-means vector quantization produces hierarchical Semantic IDs (SID triplet per item): SID(vi) = {s1, s2, s3}.
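
As a rough illustration of how a hierarchical SID triplet can arise from residual quantization, the sketch below assigns the nearest centroid at each level and passes the residual on to the next level. The codebooks are hand-set toy values (in practice they would be learned, for example by k-means on the residuals at each level), and the function names are illustrative rather than taken from the paper.

```python
def nearest(vec, codebook):
    """Index of the centroid closest to vec under squared Euclidean distance."""
    def dist(centroid):
        return sum((v - c) ** 2 for v, c in zip(vec, centroid))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def residual_quantize(embedding, codebooks):
    """Return one code per level plus the final unexplained residual."""
    residual = list(embedding)
    sid = []
    for codebook in codebooks:
        idx = nearest(residual, codebook)
        sid.append(idx)
        # Subtract the chosen centroid so the next level refines what is left.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tuple(sid), residual

# Hypothetical 2-D codebooks, coarse to fine (two centroids per level).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.5, -0.5]],
    [[0.0, 0.0], [0.1, 0.1]],
]
sid, residual = residual_quantize([1.5, 0.5], codebooks)
```

Here the toy embedding maps to the triplet (1, 1, 0): the first code captures the coarse location and the later codes refine it.
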
Architecture:
- Decoder-only transformer (Qwen2.5 variants: 1.5B, 3B, 7B).
- Prompt-side Token Merger: Linear(Concat(e(s1), e(s2), e(s3))) → single vector hvi during prefilling only.
- Decoding uses atomic SID tokens (no merging) to generate valid item tokens.
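
The prompt-side merge can be sketched in plain Python as follows. This is a toy under stated assumptions: the dimensions, the random initialisation, and the use of one embedding table per SID level are all hypothetical, and a real implementation would use a tensor library with trained weights.

```python
import random

EMBED_DIM = 4          # hypothetical toy width
SID_LEVELS = 3         # one code per level: (s1, s2, s3)
VOCAB_PER_LEVEL = 8    # hypothetical codebook size

rng = random.Random(0)

# embeddings[level][code] -> vector of length EMBED_DIM (assumed layout)
embeddings = [
    [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in range(VOCAB_PER_LEVEL)]
    for _ in range(SID_LEVELS)
]

# Linear layer mapping the concatenation (3 * EMBED_DIM) back to EMBED_DIM.
weight = [
    [rng.uniform(-0.1, 0.1) for _ in range(SID_LEVELS * EMBED_DIM)]
    for _ in range(EMBED_DIM)
]
bias = [0.0] * EMBED_DIM

def merge_item(sid):
    """Linear(Concat(e(s1), e(s2), e(s3))) -> one vector per item."""
    concat = [x for level, code in enumerate(sid) for x in embeddings[level][code]]
    return [sum(w * x for w, x in zip(row, concat)) + b for row, b in zip(weight, bias)]

def prefill_inputs(history):
    """Prompt side: one merged vector per historical item."""
    return [merge_item(sid) for sid in history]

def unmerged_inputs(history):
    """Baseline: one embedding per SID token, as used on the decoding side."""
    return [embeddings[level][code] for sid in history for level, code in enumerate(sid)]

history = [(1, 4, 2), (0, 7, 3), (5, 5, 1)]
merged = prefill_inputs(history)
baseline = unmerged_inputs(history)
```

In this toy the item history shrinks from 9 token embeddings to 3 merged vectors. The ≈2× figure quoted in this summary refers to overall input length, whose composition is not detailed here.
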
Training pipeline:
- Stage 1: Supervised Fine-Tuning (SFT) with Page-wise NTP loss (autoregressive cross-entropy over the whole page sequence Ypage).
- Stage 2: RL alignment with GRPO-SR — rollouts generate candidate items using point-wise beam search (matching serving), compute gated hybrid rewards, group-relative advantage, importance-sampled policy gradient plus NLL regularizer over positive trajectories.
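
To make the contrast between point-wise and page-wise supervision concrete, the sketch below builds both kinds of training example from the same request. Point-wise training yields several examples with identical inputs but different targets (the one-to-many ambiguity), whereas page-wise training yields one example whose target is the whole page. The data layout and function names are assumptions for illustration only.

```python
import math

def pointwise_examples(history, page):
    """One (input, target) pair per interacted item; the input is repeated."""
    return [(tuple(history), tuple(sid)) for sid in page]

def pagewise_example(history, page):
    """A single pair whose target concatenates the SID tokens of every page item."""
    target = tuple(tok for sid in page for tok in sid)
    return (tuple(history), target)

def sequence_nll(token_probs):
    """Autoregressive cross-entropy: sum of -log p(token | prefix) over the target."""
    return -sum(math.log(p) for p in token_probs)

history = [(1, 4, 2), (0, 7, 3)]   # SID triplets of past items (toy values)
page = [(5, 5, 1), (2, 0, 6)]      # items interacted with on one page (toy values)

pw = pointwise_examples(history, page)
pg = pagewise_example(history, page)

# Hypothetical per-token probabilities assigned by a model to the 6 page tokens.
page_loss = sequence_nll([0.5, 0.5, 0.5, 0.5, 0.5, 0.5])
```

The page-wise loss sums over every SID token on the page, which is one way to read the "denser gradient signal" described in this summary.
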
Reward model and anti-hacking:
- Dense preference model r_pref ∈ [0,1] (SIM-based).
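- Gate G_i = I(s_i > τ) to ensure semantic relevance, where I(·) is the indicator function, s_i is the relevance score of rollout i (not an SID token), and τ is a small threshold.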
- Calibration: positive real interactions in D+ get assigned group max reward to prevent under-weighting.
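
The reward pipeline described above can be sketched as follows. Several details are assumptions not confirmed by this summary: the gate is applied multiplicatively, positives are anchored to the maximum of the gated rewards, advantages use the common mean/standard-deviation normalisation from GRPO, clipping is omitted, and the threshold `tau` and weight `lam` are hypothetical values.

```python
from statistics import mean, pstdev

def gated_reward(r_pref, relevance, tau):
    """Dense preference score, zeroed when the relevance score is not above tau."""
    gate = 1.0 if relevance > tau else 0.0
    return gate * r_pref

def group_rewards(rollouts, tau=0.1):
    """Gate every rollout, then lift observed positives to the group maximum."""
    rewards = [gated_reward(r["r_pref"], r["relevance"], tau) for r in rollouts]
    top = max(rewards)
    return [top if r["is_positive"] else rew for r, rew in zip(rollouts, rewards)]

def group_relative_advantages(rewards, eps=1e-8):
    """Standardise rewards within the rollout group (assumed GRPO-style form)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def grpo_sr_loss(ratios, advantages, positive_nll, lam=0.1):
    """Negative importance-weighted advantage plus lam * NLL on observed positives."""
    policy_term = -mean([rho * a for rho, a in zip(ratios, advantages)])
    return policy_term + lam * positive_nll

rollouts = [
    {"r_pref": 0.9, "relevance": 0.02, "is_positive": False},  # high score, irrelevant
    {"r_pref": 0.6, "relevance": 0.80, "is_positive": False},
    {"r_pref": 0.3, "relevance": 0.70, "is_positive": True},   # real interaction
    {"r_pref": 0.2, "relevance": 0.50, "is_positive": False},
]
rewards = group_rewards(rollouts)
advantages = group_relative_advantages(rewards)
loss = grpo_sr_loss([1.0] * 4, advantages, positive_nll=2.0)
```

In this toy group the first rollout has the highest raw preference score but is gated to zero, so it receives a negative advantage, while the real positive is lifted to the group maximum.
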
Compute / training details:
- Trained on 8 NVIDIA H100 GPUs using AdamW, linear warmup + cosine LR schedule.
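
A linear-warmup plus cosine-decay schedule of the kind mentioned above can be written as a small function. The summary does not report step counts or learning rates, so every number below is a hypothetical placeholder, and the exact warmup convention (starting from `peak_lr / warmup_steps` rather than zero) is one common choice among several.

```python
import math

def lr_at(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Learning rate at a given step: linear ramp, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Linear warmup reaching peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Hypothetical settings, for illustration only.
TOTAL, WARMUP, PEAK = 1000, 100, 1e-4
schedule = [lr_at(s, TOTAL, WARMUP, PEAK) for s in range(TOTAL + 1)]
```

In a typical training loop the value returned for each step would be written into the optimizer's parameter groups before the update.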

Implications for AI Economics

Cost vs benefit of generative retrieval:
- Business impact: the measured lift in clicks and transactions (+9.5% and +8.7% respectively with GRPO-SR) implies material revenue upside from better candidate generation and alignment.
- Compute costs: models use large transformer backbones (1.5–7B) and multimodal encoders; Token Merger mitigates inference costs by halving prompt length (lower latency and token-processing cost) while preserving decoding fidelity — an operationally important, low-complexity optimization for scaling LLM-based recommenders.
Value of training/inference asymmetry:
- PW-NTP training (list-wise) + point-wise inference reconciles richer supervision with production beam-search pipelines; this suggests platforms can gain accuracy without redesigning serving infrastructure — lower migration cost.
Investment in reward modeling and safeguards:
- Dense reward models (e.g., SIM) enable denser gradients for RL alignment but are vulnerable to reward hacking. The economic lesson: deployment of RL-aligned recommenders requires investment in (a) robust reward models, (b) simple gating/calibration mechanisms, and (c) monitoring/penalty systems — these are recurring operational costs that must be budgeted.
Diminishing returns on model scale:
- Observed marginal gains past ~3B parameters imply that for recommender applications, platform operators should evaluate depth/width choices and ROI for larger models rather than blindly scaling. Token-level and architectural improvements can deliver more cost-effective gains than pure parameter scaling.
Marketplace effects & externalities:
- Increased exposure of long-tail items (reported higher exposure and conversions) can shift platform-side supply/demand dynamics — influencing seller visibility, pricing, and long-term consumer welfare.
- Careful measurement needed: higher clicks/transactions may change per-user satisfaction and downstream retention; reward-aligned models must be audited for filter bubbles, fairness, and diversity impacts.
Practical deployment considerations:
- Data requirements are substantial (hundreds of millions of sessions) and heavy compute is needed for training and reward-model maintenance. Smaller platforms must weigh benefits vs data/compute investments.
- Integration with existing retrieval-and-rank pipelines is feasible due to point-wise inference compatibility, lowering switching friction.
Research & policy directions:
- Economists and platform managers should quantify marginal revenue per model improvement (e.g., per % click uplift) vs incremental compute + monitoring costs.
- Ongoing monitoring of hallucination rates and reward model drift is essential; contractual SLAs and interpretability for RL-driven ranking should be considered.

Limitations / open questions (for economic assessment):
- Reward model maintenance cost and reliability across domains is unclear; changes in content or user behavior may require frequent retraining.
- Generalizability: results are from a large e-commerce platform (JD); performance and ROI may differ on smaller or different-domain platforms.
- Potential unintended incentives (e.g., promoting items that maximize short-term engagement but reduce long-term retention) require lifecycle metrics beyond clicks/transactions.

Assessment

Paper Type: RCT

Evidence Strength: high. The paper reports month-long online A/B test results on production traffic with large-scope business outcomes (clicks and transactions), which provides strong causal evidence for the deployed system's impact; however, details on randomization unit, statistical significance, heterogeneity, and robustness checks are not provided in the summary, and results are from a single platform.

Methods Rigor: high. Evaluation uses a production randomized experiment and measures user-facing business metrics; the system design addresses known ML engineering challenges (training objective, input compression, and RL alignment) and includes stabilizing regularization and hybrid rewards, indicating rigorous empirical engineering, though the paper summary omits specifics on sample sizes, randomization checks, confidence intervals, and longer-term effects.

Sample: Month-long production deployment on the JD App serving real users and traffic; model trained on historical user-item interaction sequences using multi-token semantic IDs (compressed at inference via Token Merger); RL fine-tuning used a learned dense reward model plus relevance gating; exact sample size, user demographics, and traffic split details not reported in the summary.

Themes: innovation adoption

Identification: Randomized online A/B experiment on production traffic: users/requests were (implicitly) randomly assigned to receive GenRec versus the incumbent pipeline, and causal effects on click and transaction counts are estimated from the controlled split; auxiliary causal claims about algorithmic components are supported by offline ablations and RL fine-tuning, but the primary causal identification of impact on business metrics relies on the randomized A/B test.

Generalizability:
- Single platform (JD App) and e-commerce product recommendations; may not generalize to other domains (news, video, social).
- Solution depends on JD's pagination UI and multi-token semantic ID encoding; other UI/encoding designs may require adaptation.
- Requires production-scale infrastructure and decoder-only architecture; smaller services may not replicate gains.
- Month-long window may not capture long-run effects, novelty, or seasonal variation.
- Reported outcomes are clicks and transactions; consumer welfare, prices, and supplier-side effects are not measured.

Claims (10)

Claim	Category	Direction	Confidence	Outcome	Details
GenRec is deployed on the JD App.	Adoption Rate	positive	high	deployment on JD App	0.6
In month-long online A/B tests serving production traffic, GenRec achieves 9.5% improvement in click count over the existing pipeline.	Firm Revenue	positive	high	click count	9.5% improvement in click count 0.6
In month-long online A/B tests serving production traffic, GenRec achieves 8.7% improvement in transaction count over the existing pipeline.	Firm Revenue	positive	high	transaction count	8.7% improvement in transaction count 0.6
Page-wise NTP (next-token prediction) task supervises over an entire interaction page rather than each interacted item individually, providing denser gradient signal and resolving the one-to-many ambiguity of point-wise training.	Other	positive	high	training signal density / ambiguity resolution	0.6
An asymmetric linear Token Merger compresses multi-token Semantic IDs in the prompt while preserving full-resolution decoding, reducing input length by ~2X with negligible accuracy loss.	Other	positive	high	input length (prompt length) and model accuracy	~2X reduction in input length 0.6
GRPO-SR (Group Relative Policy Optimization with NLL regularization and Hybrid Rewards) aligns generative policy outputs with user satisfaction, provides training stability, and mitigates reward hacking via a dense reward model combined with a relevance gate.	Output Quality	positive	high	alignment with user satisfaction / training stability / mitigation of reward hacking	0.6
Within a single request, identical model inputs may produce inconsistent outputs due to the pagination request mechanism (a challenge for GR/NTP recommendation at industrial scale).	Other	negative	high	output consistency per request	0.3
Encoding long user behavior sequences with multi-token item representations based on semantic IDs is prohibitively costly (a scaling challenge).	Other	negative	high	encoding cost / input length	0.3
Aligning the generative policy with nuanced user preference signals is a challenge for generative recommendation.	Output Quality	negative	high	policy alignment with user preferences	0.3
GenRec addresses the three listed challenges within a single decoder-only architecture.	Other	positive	high	ability to address listed challenges	0.6

A generative recommender deployed on JD's app increased clicks by 9.5% and purchases by 8.7% versus the incumbent pipeline; gains are attributed to page-wise next-token training, asymmetric token compression, and RL-based preference alignment.