A lightweight, feature-driven content encoder paired with a temporally-aware device tower boosts discovery of newly ingested videos: randomized production tests at Tubi show consistent gains in cold-start engagement, faster item promotion, and more impressions for new content.
Collaborative filtering and graph-based recommendation models are highly effective because they leverage observed user interactions, but this dependence creates a fundamental cold-start challenge when newly added content has no interaction history. In Tubi's production retrieval system, this challenge is further constrained by the serving interface: new content must be assigned a standalone embedding immediately, and the model must also produce device embeddings suitable for approximate nearest-neighbor retrieval. We address this setting by formulating cold-start recommendation as an inductive graph-completion problem on a temporal bipartite device-content graph. We propose Shallow-RHS, an asymmetric link-prediction architecture in which the left-hand side (LHS) device tower leverages temporally valid watch-history message passing to capture collaborative signals, while the right-hand side (RHS) content tower is intentionally shallow with respect to the graph and encodes content solely from intrinsic features. The RHS tower does not use ID-based embeddings, content-side subgraphs, neighbor aggregation, or interaction-derived representations, forcing the content encoder to map intrinsic features into a collaborative-filtering-aware embedding space. After training, the learned content encoder generates embeddings for both warm and newly ingested content, enabling implicit graph completion through retrieval of warm surrogate neighbors. We further extend the same representation-completion principle to device cold-start by constructing cohort-based embeddings from demographic features. Large-scale online experiments demonstrate consistent relative improvements in content cold-start engagement, promotion speed, impression acquisition, and device cold-start engagement.
Summary
Main Finding
Shallow-RHS — an asymmetric, temporal two-tower graph link-prediction architecture — successfully converts intrinsic content features into collaborative-filtering (CF)–aware embeddings for strict item cold-start. By keeping the content (RHS) tower shallow (no ID embeddings, no interaction-derived subgraphs) and allowing the device (LHS) tower to leverage temporally valid message passing over watch histories, the model (1) produces immediate, indexable embeddings for new titles, (2) implicitly completes the missing interaction neighborhood via nearest-neighbor retrieval of warm surrogate items, and (3) extends the same representation-completion idea to device cold-start via demographic cohort priors. Deployed at Tubi, this approach produced consistent online gains in cold-content engagement, promotion speed, impression acquisition, and first-touch device engagement.
Key Points
- Problem framing
- Cold-start cast as an inductive graph-completion / temporal link-prediction task on a bipartite device–content graph where new content nodes have zero watch edges but full intrinsic features.
- Serving constraint: new content must have a standalone embedding usable for ANN retrieval immediately.
- Architecture (Shallow-RHS)
- Asymmetric two-tower design:
- LHS (device tower): uses device features + temporally valid message passing over historical watch events (multi-hop neighborhood aggregation) to capture collaborative signals.
- RHS (content tower): intentionally shallow — encodes content only from intrinsic features (metadata, taxonomy, LLM semantic embeddings, etc.). No ID embeddings, no content-side neighbor aggregation, and no interaction-derived representations on RHS.
- Both towers use HeteroTF-style tabular encoders built with FT-Transformer (PyTorch Frame) to handle heterogeneous columns (text, categorical, numeric, precomputed embeddings).
- Scoring by cosine similarity in the shared CF-aware embedding space; trained with a temporal softmax link-prediction loss (positives from future window, temporally valid history for inputs).
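
The scoring and loss described above can be sketched in a few lines. This is a minimal illustration, assuming one positive and a small set of sampled negatives per device; the function names, the temperature value `tau=0.1`, and the toy vectors are hypothetical and not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def softmax_link_loss(z_device, z_pos, z_negs, tau=0.1):
    """Negative log-softmax of the positive item among positive + negatives.

    z_pos is the embedding of an item watched in the future window;
    z_negs are embeddings of unwatched candidates (assumed sampling scheme).
    """
    logits = [cosine(z_device, z_pos) / tau]
    logits += [cosine(z_device, z) / tau for z in z_negs]
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_norm - logits[0]

z_d = [1.0, 0.0]
# Positive close to the device embedding: loss is near zero.
aligned = softmax_link_loss(z_d, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
# Positive orthogonal to the device embedding: loss is large.
misaligned = softmax_link_loss(z_d, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])
```

Because the content embedding enters the loss only through the intrinsic-feature encoder, gradients from this objective are what push that encoder toward a CF-aware space.
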
- Implicit graph completion & serving-time strategy
- After training, generate embeddings for warm and cold content via the RHS encoder and build an ANN index over warm item embeddings.
- For each cold item, retrieve top-M warm surrogate neighbors (nearest warm embeddings) to act as an interpretable behavioral proxy for promotion/retrieval without modifying the interaction graph.
- Device cold-start: cluster warm devices into demographic cohorts; represent a new device by the cohort-average embedding for immediate retrieval.
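
The surrogate-neighbor step can be illustrated with an exact brute-force search standing in for the ANN index. This is a sketch under stated assumptions: the item IDs, the two-dimensional embeddings, and the function names are hypothetical, and a production system would query an approximate index rather than scan every warm item.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def warm_surrogates(z_cold, warm_embeddings, m=2):
    """Return the IDs of the m warm items most similar to a cold item.

    Exact cosine search; ties are broken by item ID for determinism.
    """
    query = normalize(z_cold)
    scored = []
    for item_id, z in warm_embeddings.items():
        sim = sum(a * b for a, b in zip(query, normalize(z)))
        scored.append((sim, item_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [item_id for _, item_id in scored[:m]]

# Hypothetical warm catalog embeddings produced by the content encoder.
warm = {"warm_a": [1.0, 0.0], "warm_b": [0.8, 0.6], "warm_c": [0.0, 1.0]}
# Embedding of a newly ingested title with no watch edges.
surrogates = warm_surrogates([0.9, 0.2], warm, m=2)
```

The interaction graph itself is never modified; the surrogate list is only a lookup result that downstream promotion logic can consume.
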
- Deployment & empirical outcomes
- Implemented on the Kumo GNN platform using one year of production watch logs.
- Graph scale: hundreds of millions of devices, hundreds of thousands of content items, billions of temporal watch edges.
- Data preprocessing: filter out short views (below the threshold r_min) and preserve temporal edges; training uses long-range history, but labels are sampled from a recent prediction window to avoid leakage.
- Large-scale online experiments at Tubi show consistent relative improvements in cold-content engagement metrics (total view time), promotion speed, impression acquisition, and first-touch device engagement. (Paper reports consistent relative gains but does not publish specific numeric lifts in the provided excerpt.)
Data & Methods
- Data
- Source: Tubi production watch-history logs.
- Temporal bipartite graph G_{≤t} = (D ∪ C, E_{≤t}, X_D, X_C) with timestamped watch edges (including total viewing time as an edge feature).
- One-year lookback for historical context; training labels sampled from the last K days (T_sup).
- Node features:
- Device: country/region, device type, platform, join time, available demographics/account attributes.
- Content: title text, type, genre, language, production year, duration, maturity rating, cast/director, external ratings, and dense LLM-based semantic embeddings from metadata/synopsis/scripts.
- Model components
- RHS content encoder: HeteroTF (FT-Transformer) that maps intrinsic features ⇒ z_c.
- LHS device encoder: computes an initial HeteroTF embedding h^(0)_d from x_d; constructs messages from historical edges using z_c, a total-viewing-time (TVT) encoder γ, and a recency term η; aggregates them with permutation-invariant aggregators (mean, max, etc.) and applies MLP updates across layers.
- Loss: temporal softmax contrastive loss (positives = future watched items, negatives = unwatched candidates) with temperature τ_s.
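
One device-tower layer can be sketched as follows. The source names the ingredients (z_c, γ, η, a permutation-invariant aggregator, an MLP update) but not their functional forms, so everything concrete here is an assumption: γ is taken as `log1p` of viewing time, η as exponential decay with a hypothetical `decay_days`, the aggregator as a mean, and the MLP update is replaced by a simple average of the previous state and the aggregate.

```python
import math

def device_layer(h_device, history, z_content, t_now, decay_days=30.0):
    """One simplified message-passing layer for the device (LHS) tower.

    history: list of (content_id, total_view_time, timestamp) watch edges.
    Only edges with timestamp < t_now are used (temporal validity).
    """
    valid = [(c, tvt, ts) for c, tvt, ts in history if ts < t_now]
    if not valid:
        return list(h_device)  # no usable history: keep the feature-only state
    dim = len(h_device)
    agg = [0.0] * dim
    for content_id, tvt, ts in valid:
        gamma = math.log1p(tvt)                       # assumed TVT encoder
        eta = math.exp(-(t_now - ts) / decay_days)    # assumed recency weight
        for i in range(dim):
            agg[i] += gamma * eta * z_content[content_id][i]
    agg = [a / len(valid) for a in agg]               # mean aggregation
    # Stand-in for the learned MLP update.
    return [0.5 * h + 0.5 * a for h, a in zip(h_device, agg)]

z_c = {"c1": [1.0, 0.0], "c2": [0.0, 1.0]}
hist = [("c1", 600.0, 5.0), ("c2", 60.0, 9.0), ("c2", 999.0, 12.0)]
h1 = device_layer([0.0, 0.0], hist, z_c, t_now=10.0)
h1_reversed = device_layer([0.0, 0.0], list(reversed(hist)), z_c, t_now=10.0)
h1_no_future = device_layer([0.0, 0.0], hist[:2], z_c, t_now=10.0)
```

The edge at timestamp 12.0 lies after `t_now` and is ignored, and reordering the history leaves the output unchanged up to floating-point rounding, which is the property the permutation-invariant aggregator is meant to provide.
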
- Serving & inference
- Build ANN index over warm item embeddings; for cold items compute z_c and retrieve warm surrogates S_M(c).
- For device cold-start, infer cohort assignment and use cohort-average embedding z_g as the device embedding for retrieval against CF-aware content embeddings.
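
The cohort-average embedding z_g can be sketched as a group-by mean over warm device embeddings. Treating a cohort as a (country, device type) tuple is an assumption made only for this example; the source does not specify how cohorts are formed or assigned.

```python
from collections import defaultdict

def cohort_embeddings(device_embeddings, device_cohort):
    """Average warm device embeddings within each cohort.

    device_embeddings: {device_id: vector}
    device_cohort: {device_id: hashable cohort key}
    """
    sums = {}
    counts = defaultdict(int)
    for device_id, z in device_embeddings.items():
        cohort = device_cohort[device_id]
        if cohort not in sums:
            sums[cohort] = [0.0] * len(z)
        sums[cohort] = [s + x for s, x in zip(sums[cohort], z)]
        counts[cohort] += 1
    return {g: [s / counts[g] for s in total] for g, total in sums.items()}

# Hypothetical warm devices and cohort keys.
warm_devices = {"d1": [1.0, 0.0], "d2": [0.0, 1.0], "d3": [0.5, 0.5]}
cohorts = {"d1": ("US", "tv"), "d2": ("US", "tv"), "d3": ("CA", "mobile")}
z_g = cohort_embeddings(warm_devices, cohorts)

# A new US TV device with no watch history borrows its cohort's average.
new_device_embedding = z_g[("US", "tv")]
```

The resulting vector lives in the same space as the content embeddings, so it can be used directly as the query for retrieval.
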
- Implementation
- Platform: Kumo GNN; frameworks: PyTorch Frame FT-Transformer for tabular/text/embedding encoding; ANN index for retrieval.
- Temporal validity rule enforced during training to avoid label leakage (device tower only sees τ < t; labels from (t, t+Δ]).
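
The temporal validity rule and the short-view filter can be sketched as a single split function. The event tuple layout and the reading of r_min as a minimum viewing-time threshold are assumptions for illustration.

```python
def temporal_split(events, t, delta, r_min=0.0):
    """Split watch events into device-tower history and supervision labels.

    events: iterable of (device_id, content_id, timestamp, view_time).
    History keeps edges with timestamp < t; labels keep edges with
    timestamp in (t, t + delta]. Views shorter than r_min are dropped.
    """
    kept = [e for e in events if e[3] >= r_min]
    history = [e for e in kept if e[2] < t]
    labels = [e for e in kept if t < e[2] <= t + delta]
    return history, labels

events = [
    ("d1", "c1", 3.0, 1200.0),   # valid history
    ("d1", "c2", 8.0, 5.0),      # short view, filtered out
    ("d1", "c3", 10.0, 900.0),   # exactly at t: neither history nor label
    ("d1", "c4", 12.0, 600.0),   # inside the prediction window
    ("d1", "c5", 20.0, 600.0),   # beyond t + delta
]
history, labels = temporal_split(events, t=10.0, delta=7.0, r_min=30.0)
```

Keeping the two intervals disjoint is what prevents a label edge from also appearing in the device tower's input.
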
Implications for AI Economics
- Faster monetization of new content
- Immediate, CF-aware embeddings allow newly ingested titles to be surfaced and measured quickly, reducing time-to-first-impression and accelerating discovery, which can raise early lifetime value and improve ROI on content acquisition.
- More efficient promotion allocation
- Surrogate-neighbor strategy lets the system promote cold items based on proximity to warm, proven items; this can reduce manual curation and A/B testing costs for new titles and enable automated, scalable promotion policies.
- Reduction in opportunity and exploration costs
- By enabling retrieval-based exposure for new items, platforms can better exploit long-tail catalogs and reduce the opportunity cost of waiting for organic signals to accumulate.
- Operational and measurement benefits
- Having indexable embeddings from day one simplifies serving architectures (ANN retrieval) and downstream A/B measurement of content worth, facilitating quicker editorial or acquisition decisions.
- Market design and content acquisition strategy implications
- If the model reliably maps intrinsic signals to CF preferences, platforms may place greater weight on metadata/semantic investments (e.g., high-quality descriptions, LLM embeddings) in content acquisition and metadata enrichment budgeting.
- Cohort-based device priors reduce onboarding friction and may alter incentives around user acquisition channels (e.g., lower marginal cost to serve new device cohorts if cohort priors work well).
- Potential economic risks and trade-offs
- Popularity or surrogate bias: retrieving nearest warm neighbors may preferentially connect cold items to already popular warm titles, potentially reinforcing popularity bias and reducing exposure diversity — with implications for platform welfare and long-tail creator revenues.
- Dependence on feature quality: the method’s effectiveness depends on rich, reliable intrinsic features (including LLM embeddings). Investment costs for feature engineering/LLM compute become economically relevant inputs.
- Strategic behavior: content producers might optimize metadata/semantic signals to game cold-start retrieval, shifting incentives toward metadata manipulation unless safeguards are in place.
- Broader generalizability
- The representation-completion principle (learned, side-information-driven embeddings + surrogate retrieval) is applicable beyond streaming: marketplaces, job platforms, news/article recommendation, and other contexts with streaming item arrivals and immediate indexing needs.
- Policy and measurement suggestions for platforms
- Track diversity and novelty alongside short-term engagement to avoid long-run concentration effects.
- Quantify the marginal value of improved metadata (LLM embeddings) vs. incremental CF data collection to guide investment.
- Monitor and mitigate potential manipulation of intrinsic features used in RHS encoding.
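
One way to track exposure concentration alongside engagement is a Gini coefficient over per-title impression counts. The metric choice is an assumption of this sketch, not something the paper prescribes, and the impression counts are made up.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = equal, near 1 = concentrated)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over values sorted in ascending order.
    weighted = sum((rank + 1) * x for rank, x in enumerate(xs))
    return (2 * weighted) / (n * total) - (n + 1) / n

even_exposure = gini([5, 5, 5, 5])      # every title gets the same impressions
skewed_exposure = gini([0, 0, 0, 4])    # one title absorbs all impressions
```

Comparing this value between treatment and control arms of an experiment would indicate whether a cold-start policy is concentrating or spreading exposure.
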
Limitations and open questions (economics perspective)
- The surrogate approach provides a behavioral proxy but does not create true interaction history; long-term retention and downstream engagement effects need continued monitoring.
- No numeric effect sizes were reported in the excerpt; platform-specific lifts and ROI analyses are necessary to quantify economic impact.
- Balancing immediate promotion of cold items with ensuring fair exposure among creators and maintaining catalog diversity requires explicit objectives beyond click/view maximization.
Assessment
Claims (11)
| Claim | Category | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|---|
| Collaborative filtering and graph-based recommendation models are highly effective because they leverage observed user interactions, but this dependence creates a fundamental cold-start challenge when newly added content has no interaction history. | Other | negative | high | cold-start challenge for new content (lack of interaction history) | 0.6 |
| In Tubi's production retrieval system, new content must be assigned a standalone embedding immediately, and the model must also produce device embeddings suitable for approximate nearest-neighbor retrieval. | Other | null_result | high | serving/operational constraint: immediate standalone content embedding and device embeddings for ANN | 0.3 |
| We formulate cold-start recommendation as an inductive graph-completion problem on a temporal bipartite device-content graph. | Other | null_result | high | problem formulation (inductive graph-completion on temporal bipartite graph) | 0.6 |
| We propose Shallow-RHS, an asymmetric link-prediction architecture in which the left-hand side (LHS) device tower leverages temporally valid watch-history message passing to capture collaborative signals, while the right-hand side (RHS) content tower is intentionally shallow and encodes content solely from intrinsic features. | Other | null_result | high | model architecture behavior (device tower uses message passing; content tower shallow, intrinsic-features-only) | 0.6 |
| The RHS content tower does not use ID-based embeddings, content-side subgraphs, neighbor aggregation, or interaction-derived representations, forcing the content encoder to map intrinsic features into a collaborative-filtering-aware embedding space. | Other | positive | high | content encoder representation (mapping intrinsic features into CF-aware embedding space) | 0.6 |
| After training, the learned content encoder generates embeddings for both warm and newly ingested content, enabling implicit graph completion through retrieval of warm surrogate neighbors. | Other | positive | high | ability to generate embeddings for new content and enable implicit graph completion | 0.6 |
| We extend the representation-completion principle to device cold-start by constructing cohort-based embeddings from demographic features. | Other | null_result | high | device cold-start embedding construction (cohort-based demographics) | 0.6 |
| Large-scale online experiments demonstrate consistent relative improvements in content cold-start engagement. | Adoption Rate | positive | high | content cold-start engagement | 0.6 |
| Large-scale online experiments demonstrate consistent relative improvements in promotion speed. | Adoption Rate | positive | high | promotion speed (how quickly new content is promoted) | 0.6 |
| Large-scale online experiments demonstrate consistent relative improvements in impression acquisition. | Adoption Rate | positive | high | impression acquisition (number/rate of impressions for content) | 0.6 |
| Large-scale online experiments demonstrate consistent relative improvements in device cold-start engagement. | Adoption Rate | positive | high | device cold-start engagement | 0.6 |