Fine-tuned LLMs make ad delivery more stable and predictive: a semantic, graph-augmented retrieval layer reduced sensitivity to small creative changes and improved engagement in production A/B tests, demonstrating practical gains from LLMs in large-scale ad recommendation.

LLM Retrieval for Stable and Predictable Ad Recommendations

Vinodh Kumar Sunkara, Satheeshkumar Karuppusamy, Hangjun Xu, Sai Deepika Regani, Kshitij Gupta, Gaby Nahum, Sneha Iyer, Jean-Baptiste Fiot, Yinglong Guo, Xiaowen Guo, Atul Jangra, Yucheng Liu, Jinghao Yan, Vijay Pappu, Benjamin Schulte, Deepak Chandra · May 21, 2026

arXiv · RCT · Evidence strength: medium · Relevance: 7/10

A production-scale, fine-tuned LLM-based semantic candidate generation and graph expansion pipeline substantially improved ad retrieval predictability and traditional engagement metrics in offline and randomized online A/B tests by making delivery robust to small creative perturbations.

Traditional ads recommendation systems have primarily focused on optimizing for prediction accuracy of click or conversion events using canonical metrics such as recall or normalized discounted cumulative gain (NDCG). With the hyper-growth of ads inventory and liquidity with generative AI technologies, the prediction stability and predictability is becoming increasingly critical. Intuitively, prediction stability and predictability can be defined to quantify system robustness with respect to minor/noisy input (ads, creatives) perturbations, the lack of which could lead to advertiser perceivable problems such as repeatability, cold start and under-exploration. In this paper, we introduce a new evaluation framework for quantifying stability and predictability of an ads recommender system, and present an online validated semantic candidate generation framework powered by fine-tuned Large Language Models (LLMs) that showed significant improvement along these metrics by fundamentally improving the semantic-awareness of the system. The approach extracts hierarchical semantic attributes from ad creatives to obtain LLM representations, which serve as the foundation for graph-based expansion, ensuring the retrieved candidates encapsulate semantic variants of an ad, guaranteeing that small creative variants from the advertiser yield consistent and explainable delivery results to the user. We tested this LLM ads retrieval framework in a large-scale industrial ads recommendation system, demonstrating significant improvements across offline and online A/B experiments, showcasing gains in both predictability and traditional performance metrics. Although evaluated in the ads stack, this is a general framework that can be applied broadly to any large-scale recommendation and retrieval systems facing similar scaling and predictability challenges.

Summary

Main Finding

Fine-tuned LLMs used to extract hierarchical semantic attributes from ad creatives and to drive graph-based candidate expansion materially improve both traditional ad-retrieval metrics and a newly introduced predictability metric. In a large-scale online A/B test, the LLM retrieval layer increased final-stage recall and top-line ad performance while reducing unstable delivery across small creative perturbations (A/A’). The approach yields more stable, explainable, and semantically coherent candidate sets for downstream ranking.

Key Points

New focus: Introduces predictability as a first-class metric for ad systems — measuring how small, non-semantic perturbations of an ad (A → A’) change delivery and conversion outcomes.
- StatSigDiff: an aggregated, revenue-normalized measure of statistically significant A vs A’ conversion differences.
- Median Absolute Deviation (MAD) of daily relative impression differences used to quantify variability.
LLM-driven candidate generation pipeline:
- Fine-tune an instruct LLM on ad engagement data to produce hierarchical semantic attributes (categories, phrases, contextual captions).
- Build an ad-to-ad semantic graph (nodes = ads, edges = shared LLM-extracted attributes); use graph traversal and Jaccard-based fuzzy matching to expand seed candidates.
- Serve candidates in a scalable, low-latency retrieval service optimized for GPU batch inference.
Experimental results (online A/B):
- Top-line online metric: +0.45% (statistically significant) for the segment tested.
- Final-stage recall: +1.2% relative increase.
- Predictability: StatSigDiff (A/A’) reduced by 8.62% (test vs control).
- Variability: MAD of daily relative impression differences for A/A’ pairs improved by 45%.
Retrieval behavior: The LLM generator concentrated high-quality candidates at small Top-K and contributed substantial incremental recall potential at larger K (e.g., up to ~1.89x at Top-200).
System-level considerations: The system uses Llama3-8B Instruct (text-only) for zero-shot and fine-tuned inference on a dataset of roughly tens of millions of textual ad descriptions; the production design emphasizes horizontal GPU scaling and a real-time retrieval service.
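
The MAD variability metric listed above can be illustrated with a minimal sketch. The summary does not give the exact formula, so the symmetric normalization of the daily difference (dividing by the mean of the two impression counts) and the function names `relative_diffs` and `mad` are assumptions made for illustration, not the paper's definition.

```python
from statistics import median

def relative_diffs(imps_a, imps_a_prime):
    """Daily relative impression difference for one A/A' pair.

    Assumption: the difference is normalized by the mean of the two
    daily counts; days where both ads got zero impressions are skipped.
    """
    diffs = []
    for a, b in zip(imps_a, imps_a_prime):
        mean = (a + b) / 2
        if mean > 0:
            diffs.append((a - b) / mean)
    return diffs

def mad(values):
    """Median absolute deviation around the median."""
    center = median(values)
    return median([abs(v - center) for v in values])

# Hypothetical daily impressions for an ad and its shadow copy.
daily = relative_diffs([100, 110, 90, 105], [100, 90, 110, 95])
variability = mad(daily)
```

A lower value means the ad and its perturbed copy received more similar delivery day to day, which is the direction the summary reports as an improvement.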

Data & Methods

Data:
- Input: textual ad creatives (title, description) and existing engagement labels; dataset size on the order of tens of millions.
- Shadow ads: created A/A’ pairs by publishing a copy of an ad with a minor, non-semantic perturbation to evaluate predictability.
Model:
- Base: Llama3-8B Instruct (open-source text LLM) fine-tuned on ad engagement / retrieval tasks.
- Outputs: hierarchical semantic tokens/attributes, category scores, phrase and token sets per ad.
Retrieval pipeline:
- Stage 1: LLM generates categories, attributes, contextual captions; yields (category, score) pairs per ad.
- Stage 2: Build semantic graph and compute ad-to-ad relevance SR(Ad1,Ad2) via overlap of category/phrase/token sets (Jaccard-style fuzzy matching with phrase-first fallback to token matches).
- Graph traversal expands seed ads to semantically similar candidates; scoring ranks candidates for downstream ranking stages.
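
The retrieval stages above can be sketched as follows. This is an illustrative approximation only: the summary does not specify how SR(Ad1, Ad2) combines its components, so the equal 0.5/0.5 weighting, the one-hop traversal, the treatment of categories as plain sets (ignoring their scores), the function names, and the sample ads are all assumptions.

```python
from collections import defaultdict

def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def semantic_relevance(ad1, ad2):
    """Hypothetical SR(Ad1, Ad2): category overlap plus phrase overlap,
    falling back to token overlap when no phrase is shared."""
    category_score = jaccard(ad1["categories"], ad2["categories"])
    detail_score = jaccard(ad1["phrases"], ad2["phrases"])
    if detail_score == 0.0:  # phrase-first, then token fallback
        detail_score = jaccard(ad1["tokens"], ad2["tokens"])
    return 0.5 * category_score + 0.5 * detail_score  # assumed equal weights

def build_graph(ads):
    """Connect ads that share at least one category or phrase."""
    index = defaultdict(set)  # attribute -> ids of ads carrying it
    for ad_id, ad in ads.items():
        for attr in ad["categories"] | ad["phrases"]:
            index[attr].add(ad_id)
    graph = defaultdict(set)
    for members in index.values():
        for ad_id in members:
            graph[ad_id] |= members - {ad_id}
    return graph

def expand_seeds(seed_ids, ads, graph, top_k=10):
    """One-hop expansion: score each neighbor by its best SR to any seed."""
    scores = {}
    for seed in seed_ids:
        for neighbor in graph.get(seed, ()):
            if neighbor in seed_ids:
                continue
            score = semantic_relevance(ads[seed], ads[neighbor])
            scores[neighbor] = max(scores.get(neighbor, 0.0), score)
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]

# Hypothetical LLM-extracted attributes; "a_prime" is a minor variant of "a".
shoe = {"categories": {"footwear"},
        "phrases": {"running shoes", "free shipping"},
        "tokens": {"running", "shoes", "free", "shipping"}}
ads = {
    "a": shoe,
    "a_prime": dict(shoe),
    "b": {"categories": {"footwear"}, "phrases": {"trail sneakers"},
          "tokens": {"trail", "sneakers", "running"}},
    "c": {"categories": {"travel"}, "phrases": {"cheap flights"},
          "tokens": {"cheap", "flights"}},
}
graph = build_graph(ads)
candidates = expand_seeds({"a"}, ads, graph)
```

In this toy setup the near-duplicate `a_prime` ranks first and the unrelated travel ad is never reached, which mirrors the intended property that small creative variants land in the same candidate neighborhood.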
Evaluation:
- Online A/B test: treatment = LLM-based candidate generator vs baseline (ensemble of two-tower, embedding, graph generators).
- Metrics: Recall@K, online top-line metric (clicks/conversions), StatSigDiff (A/A’ difference aggregated by revenue), and MAD of daily impression relative difference for A/A’ pairs.
- Statistical rules: StatSigDiff uses a 90% confidence threshold (1.65σ) for per-pair significance under Gaussian approximation.
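
One way to read the StatSigDiff rule above is sketched here. Only the 1.65σ threshold and the revenue-based aggregation come from the summary; the per-pair z-score (difference in conversion counts divided by the square root of their sum, a Gaussian approximation to count data), the revenue-share aggregation, and the names `ShadowPair`, `is_significant`, and `stat_sig_diff` are assumptions.

```python
import math
from dataclasses import dataclass

Z_THRESHOLD = 1.65  # 90% confidence threshold stated in the summary

@dataclass
class ShadowPair:
    conv_a: int        # conversions for the original ad A
    conv_a_prime: int  # conversions for the perturbed copy A'
    revenue: float     # revenue attributed to the pair (aggregation weight)

def is_significant(pair, z=Z_THRESHOLD):
    """Assumed test: |A - A'| / sqrt(A + A') exceeds the z threshold."""
    total = pair.conv_a + pair.conv_a_prime
    if total == 0:
        return False
    return abs(pair.conv_a - pair.conv_a_prime) / math.sqrt(total) > z

def stat_sig_diff(pairs):
    """Assumed aggregation: share of revenue on significantly different pairs."""
    total_revenue = sum(p.revenue for p in pairs)
    if total_revenue == 0:
        return 0.0
    flagged = sum(p.revenue for p in pairs if is_significant(p))
    return flagged / total_revenue

# Hypothetical A/A' pairs.
pairs = [
    ShadowPair(100, 100, 50.0),  # identical outcomes
    ShadowPair(100, 140, 30.0),  # z is about 2.58, flagged
    ShadowPair(10, 12, 20.0),    # z is about 0.43, not flagged
]
score = stat_sig_diff(pairs)
```

Under this reading, a lower score means less revenue sits on ad pairs whose outcomes diverge despite a non-semantic edit.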
Infrastructure:
- Distributed, GPU-backed inference serving; optimized batch processing and low-latency retrieval service.

Implications for AI Economics

Advertiser-facing value and market trust:
- Improved predictability (lower StatSigDiff and lower MAD) reduces advertiser uncertainty about delivery outcomes from small creative edits, strengthening advertiser trust and lowering perceived delivery risk — likely to reduce churn and increase willingness to run iterative creative tests.
- Better explainability (semantic attributes + graph neighborhoods) helps justify delivery patterns to advertisers and can support higher-value contracts or different pricing tiers for “predictable delivery” guarantees.
Auction dynamics and pricing:
- More consistent candidate pools reduce noise in auction outcomes, which can stabilize bid-to-win probabilities and potentially compress bid volatility; platforms may exploit this to offer finer-grained pacing and reserve strategies.
- If predictability concentrates higher-quality candidates earlier, it can affect supply of competitive impressions; platforms must account for changed supply elasticity when setting prices or reserve rules.
Cold-start and inventory expansion:
- Semantic generalization via LLMs helps cold-start new creatives and campaigns by placing them into semantic neighborhoods, reducing initial exploration costs and improving early performance estimates — lowering barriers to entry for small advertisers.
Welfare and externalities:
- Semantic grouping can increase relevance and user engagement, raising ad effectiveness (platform revenue) and user surplus (better content fit). However, homogenization risks (over-retrieving semantically similar creatives) could reduce ad diversity and advertiser differentiation over time.
Cost–benefit and capital intensity:
- Gains are modest in magnitude (+0.45% top-line lift, +1.2% relative recall) but meaningful at scale; organizations must weigh them against increased inference costs (GPU operations, storage for graph indices, fine-tuning) and engineering complexity.
- The approach scales best where marginal revenue per impression is high enough to justify LLM serving costs; platforms should model ROI by segment (high-value verticals vs low-margin inventory).
Competition and adoption:
- If effective, LLM-based predictability can become a competitive differentiator; rival platforms may adopt similar semantic retrieval layers, shifting competition from pure prediction accuracy to delivery stability, explainability, and cost efficiency.
Risk management & policy:
- Reliance on LLMs raises operational risks (model drift, hallucinated attributes) and data-privacy considerations (sensitive creative text). Production deployments should include monitoring, auditing, and fallback candidate generation methods to manage downstream legal and compliance risks.

Overall, the paper demonstrates a practically deployable LLM-based retrieval layer that meaningfully improves ad delivery stability and recall. From an AI-economics perspective, the approach increases the predictability of a key market-making function (ad matching), with consequences for advertiser trust, auction stability, cold-start efficiency, and platform cost structures that operators should model carefully.

Assessment

Paper Type: RCT
Evidence Strength: medium. The use of online A/B tests in production gives credible causal evidence that the LLM retrieval pipeline changed outcomes, and offline metrics corroborate the findings; however, the paper (as summarized) omits key details needed to fully assess strength (e.g., sample sizes, randomization balance, test duration, effect sizes and confidence intervals, heterogeneity analyses, and whether results were replicated), which limits confidence in the generality and magnitude of effects.
Methods Rigor: medium. The methodology combines sensible technical components (fine-tuned LLMs, hierarchical semantic attribute extraction, graph-based expansion) with standard A/B validation, but the rigor is weakened by lack of detail about experimental design (pre-registration, randomization unit, statistical corrections), model tuning/data splits, robustness checks, computational/latency trade-offs, and potential confounders (seasonality, inventory shifts).
Sample: Production traffic from a large-scale industrial ads recommendation system: ad creatives (images/text), advertiser inventory and user impressions were used to build LLM-based semantic representations and to run both offline evaluations and randomized online A/B tests in the live system; exact sample sizes, platform details, geographic/vertical coverage, and duration are not reported in the summary.
Themes: innovation adoption
Identification: Randomized online A/B experiments in a large-scale production ads recommender system comparing the LLM-based semantic candidate generation pipeline to the baseline retrieval pipeline; supported by offline evaluation of stability/predictability metrics and ablation-style analyses of semantic attribute extraction and graph expansion.
Generalizability:
- Single proprietary platform: results may not generalize to other ad platforms or recommendation stacks with different auction mechanics or inventory composition.
- Unspecified LLM/model and fine-tuning details: replication depends on model size, training data, and hyperparameters which are not provided.
- Ad format and market scope unclear: performance may differ across creative types (video, native, display) or geographic/vertical markets.
- Computational cost and latency trade-offs not reported: benefits may be offset by inference costs or engineering constraints in other settings.
- Short or unreported A/B duration and timing: seasonal effects or short-run novelty could drive results.
- Focus on retrieval stage: downstream effects on auction dynamics, advertiser bidding behavior, and long-run marketplace equilibrium are not evaluated.
- Advertiser strategic responses and long-term metrics (e.g., retention, LTV) not assessed.

Claims (8)

Claim	Category	Direction	Confidence	Outcome	Details
We introduce a new evaluation framework for quantifying stability and predictability of an ads recommender system.	Output Quality	positive	high	prediction stability and predictability (new evaluation metrics/framework)	0.6
We present an online validated semantic candidate generation framework powered by fine-tuned Large Language Models (LLMs) that showed significant improvement along these metrics by fundamentally improving the semantic-awareness of the system.	Output Quality	positive	high	prediction stability and predictability; semantic-awareness of candidate generation	0.6
The approach extracts hierarchical semantic attributes from ad creatives to obtain LLM representations, which serve as the foundation for graph-based expansion to retrieve semantic variants of an ad.	Output Quality	positive	high	semantic coverage/representation of ad candidates (retrieval of semantic variants)	0.6
This LLM-based retrieval ensures that small creative variants from the advertiser yield consistent and explainable delivery results to the user.	Output Quality	positive	medium	consistency/repeatability and explainability of delivery for small creative variants	0.18
We tested this LLM ads retrieval framework in a large-scale industrial ads recommendation system, demonstrating significant improvements across offline and online A/B experiments, showcasing gains in both predictability and traditional performance metrics.	Output Quality	positive	high	predictability; traditional performance metrics (e.g., recall, NDCG, click/conversion prediction metrics)	0.6
Traditional ads recommendation systems have primarily focused on optimizing for prediction accuracy of click or conversion events using canonical metrics such as recall or normalized discounted cumulative gain (NDCG).	Output Quality	negative	high	optimization focus on click/conversion prediction accuracy (recall, NDCG)	0.3
The lack of prediction stability and predictability can lead to advertiser-perceivable problems such as repeatability issues, cold start, and under-exploration.	Output Quality	negative	high	repeatability, cold start, under-exploration (advertiser-perceived issues)	0.3
Although evaluated in the ads stack, this is a general framework that can be applied broadly to any large-scale recommendation and retrieval systems facing similar scaling and predictability challenges.	Adoption Rate	positive	high	generalizability/applicability to other large-scale recommendation and retrieval systems	0.1