The Commonplace

A three-phase distillation recipe compresses large retrieval models into compact encoders that nearly match teacher accuracy while cutting latency and raising ad revenue. Deploying the student retriever in Bing Ads recovered over 98% of teacher precision, cut query-encoder latency by up to 27×, and produced a 1% revenue lift in a production A/B test.

HARNESS-LM: A Three-Phase Training Recipe for Harnessing SLMs in Sponsored Search Retrieval
Vipul Gupta, Shikhar Mohan, Lakshya Kumar, Pranjal Chitale, Nikit Begwani, Amit Singh, Manik Varma · May 22, 2026
Source: arXiv · Paper type: rct · Evidence: high · Relevance: 9/10
HARNESS-LM distills a billion-parameter retriever into a <600M student that recovers >98% of teacher precision while cutting encoder latency up to 27x and producing a +1% revenue uplift in a Bing Ads online A/B test.

In the competitive landscape of sponsored search, balancing retrieval quality with production latency is a critical challenge. While large retrieval models based on Small Language Models (SLMs) such as Qwen3-Embedding-4B/8B set strong upper bounds on public benchmarks, their deployment in high-throughput, latency-sensitive environments remains impractical. In this paper, we present HARNESS-LM (HLM), a three-phase training framework for transferring the capabilities of large-scale retrievers into compact, cost-efficient models. The approach comprises: (1) training a high-performance reference ("teacher") retriever by fine-tuning a billion-parameter-scale SLM; (2) aligning query representations via an L2 objective to distill knowledge into a sub-600M parameter student encoder; and (3) applying a final contrastive refinement stage to optimize the student for retrieval performance. We also present a comprehensive empirical study of key design choices, including alignment objectives, embedding dimensionality, model scale, architecture, and optimization strategies, to identify configurations that are most effective in production settings. On a real-world Bing Ads evaluation benchmark, HLM recovers over 98% of the reference retriever's precision across multiple settings, while delivering up to 27x lower online query-encoder latency and 20x higher throughput on NVIDIA A100 GPUs. Online A/B testing on Bing Ads further shows a +1% Revenue, +0.6% Impression, and +0.4% Click uplift over the current ensemble of retrievers running in production with the deployed 190M parameter model, clearly highlighting the practical efficacy of the HLM recipe in a real-world sponsored search setting.

Summary

Main Finding

HARNESS-LM (HLM) is a practical three-phase training recipe that transfers the retrieval quality of large Small Language Model (SLM) retrievers into compact, deployable query encoders. HLM recovers >98% of a high-capacity teacher retriever’s precision while yielding large production efficiency gains (up to 27× lower query-encoder latency and 20× higher throughput on A100 GPUs). Online A/B tests on Bing Ads show business gains (+1% Revenue, +0.6% Impressions, +0.4% Clicks) when replacing the prior production ensemble with an HLM-trained compact model.

Key Points

  • Three-phase recipe:
    • Phase 1 (Teacher): train a high-fidelity teacher retriever using large SLM backbones (e.g., Qwen3-Embedding 4B/8B) and optionally richer offline/oracle features.
    • Phase 2 (L2 Alignment): train a compact student query encoder (<600M params in experiments) to match the teacher query embeddings via direct ℓ2 regression on an unsupervised alignment corpus; the teacher is frozen.
    • Phase 3 (Contrastive Refinement, CR): freeze the teacher document encoder and refine the aligned student via supervised contrastive learning (Qwen-style InfoNCE with mined hard negatives and same-tower negatives).
  • Alignment via simple ℓ2 regression proved more effective/practical than more complex alternatives (contrastive distillation or kernel alignment) for making the student compatible with a frozen teacher document space.
  • Optional progressive structured pruning (removing whole layers and FFN units), with re-alignment between pruning steps, yields compact dense models with real latency benefits. The successive prune-and-align pipeline is inspired by cascade pruning.
  • Empirical trade-offs:
    • Student (0.6B) aligned to teacher (4B) recovers teacher-level retrieval when used with frozen teacher document embeddings.
    • Pruning can reduce latency dramatically (reported up to 6× in a CPU/low-end-GPU target example) with minimal precision loss (e.g., 1.1% absolute drop in precision in one case).
    • Overall production contrasts: up to 27× lower online query-encoder latency and 20× higher throughput (A100), while retaining >98% precision of the teacher.
  • Real-world validation: offline Bing Ads retrieval evaluation + online A/B test on live traffic showing production metric lifts.
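
The contrastive objective behind Phases 1 and 3 can be illustrated with a minimal sketch. This is the textbook InfoNCE loss in pure Python, not the paper's Qwen CL variant: same-tower negatives and false-negative masking are omitted, and the embeddings, the `temperature` value, and all function names are hypothetical. The document embeddings are plain constants here, standing in for a frozen teacher document encoder.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def info_nce(query_emb, pos_doc_emb, neg_doc_embs, temperature=0.05):
    """Textbook InfoNCE: negative log-softmax of the positive document
    among the positive plus all negatives, using cosine similarity."""
    q = normalize(query_emb)
    logits = [dot(q, normalize(pos_doc_emb)) / temperature]
    logits += [dot(q, normalize(d)) / temperature for d in neg_doc_embs]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]

# Hypothetical frozen document embeddings (constants, never updated).
positive_doc = [1.0, 0.0]
hard_negatives = [[0.8, 0.6], [0.0, 1.0]]

# A student query embedding near the positive vs. one near a negative.
loss_good = info_nce([1.0, 0.05], positive_doc, hard_negatives)
loss_bad = info_nce([0.0, 1.0], positive_doc, hard_negatives)
```

In this toy setup the query embedding that sits close to the positive document yields the smaller loss, which is the signal that refinement would push the student toward while the document side stays fixed.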

Data & Methods

  • Data:
    • Alignment corpus: large unlabeled query-like text corpus used for ℓ2 alignment.
    • Supervised contrastive data: labeled query–document (query–ad) pairs with hard negatives for CR phase.
    • Evaluation: internal Bing Ads retrieval benchmark and live A/B traffic metrics (Revenue, Impressions, Clicks).
  • Models / Architectures:
    • Teacher: SLM-based encoders (Qwen3-Embedding 4B/8B) trained with a contrastive objective.
    • Student: compact SLM-derived query encoder (examples: Qwen3-Embedding 0.6B, pruned variants down to 2–14 layers).
    • Asymmetric deployment: frozen teacher document encoder used offline to build ANN index; compact student query encoder used online.
  • Objectives:
    • Teacher trained with modified InfoNCE (Qwen CL) that includes mined hard negatives, same-tower negatives, and false-negative masking.
    • Alignment: direct ℓ2 regression, L_align = Σ_q ||f_S(q) − f_T(q)||_2^2, summed over alignment queries q, where f_S is the student query encoder and f_T is the frozen teacher query encoder.
    • Contrastive Refinement: supervised Qwen CL loss with frozen teacher document encoder.
  • Compression:
    • Structured pruning (Algorithm 1): layer importance scoring and FFN unit importance scoring; remove whole layers/FFN units to produce dense, smaller models.
    • Progressive prune-and-align (Algorithm 2): iteratively prune toward the target sizes (K_Layers, K_FFN), re-aligning after each pruning step to avoid performance shock.
  • Evaluation metrics:
    • Retrieval precision/top-K metrics on offline benchmarks.
    • Latency and throughput measurements on NVIDIA A100 GPUs (and CPU/low-end GPU targets for pruned models).
    • Online business KPIs via A/B testing.
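
The ℓ2 alignment objective listed above can be sketched on a toy problem. This is a minimal illustration under loud assumptions, not the paper's training code: both "encoders" are tiny linear maps over made-up query features, the teacher targets are fixed in advance (mirroring the frozen teacher), and every name, dimension, and learning rate below is hypothetical.

```python
import random

def forward(W, x):
    """Toy encoder: a single linear map W (dim_out x dim_in) applied to features x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def l2_align_loss(student_embs, teacher_embs):
    """Sum over queries of the squared L2 distance between student and teacher embeddings."""
    return sum(
        sum((s - t) ** 2 for s, t in zip(s_vec, t_vec))
        for s_vec, t_vec in zip(student_embs, teacher_embs)
    )

def align_step(W, features, teacher_embs, lr):
    """One full-batch gradient-descent step on the alignment loss.
    Only the student weights W change; teacher targets stay fixed."""
    grad = [[0.0] * len(W[0]) for _ in W]
    for x, t_vec in zip(features, teacher_embs):
        s_vec = forward(W, x)
        for i, (s, t) in enumerate(zip(s_vec, t_vec)):
            for j, xj in enumerate(x):
                grad[i][j] += 2.0 * (s - t) * xj
    return [[w - lr * g for w, g in zip(w_row, g_row)]
            for w_row, g_row in zip(W, grad)]

rng = random.Random(0)
dim_in, dim_out, n_queries = 4, 3, 16
# Hypothetical unlabeled "alignment corpus": random query feature vectors.
features = [[rng.uniform(-1, 1) for _ in range(dim_in)] for _ in range(n_queries)]
# Hypothetical frozen teacher: a fixed linear map whose outputs are the targets.
W_teacher = [[rng.uniform(-1, 1) for _ in range(dim_in)] for _ in range(dim_out)]
teacher_embs = [forward(W_teacher, x) for x in features]

W_student = [[0.0] * dim_in for _ in range(dim_out)]
loss_before = l2_align_loss([forward(W_student, x) for x in features], teacher_embs)
for _ in range(200):
    W_student = align_step(W_student, features, teacher_embs, lr=0.01)
loss_after = l2_align_loss([forward(W_student, x) for x in features], teacher_embs)
```

No labels appear anywhere in the loop: the only supervision is the teacher's own embedding for each query, which is why an unlabeled query corpus suffices for this phase.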

Implications for AI Economics

  • Cost-efficiency trade-off: HLM demonstrates a concrete pathway to capture much of large-SLM retrieval quality while dramatically reducing online inference cost. This shifts compute burden offline (teacher training and document encoding) and reduces per-query serving cost—improving ROI on model improvements.
  • Capital allocation: organizations can justify investment in large offline teacher models (higher R&D and training cost) because the teacher’s improvements can be distilled into much cheaper online components, generating direct revenue uplift with manageable serving cost.
  • Deployment strategy: the asymmetric design (heavy offline document encoder + compact online query encoder) is economically attractive for high-throughput, latency-sensitive markets (ads, e-commerce search). It enables using richer or oracle features offline without inflating online cost.
  • Market effects: even modest percentage gains in retrieval quality and latency can materially affect ad auctions, click volumes, and revenue; HLM’s reported +1% revenue and other uplifts exemplify the economic leverage of improved retrieval at scale.
  • Infrastructure and energy implications: moving complexity offline can reduce peak-serving GPU requirements and enable cheaper CPU or low-end GPU serving for compact encoders—reducing marginal energy costs per query and potentially lowering capex for online infrastructure.
  • Operational considerations and risks:
    • Reindexing cost & dynamics: because the approach relies on frozen teacher document embeddings for ANN indices, model updates or document-space drift necessitate offline reindexing; this has operational cost and latency implications that must be budgeted.
    • Alignment corpus quality: economic value depends on alignment data representing production query distribution—misalignment can degrade online performance and reduce realized ROI.
    • Concentration of investment: firms that can afford large offline compute to train high-quality teachers may gain competitive advantage, potentially widening gaps between large platforms and smaller players.
  • Generalizability: the HLM recipe applies to other latency-sensitive retrieval markets (recommendation, e-commerce search, QA pipelines) where asymmetric serving and offline precomputation are feasible—suggesting broad economic value across digital ad and search ecosystems.

Assessment

Paper Type: rct

Evidence Strength: high. The paper combines strong offline evidence (retrieval precision and latency/throughput benchmarks) with a large-scale online randomized A/B test that directly measures causal effects on revenue, impressions, and clicks, giving high internal validity for the reported business impacts; however, external validity is limited to the deployed environment.

Methods Rigor: high. A multi-stage distillation recipe is evaluated with detailed ablations (alignment objectives, embedding dimensionality, scale, architecture, optimization), quantitative recovery of teacher precision, hardware throughput/latency benchmarks, and production A/B testing, indicating careful experimental design and multiple complementary evaluations. The summary does not report statistical significance tests or A/B sample sizes in detail, which would further strengthen transparency.

Sample: Proprietary real-world Bing Ads datasets and evaluation benchmark (queries, ads, impressions, clicks, revenue) used for offline retrieval and precision measurements; models include a billion-parameter-scale SLM teacher (e.g., Qwen3-Embedding 4B/8B) and sub-600M student encoders; production online A/B traffic on Bing Ads for revenue/impression/click measurements; GPU performance measured on NVIDIA A100 hardware.

Themes: productivity, adoption

Identification: Randomized online A/B experiment on Bing Ads traffic comparing the HLM student retriever against the production ensemble, attributing differences in revenue, impressions, and clicks to the model change; supported by offline benchmark comparisons and latency/throughput measurements.

Generalizability:
  • Results come from a single commercial platform (Bing Ads) and may not generalize to other ad platforms, organic search, or non-ad retrieval tasks.
  • Proprietary data, production ensemble baseline, and deployment specifics are not fully disclosed, limiting reproducibility.
  • Hardware/throughput gains measured on NVIDIA A100 may differ on other accelerators or CPU-only deployments.
  • Locale, user demographics, and advertiser mix on Bing may differ from other markets, affecting revenue/impression effects.
  • Teacher model architecture and pretraining/data specifics influence distillation outcomes and may not transfer to all large retrievers.

Claims (9)

  • Claim: On a real-world Bing Ads evaluation benchmark, HLM recovers over 98% of the reference retriever's precision across multiple settings. [Output Quality]
    Direction: positive · Confidence: high · Outcome: retriever precision
    Details: over 98% of the reference retriever's precision (1.0)
  • Claim: HLM delivers up to 27x lower online query-encoder latency on NVIDIA A100 GPUs. [Task Completion Time]
    Direction: positive · Confidence: high · Outcome: online query-encoder latency
    Details: up to 27x lower online query-encoder latency (1.0)
  • Claim: HLM delivers up to 20x higher throughput on NVIDIA A100 GPUs. [Task Completion Time]
    Direction: positive · Confidence: high · Outcome: inference throughput
    Details: 20x higher throughput (1.0)
  • Claim: Online A/B testing on Bing Ads shows a +1% Revenue uplift over the current ensemble of retrievers running in production with the deployed 190M parameter model. [Firm Revenue]
    Direction: positive · Confidence: high · Outcome: Revenue
    Details: +1% Revenue (1.0)
  • Claim: Online A/B testing on Bing Ads shows a +0.6% Impression uplift over the current ensemble of retrievers running in production with the deployed 190M parameter model. [Output Quality]
    Direction: positive · Confidence: high · Outcome: Impressions
    Details: +0.6% Impression uplift (1.0)
  • Claim: Online A/B testing on Bing Ads shows a +0.4% Click uplift over the current ensemble of retrievers running in production with the deployed 190M parameter model. [Output Quality]
    Direction: positive · Confidence: high · Outcome: Clicks
    Details: +0.4% Click uplift (1.0)
  • Claim: HARNESS-LM (HLM) is a three-phase training framework for transferring the capabilities of large-scale retrievers into compact, cost-efficient models: (1) train a high-performance reference ('teacher') retriever by fine-tuning a billion-parameter-scale SLM; (2) align query representations via an L2 objective to distill knowledge into a sub-600M parameter student encoder; (3) apply a final contrastive refinement stage to optimize the student for retrieval performance. [Other]
    Direction: positive · Confidence: high · Outcome: model compression / knowledge transfer into compact retriever (0.6)
  • Claim: The paper presents a comprehensive empirical study of key design choices (alignment objectives, embedding dimensionality, model scale, architecture, and optimization strategies) to identify configurations that are most effective in production settings. [Adoption Rate]
    Direction: positive · Confidence: high · Outcome: design configuration effectiveness for production deployment (0.6)
  • Claim: Large retrieval models based on Small Language Models (SLMs) such as Qwen3-Embedding-4B/8B set strong upper bounds on public benchmarks, but their deployment in high-throughput, latency-sensitive environments remains impractical. [Organizational Efficiency]
    Direction: negative · Confidence: high · Outcome: deployability / practicality in latency-sensitive, high-throughput environments (0.6)
