How you collect data determines how well you can evaluate AI: concentrating logging on high-reward actions cuts variance but risks missing signal on actions the target policy may take. This paper derives optimal and practical logging policies that minimize off-policy evaluation error under different knowledge regimes.

Logging Policy Design for Off-Policy Evaluation

Connor Douglas, Joel Persson, Foster Provost · May 14, 2026

arXiv · theoretical · low evidence · 7/10 relevance

The paper formalizes a reward-coverage tradeoff for logging-policy design in off-policy evaluation and derives theoretically optimal logging strategies under known, unknown, and prior/noisy-information regimes, plus practical heuristics when constraints prevent implementing the optimum.

Off-policy evaluation (OPE) estimates the value of a target treatment policy (e.g., a recommender system) using data collected by a different logging policy. It enables high-stakes experimentation without live deployment, yet in practice accuracy depends heavily on the logging policy used to collect data for computing the estimate. We study how to design logging policies that minimize OPE error for given target policies. We characterize a fundamental reward-coverage tradeoff: concentrating probability mass on high-reward actions reduces variance but risks missing signal on actions the target policy may take. We propose a unifying framework for logging policy design and derive optimal policies in canonical informational regimes where the target policy and reward distribution are (i) known, (ii) unknown, and (iii) partially known through priors or noisy estimates at logging time. Our results provide actionable guidance for firms choosing among multiple candidate recommendation systems. We demonstrate the importance of treatment selection when gathering data for OPE, and describe theoretically optimal approaches when this is a firm's primary objective. We also distill practical design principles for selecting logging policies when operational constraints prevent implementing the theoretical optimum.

Summary

Main Finding

Designing logging (exploration) policies substantially improves the accuracy and efficiency of off‑policy evaluation (OPE). There is a fundamental reward–coverage tradeoff: concentrating sampling on high‑reward actions reduces variance, while coverage over actions the target may take controls bias. The paper provides a unified framework for minimizing the IPW estimator mean squared error (MSE) across informational regimes, derives closed‑form optimal logging policies in canonical cases, prescribes shrinkage and empirical‑Bayes corrections when reward estimates are noisy, and gives practical, low‑engineering logging families (top‑k, softmax, power‑normalized) that interpolate between uniform and greedy sampling.

Key Points

Objective and estimator
- Focus: choose logging policy πl to minimize MSE of the inverse‑propensity weighted (IPW) estimator V̂_IPW(πt | πl) for a target policy πt.
- IPW estimator: V̂_IPW = (1/N) Σi [πt(Ai|Xi) / πl(Ai|Xi)] · Ri.
- MSE decomposes as Bias^2 + Var. Bias arises when πl assigns zero probability to actions that πt may select (that is, when weak overlap fails); variance is driven by sampling randomness and reward variance.
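The estimator and the overlap-driven bias described above can be sketched in a few lines of Python. This is a minimal illustration under assumed inputs, not the authors' code: the function names (`ipw_estimate`, `expected_ipw`, `simulate_logs`), the single context `"x0"`, and the numeric values of `mu` and `pi_t` are all hypothetical.

```python
import random

def ipw_estimate(logs, pi_t):
    """IPW estimate of the target policy's value.

    logs: iterable of (context, action, reward, logging_propensity) tuples.
    pi_t: dict mapping context -> list of target probabilities per action.
    """
    total, n = 0.0, 0
    for x, a, r, p_log in logs:
        if p_log <= 0.0:
            raise ValueError("a logged action must have positive logging propensity")
        total += pi_t[x][a] / p_log * r
        n += 1
    if n == 0:
        raise ValueError("no logged samples")
    return total / n

def expected_ipw(pi_t_x, pi_l_x, mu_x):
    """Exact expectation of one IPW term in a single context.

    Actions with zero logging propensity are never observed, so their
    contribution pi_t * mu is silently lost (this is the bias term).
    """
    return sum(pt * m for pt, pl, m in zip(pi_t_x, pi_l_x, mu_x) if pl > 0)

def simulate_logs(n, x, pi_l_x, mu_x, rng):
    """Draw n logged samples with Bernoulli rewards from one context."""
    logs = []
    actions = range(len(pi_l_x))
    for _ in range(n):
        a = rng.choices(actions, weights=pi_l_x)[0]
        r = 1 if rng.random() < mu_x[a] else 0
        logs.append((x, a, r, pi_l_x[a]))
    return logs

# Hypothetical single-context example (values chosen for illustration only).
mu = [0.1, 0.5, 0.9]
pi_t = {"x0": [0.2, 0.3, 0.5]}
true_value = sum(p * m for p, m in zip(pi_t["x0"], mu))  # 0.62

rng = random.Random(0)
uniform = [1 / 3, 1 / 3, 1 / 3]
estimate = ipw_estimate(simulate_logs(20000, "x0", uniform, mu, rng), pi_t)

# Logging policy that never plays action 2: IPW is biased by -pi_t * mu there.
no_overlap = [0.5, 0.5, 0.0]
bias = expected_ipw(pi_t["x0"], no_overlap, mu) - true_value  # -0.45
```

With overlap, the simulated estimate lands close to `true_value`; without it, the estimator systematically misses the mass the target places on the unlogged action, no matter how many samples are collected.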
Reward–coverage tradeoff
- Concentrating mass on probable/high‑reward actions lowers variance of IPW but risks bias by failing to cover actions the target might choose.
- Good logging policies balance concentrating on informative (high‑reward, high‑target-mass) actions and providing sufficient coverage.
Optimal logging policies in canonical regimes
- No information about πt or rewards: uniform randomization is minimax optimal (controls worst‑case error).
- Full information (πt and reward probabilities µ known): optimal logging policy per context takes a Neyman‑allocation form:
  - πl*(a|x) ∝ πt(a|x) · sqrt(µ(a, x)).
  - Intuition: weight sampling toward actions with high target mass and a large reward second moment (for Bernoulli rewards E[R^2] = µ, so sqrt(µ) is the root second moment of the reward); this minimizes IPW variance subject to overlap.
  - Consequences: the optimized logging policy can (a) give lower MSE IPW estimates than on‑policy (A/B) evaluation and (b) accrue higher expected reward during logging than the target policy itself.
- Known distribution over target policies (multi‑evaluation planning): allocate toward a second‑moment pseudo‑target (weights derived from the expected squared target mass); authors give a plug‑in construction to implement this.
- Partially known/noisy reward estimates: using noisy µ̂ directly inflates downstream IPW variance. Solution: posterior shrinkage of µ̂ toward a per‑context across‑action mean under a Gaussian hierarchical prior; derive optimal shrinkage and show an empirical‑Bayes implementation is effective.
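As a rough check of the full-information rule πl*(a|x) ∝ πt(a|x) · sqrt(µ(a, x)) and its two stated consequences, the sketch below computes the per-sample IPW variance in a single context for three logging policies. The helper names and the numbers are hypothetical, and the variance formula assumes Bernoulli rewards and overlap, as in the model described in this summary.

```python
import math

def neyman_logging_policy(pi_t_x, mu_x):
    """Logging propensities proportional to pi_t * sqrt(mu) in one context."""
    weights = [pt * math.sqrt(m) for pt, m in zip(pi_t_x, mu_x)]
    z = sum(weights)
    if z <= 0.0:
        raise ValueError("target puts no mass on actions with positive reward probability")
    return [w / z for w in weights]

def ipw_variance_per_sample(pi_t_x, pi_l_x, mu_x):
    """Variance of one IPW term, using E[R^2] = mu for Bernoulli rewards."""
    second_moment = 0.0
    for pt, pl, m in zip(pi_t_x, pi_l_x, mu_x):
        if pt * m == 0.0:
            continue  # this action contributes nothing to the estimator
        if pl <= 0.0:
            raise ValueError("overlap violated: pi_l = 0 where pi_t * mu > 0")
        second_moment += pt * pt * m / pl
    value = sum(pt * m for pt, m in zip(pi_t_x, mu_x))
    return second_moment - value ** 2

def expected_reward(pi_x, mu_x):
    """Expected reward collected per sample when acting with policy pi_x."""
    return sum(p * m for p, m in zip(pi_x, mu_x))

# Hypothetical single-context example.
mu = [0.1, 0.5, 0.9]
pi_t = [0.2, 0.3, 0.5]
uniform = [1 / 3] * 3
pi_l_star = neyman_logging_policy(pi_t, mu)

var_neyman = ipw_variance_per_sample(pi_t, pi_l_star, mu)
var_on_policy = ipw_variance_per_sample(pi_t, pi_t, mu)
var_uniform = ipw_variance_per_sample(pi_t, uniform, mu)

reward_while_logging = expected_reward(pi_l_star, mu)
reward_of_target = expected_reward(pi_t, mu)
```

In this toy instance the Neyman-style policy has lower per-sample variance than both on-policy and uniform logging, and it also earns more reward while logging than the target policy would, which matches consequences (a) and (b) above.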
Practical policy classes and tuning
- When per‑context optimal propensities are infeasible, three single‑parameter families perform well:
  - Top‑k: sample uniformly among the k actions with the highest estimated scores.
  - Softmax: probabilities ∝ exp(β · score).
  - Power‑normalized: probabilities ∝ score^β (with normalization).
- These interpolate between uniform (β → 0, or k equal to the number of actions for top‑k) and greedy (large β, or k = 1); tuning the greediness trades coverage against concentration. Simulations show well‑tuned members approach the theoretical MSE optima. Guidance: more concentration (larger β) helps in small‑sample regimes and when the action set is moderate; more coverage is needed as the action space or target uncertainty grows.
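The three single-parameter families can be sketched as follows. This is an illustrative implementation under assumptions: the function names are hypothetical, `scores` stands for whatever per-action reward estimate is available in a context, and the power-normalized family is assumed to receive strictly positive scores.

```python
import math

def top_k_policy(scores, k):
    """Uniform over the k highest-scoring actions, zero elsewhere."""
    if not 1 <= k <= len(scores):
        raise ValueError("k must be between 1 and the number of actions")
    order = sorted(range(len(scores)), key=lambda a: scores[a], reverse=True)
    chosen = set(order[:k])
    return [1.0 / k if a in chosen else 0.0 for a in range(len(scores))]

def softmax_policy(scores, beta):
    """Probabilities proportional to exp(beta * score)."""
    scaled = [beta * s for s in scores]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def power_policy(scores, beta):
    """Probabilities proportional to score ** beta (scores must be positive)."""
    if any(s <= 0 for s in scores):
        raise ValueError("power-normalized policy needs strictly positive scores")
    weights = [s ** beta for s in scores]
    z = sum(weights)
    return [w / z for w in weights]

# Hypothetical scores for three actions in one context.
scores = [0.1, 0.5, 0.9]
uniform_like = softmax_policy(scores, 0.0)   # beta = 0 gives uniform
greedy_like = softmax_policy(scores, 50.0)   # large beta concentrates on the best action
top1 = top_k_policy(scores, 1)               # k = 1 is fully greedy
top_all = top_k_policy(scores, len(scores))  # k = number of actions is uniform
power_flat = power_policy(scores, 0.0)       # beta = 0 gives uniform
```

Note that top‑k assigns exactly zero propensity outside the top k, so it can break overlap when the target policy places mass on lower-ranked actions, whereas softmax and power-normalized propensities stay strictly positive for finite β.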
Other insights
- Simple personalized logging can vastly outperform uniform exploration (example: 1,000 personalized samples matched 100,000 uniform samples in one simulation).
- Emphasis on design for OPE differs from bandit objectives: the aim is accurate offline evaluation (minimize estimator error), not cumulative online reward or quickly finding the best arm.
- Constraints (engineering, production risk, single‑slot assumption) matter; authors discuss safe logging and implementation tradeoffs.

Data & Methods

Formal model
- Contextual bandit setup with context space X, action set A, stationary context distribution pX, and Bernoulli rewards R | (A, X) ∼ Bernoulli(µ(a, x)).
- Designer chooses logging propensities πl(a|x); target policy πt may be known, unknown, or drawn from a known distribution.
- Assumptions: consistency and unconfoundedness by construction; weak overlap required to avoid bias.
Estimation objective
- Minimize finite‑sample MSE of IPW estimator across the randomness of contexts, logging actions, and rewards. MSE decomposed analytically into bias (due to missing support) and variance (from sampling).
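To make the objective concrete, the following is a reconstruction (not quoted from the paper) of the variance term for i.i.d. logged samples under the Bernoulli reward model above, together with the per-context minimization that yields the sqrt(µ) rule. It assumes overlap holds, so the bias term is zero and the MSE equals the variance; V(π_t) denotes the true value of the target policy.

```latex
% Variance of the IPW estimator under overlap, using R^2 = R for Bernoulli rewards
\mathrm{Var}\big(\hat V_{\mathrm{IPW}}\big)
  = \frac{1}{N}\left(
      \mathbb{E}_{X \sim p_X}\!\left[\sum_{a \in \mathcal{A}}
        \frac{\pi_t(a \mid X)^2 \, \mu(a, X)}{\pi_l(a \mid X)}\right]
      - V(\pi_t)^2 \right)

% V(\pi_t) does not depend on \pi_l, so minimize the inner sum context by context:
\min_{\pi_l(\cdot \mid x)} \; \sum_{a} \frac{\pi_t(a \mid x)^2 \, \mu(a, x)}{\pi_l(a \mid x)}
  \quad \text{s.t.} \quad \sum_{a} \pi_l(a \mid x) = 1

% Stationarity of the Lagrangian: -\pi_t^2 \mu / \pi_l^2 + \lambda = 0, hence
\pi_l^*(a \mid x)
  = \frac{\pi_t(a \mid x)\sqrt{\mu(a, x)}}{\sum_{b} \pi_t(b \mid x)\sqrt{\mu(b, x)}},
\qquad
\sum_{a} \frac{\pi_t(a \mid x)^2 \, \mu(a, x)}{\pi_l^*(a \mid x)}
  = \Big(\sum_{a} \pi_t(a \mid x)\sqrt{\mu(a, x)}\Big)^{2}
```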
Theoretical derivations
- Closed‑form optimal solutions derived in several informational regimes:
  - Minimax uniform policy for no information.
  - Neyman‑allocation style optimal policy when πt and µ are known: πl* ∝ πt · sqrt(µ).
  - Optimal allocations for distributional knowledge over πt (second‑moment pseudo‑target).
  - Optimal posterior shrinkage factor under Gaussian hierarchical prior and additive noise model for µ̂.
- Connections drawn to importance sampling, Neyman allocation in survey sampling, and optimal experimental design.
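The shrinkage idea can be illustrated with the textbook normal-normal posterior mean, applied per context across actions. This sketch is an assumption-laden stand-in: it is not the paper's derived optimal shrinkage factor, the function name `eb_shrink` is hypothetical, and the noise variance `noise_var` of the estimates is assumed known.

```python
import statistics

def eb_shrink(mu_hat, noise_var):
    """Shrink noisy per-action reward estimates toward their across-action mean.

    Assumed model (standard, not taken from the paper):
        mu_hat[a] = mu[a] + noise,  noise ~ N(0, noise_var),  mu[a] ~ N(m, tau2).
    Empirical Bayes: m is estimated by the mean of mu_hat and tau2 by
    max(0, sample variance of mu_hat - noise_var).
    Returns (shrunk_estimates, shrinkage_factor).
    """
    if len(mu_hat) < 2:
        raise ValueError("need at least two actions to estimate the prior variance")
    m = statistics.mean(mu_hat)
    tau2 = max(0.0, statistics.variance(mu_hat) - noise_var)
    denom = tau2 + noise_var
    factor = tau2 / denom if denom > 0 else 1.0  # 1.0 means no shrinkage
    return [m + factor * (v - m) for v in mu_hat], factor

# Hypothetical noisy estimates for three actions in one context.
mu_hat = [0.1, 0.5, 0.9]
shrunk, factor = eb_shrink(mu_hat, noise_var=0.04)        # moderate noise
all_mean, factor_noisy = eb_shrink(mu_hat, noise_var=1.0)  # noise dominates
unchanged, factor_clean = eb_shrink(mu_hat, noise_var=0.0)  # no noise
```

The shrunk values could then be plugged into the πt · sqrt(µ) rule in place of the raw estimates; the more the noise dominates the spread of the estimates, the closer the resulting logging policy moves toward one that ignores the reward signal.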
Simulations and examples
- Synthetic simulations (e.g., recommender with large action space) illustrate how personalized logging reduces MSE relative to uniform logging and how tuned soft‑greedy policies close the gap to theoretical optima.
- Example quantification: personalized logging with N=1,000 matches uniform logging with N=100,000 in one setup.
Limitations noted
- Focus on IPW (though many practical OPE estimators build on IPW); DR/self‑normalized variants not the main objective here.
- Single‑slot recommendation model (no multi‑slot or interference effects).
- Stationary context distribution and offline, nonadaptive logging design; adaptive/online learning adaptations not the focus.

Implications for AI Economics

Reduced experimentation cost and faster iteration
- Better logging design lowers the sample budget needed for reliable offline evaluation, decreasing time and user exposure costs for A/B tests or live rollouts. This accelerates product improvements and reduces opportunity cost.
Policy selection and product strategy
- Firms can design logging policies that both (i) improve estimation accuracy for many candidate target policies (multi‑evaluation) and (ii) provide acceptable or even improved on‑log value during data collection. This enables safer exploration while preserving user experience and revenue.
Allocation of scarce experimentation resources
- Treating experiment design as an economic allocation problem: choose propensities to minimize downstream decision risk (MSE) subject to operational constraints (user experience, engineering cost). The Neyman‑allocation result gives a principled rule for allocating scarce samples toward actions that matter most for the evaluation objective.
Implications for competition and welfare
- Platforms that adopt principled logging designs can evaluate and deploy better recommenders faster, potentially increasing user welfare and competitive advantage. However, concentrated logging could skew data availability across items, affecting long‑tail content and creators—regulatory or platform fairness considerations may arise.
Practical policy and governance recommendations
- Avoid defaulting to uniform exploration for OPE; consider estimated reward signals (with shrinkage) to inform logging propensities.
- Use single‑parameter soft‑greedy families when engineering constraints prevent per‑context optimization; tune greediness based on sample budget and action space.
- When planning to compare many candidate policies, design logging to cover the relevant action/support space (second‑moment pseudo‑target principle).
- Monitor overlap to avoid bias and ensure identifiability of target evaluations.
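One way to act on the overlap recommendation is to run simple diagnostics before and after logging. The checks below are common importance-sampling diagnostics rather than procedures from the paper, and the function names are hypothetical.

```python
def overlap_violations(pi_t_x, pi_l_x):
    """Actions the target may take that the logging policy never plays."""
    return [a for a, (pt, pl) in enumerate(zip(pi_t_x, pi_l_x)) if pt > 0 and pl <= 0]

def max_importance_weight(pi_t_x, pi_l_x):
    """Largest pi_t / pi_l ratio over actions the logging policy can play."""
    return max(pt / pl for pt, pl in zip(pi_t_x, pi_l_x) if pl > 0)

def effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / sum w^2."""
    s1 = sum(weights)
    s2 = sum(w * w for w in weights)
    return s1 * s1 / s2 if s2 > 0 else 0.0

# Hypothetical propensities for one context.
pi_t = [0.2, 0.3, 0.5]
pi_l = [0.5, 0.5, 0.0]
missing = overlap_violations(pi_t, pi_l)       # action 2 is never logged
worst_ratio = max_importance_weight(pi_t, pi_l)
ess_even = effective_sample_size([1.0, 1.0, 1.0, 1.0])
ess_skewed = effective_sample_size([4.0, 0.0, 0.0, 0.0])
```

A non-empty `missing` list signals that the IPW estimate for this target will be biased, while a small effective sample size relative to the number of logged records signals that a few large importance weights dominate the estimate.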
Research and deployment caveats
- Extensions needed for multi‑slot, interference, nonstationarity, adaptive logging, and for integration with other estimators (DR, self‑normalized IPW). Economic analyses should incorporate long‑run effects on catalogs and creator incentives when concentration in logging favors high‑reward items.

Overall, the paper gives actionable, theoretically grounded rules for how firms should choose logging policies to measurably reduce OPE error and experimentation costs, with clear recommendations for both idealized and constrained practical settings.

Assessment

Paper Type: theoretical
Evidence Strength: low. The paper provides analytic derivations and theoretical optimal policies, likely supported by simulations; it does not present field experiments or observational validation on real-world production systems, so empirical external validity for actual firm settings is untested.
Methods Rigor: high. The contribution is primarily formal: it characterizes a clear reward-coverage tradeoff, derives optimal logging policies across well-specified informational regimes (known, unknown, partially known), and provides constructive solutions and design principles; this indicates strong theoretical and mathematical rigor, assuming proofs and derivations are complete.
Sample: No observational or experimental field sample; analysis uses analytic models of contextual bandits/recommender systems and synthetic simulation experiments across canonical reward distributions and target policies to illustrate theoretical results.
Themes: adoption, productivity
Generalizability:
- Assumes the reward-generating process and target policy fall into the modeled canonical regimes (known/unknown/prior-based); real environments may violate these assumptions.
- Relies on the ability to implement arbitrary logging probabilities; operational constraints (engineering, product, fairness, legal) may restrict feasible logging policies.
- Results are derived for stationary settings; non-stationarity or time-varying rewards may invalidate optimality.
- High-dimensional action or context spaces (large item catalogs) may limit computational tractability of the proposed optimal policies.
- Ignores strategic user or agent responses to logging policies and potential spillovers in live systems.
- Empirical performance is only demonstrated on synthetic/simulated data, not on diverse real-world datasets.

Claims (7)

Claim	Direction	Confidence	Outcome	Details
Off-policy evaluation (OPE) estimates the value of a target treatment policy (e.g., a recommender system) using data collected by a different logging policy, enabling high-stakes experimentation without live deployment. [Decision Quality]	positive	high	ability to estimate target policy value without live deployment (OPE capability)	0.12
In practice OPE accuracy depends heavily on the logging policy used to collect data for computing the estimate. [Decision Quality]	positive	high	OPE accuracy / OPE error	0.12
There is a fundamental reward-coverage tradeoff: concentrating probability mass on high-reward actions reduces variance but risks missing signal on actions the target policy may take. [Decision Quality]	mixed	high	variance of OPE estimators and coverage of actions relevant to the target policy	0.2
We propose a unifying framework for logging policy design and derive optimal policies in canonical informational regimes where the target policy and reward distribution are (i) known, (ii) unknown, and (iii) partially known through priors or noisy estimates at logging time. [Decision Quality]	positive	high	optimal logging policies for minimizing OPE error under different informational assumptions	0.2
Our results provide actionable guidance for firms choosing among multiple candidate recommendation systems. [Decision Quality]	positive	medium	guidance usefulness for selecting recommendation systems (improved selection via better logging policy design)	0.07
We demonstrate the importance of treatment selection when gathering data for OPE, and describe theoretically optimal approaches when this is a firm's primary objective. [Decision Quality]	positive	high	impact of treatment (logging) selection on OPE performance and derivation of optimal selection strategies	0.12
We distill practical design principles for selecting logging policies when operational constraints prevent implementing the theoretical optimum. [Decision Quality]	positive	high	practical guidance / design principles for logging policy selection under constraints	0.12