The Commonplace

A dominance-based RLHF method cuts rare harmful outputs without sacrificing helpfulness. Using entropically regularized optimal transport to enforce stochastic dominance gives tunable, provable control over tail risk that can be aligned with firm or regulator risk preferences.

Safe RLHF Beyond Expectation: Stochastic Dominance for Universal Spectral Risk Control
Yaswanth Chittepu, Ativ Joshi, Rajarshi Bhattacharjee, Scott Niekum · March 11, 2026
arXiv · theoretical · medium evidence · 7/10 relevance
Replacing expected-cost constraints in RLHF with first-order stochastic dominance enforced via entropically regularized optimal transport (RAD) yields a differentiable alignment objective that provably improves broad classes of risk measures and empirically reduces harmful and OOD failures while remaining competitive on helpfulness.

Safe Reinforcement Learning from Human Feedback (RLHF) typically enforces safety through expected cost constraints, but the expectation captures only a single statistic of the cost distribution and fails to account for distributional uncertainty, particularly under heavy tails or rare catastrophic events. This limitation is problematic when robustness and risk sensitivity are critical. Stochastic dominance offers a principled alternative by comparing entire cost distributions rather than just their averages, enabling direct control over tail risks and potential out-of-distribution failures that expectation-based constraints may overlook. In this work, we propose Risk-sensitive Alignment via Dominance (RAD), a novel alignment framework that replaces scalar expected cost constraints with First-Order Stochastic Dominance (FSD) constraints. We operationalize this constraint by comparing the target policy's cost distribution to that of a reference policy within an Optimal Transport (OT) framework, using entropic regularization and Sinkhorn iterations to obtain a differentiable and computationally efficient objective for stable end-to-end optimization. Furthermore, we introduce quantile-weighted FSD constraints and show that weighted FSD universally controls a broad class of Spectral Risk Measures (SRMs), so that improvements under weighted dominance imply guaranteed improvements in the corresponding spectral risk. This provides a principled mechanism for tuning a model's risk profile via the quantile weighting function. Empirical results demonstrate that RAD improves harmlessness over baselines while remaining competitive in helpfulness, and exhibits greater robustness on out-of-distribution harmlessness evaluations.

Summary

Main Finding

Replacing expected-cost constraints in RLHF with First-Order Stochastic Dominance (FSD) constraints — implemented via an entropically regularized Optimal Transport (OT) comparison of cost distributions (the RAD method) — yields a differentiable, computationally tractable alignment objective that better controls tail and out-of-distribution (OOD) risks. A quantile-weighted FSD variant provably controls broad classes of Spectral Risk Measures (SRMs), letting practitioners tune risk profiles; empirically RAD improves harmlessness and OOD robustness while remaining competitive on helpfulness.
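
For reference, the standard textbook definitions behind these terms can be written as follows. This is a sketch in generic notation, not the paper's own: the symbols C_\pi, C_ref, F^{-1}, and \phi are assumptions introduced here, and the paper's exact quantile-weighted constraint is not reproduced.

```latex
% Costs: lower is better. C_\pi is the learned policy's cost, C_ref the reference's.
% First-order stochastic dominance for costs, written on quantile functions:
F_{C_\pi}^{-1}(u) \;\le\; F_{C_{\mathrm{ref}}}^{-1}(u)
  \quad \text{for all } u \in (0,1).

% Spectral risk measure with spectrum \phi \ge 0, nondecreasing,
% and \int_0^1 \phi(u)\,du = 1:
\rho_\phi(C) \;=\; \int_0^1 \phi(u)\, F_C^{-1}(u)\, du.

% Special cases: \phi \equiv 1 gives E[C];
% \phi(u) = \mathbf{1}\{u \ge \alpha\} / (1 - \alpha) gives CVaR_\alpha.

% Because \phi \ge 0, the quantile ordering above implies
\rho_\phi(C_\pi) \;\le\; \rho_\phi(C_{\mathrm{ref}})
  \quad \text{for every such } \phi.
```

The expectation is the flat-spectrum member of this family, which is why an expected-cost constraint controls only one risk measure while a dominance constraint controls all of them at once.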

Key Points

  • Problem with expectations: Standard RLHF enforces safety via expected-cost constraints, which ignore distributional shape and can fail under heavy tails or rare catastrophic events.
  • Stochastic dominance alternative: FSD compares whole cost distributions, directly constraining tails and offering stronger guarantees against high-cost (unsafe) outcomes.
  • Operationalization (RAD): FSD constraints are implemented by comparing the learned policy’s rollout cost distribution to a reference policy’s distribution using Optimal Transport. Entropic regularization plus Sinkhorn iterations provide a differentiable, efficient objective suitable for end-to-end optimization.
  • Quantile weighting and spectral risks: Quantile-weighted FSD lets one control Spectral Risk Measures (SRMs), since an improvement under weighted dominance guarantees an improvement in the associated SRM. The quantile-weighting function therefore serves as a principled knob for encoding risk preferences.
  • Empirical trade-offs: RAD improves harmlessness and OOD robustness over baselines, with only modest or no loss in helpfulness in the reported experiments.
  • Practical considerations: RAD requires estimating cost distributions and choosing a reference policy and quantile-weighting function; these choices shape conservatism and sample efficiency.
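
The first two points above can be illustrated with a small numerical sketch. It is not taken from the paper: the cost samples are made up, and the helper names `fsd_no_worse` and `empirical_cvar` are hypothetical. It shows two cost samples with the same mean, one of which hides a rare high-cost outcome that an empirical dominance check and a tail statistic (CVaR) both detect.

```python
import numpy as np

def fsd_no_worse(cost_policy, cost_ref, tol=1e-12):
    """Empirical first-order dominance check for costs (lower is better).

    Returns True when every sorted policy cost is at most the matching sorted
    reference cost. Assumption of this sketch: equal sample sizes, so sorted
    samples can be compared quantile by quantile.
    """
    p = np.sort(np.asarray(cost_policy, dtype=float))
    r = np.sort(np.asarray(cost_ref, dtype=float))
    if p.shape != r.shape:
        raise ValueError("this sketch assumes equal sample sizes")
    return bool(np.all(p <= r + tol))

def empirical_cvar(costs, alpha):
    """Mean of the worst (1 - alpha) fraction of the sampled costs."""
    c = np.sort(np.asarray(costs, dtype=float))
    k = max(1, int(round((1.0 - alpha) * c.size)))  # number of tail samples
    return float(c[-k:].mean())

# Made-up cost samples, for illustration only.
ref = np.ones(10)                             # reference: always cost 1
heavy_tail = np.array([0.0] * 9 + [10.0])     # same mean, one rare large cost
safer = np.array([0.0] * 5 + [1.0] * 5)       # never worse than the reference

print("means:", ref.mean(), heavy_tail.mean())
print("CVaR@0.9:", empirical_cvar(ref, 0.9), empirical_cvar(heavy_tail, 0.9))
print("heavy_tail no worse than ref:", fsd_no_worse(heavy_tail, ref))
print("safer no worse than ref:", fsd_no_worse(safer, ref))
```

Here `heavy_tail` satisfies an expected-cost constraint at the reference level (both means are 1.0) yet fails the dominance check, because its worst outcome costs 10 against the reference's 1.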

Data & Methods

  • Setting: Reinforcement Learning from Human Feedback (RLHF) where policies produce a distribution of costs (safety violations, harmful outputs) across rollouts.
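  • Constraint replacement: Swap scalar expected-cost constraints for FSD constraints that require the learned policy’s cost distribution to be no worse than a reference policy’s distribution at every quantile. Because lower cost is better, each quantile of the learned policy’s cost must be at most the corresponding reference quantile.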
  • OT-based comparison: Use Optimal Transport to compare empirical cost distributions; apply entropic regularization to the OT objective and compute gradients via Sinkhorn iterations for efficient, differentiable optimization compatible with policy gradient / end-to-end training.
  • Quantile-weighted FSD: Define weighted dominance by reweighting quantiles; prove that weighted-FSD dominance implies improvement across the corresponding class of Spectral Risk Measures, giving formal risk guarantees.
  • Evaluation (as described): Compare RAD to baseline RLHF methods on helpfulness and harmlessness metrics, including OOD harmlessness evaluations, to assess robustness. (The paper reports RAD improves harmlessness and OOD robustness while remaining competitive on helpfulness.)
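
The OT-based comparison above can be sketched with plain Sinkhorn iterations in NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the one-sided ground cost `max(x - y, 0)`, the function names, and the values of `eps` and `n_iters` are all choices made here. The one-sided cost is used because its unregularized transport value is zero exactly when the policy's costs are stochastically no larger than the reference's.

```python
import numpy as np

def sinkhorn_plan(a, b, M, eps=0.1, n_iters=500):
    """Entropic-OT transport plan between weight vectors a and b.

    M is the ground-cost matrix and eps the regularization strength. Very
    small eps can underflow exp(-M / eps); a log-domain variant would be
    needed in that regime.
    """
    K = np.exp(-M / eps)              # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):          # alternate marginal-matching updates
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

def dominance_violation(cost_policy, cost_ref, eps=0.1, n_iters=500):
    """Transport cost under a hypothetical one-sided ground cost.

    A pairing (x, y) is charged only when the policy cost x exceeds the
    reference cost y, so the value is near zero when the policy is no worse.
    """
    x = np.asarray(cost_policy, dtype=float)
    y = np.asarray(cost_ref, dtype=float)
    a = np.full(x.size, 1.0 / x.size)  # uniform weights on policy samples
    b = np.full(y.size, 1.0 / y.size)  # uniform weights on reference samples
    M = np.maximum(x[:, None] - y[None, :], 0.0)
    P = sinkhorn_plan(a, b, M, eps, n_iters)
    return float(np.sum(P * M)), P

# Made-up samples: one policy below the reference, one above it.
ref_costs = np.array([1.0, 1.1, 1.2])
safe_value, _ = dominance_violation(np.array([0.0, 0.1, 0.2]), ref_costs)
unsafe_value, plan = dominance_violation(np.array([2.0, 2.1, 2.2]), ref_costs)
print("safe policy penalty:", safe_value)
print("unsafe policy penalty:", unsafe_value)
```

Every step is a differentiable array operation, so the same loop written in an autodiff framework (for example JAX or PyTorch) can be unrolled to obtain gradients of the penalty with respect to the policy's costs. Whether RAD differentiates through the iterations in this way is not stated in this summary.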

Implications for AI Economics

  • Risk management and externalities: RAD provides a tractable way to reduce tail risk and rare catastrophic harms from deployed AI systems, which could lower expected social costs, potential liabilities, and insurance premiums associated with high-impact failures, although the paper does not measure these outcomes.
  • Tunable risk preferences: The quantile-weighted FSD framework maps directly to economic notions of risk aversion (through SRMs), enabling alignment of AI behavior with stakeholder risk preferences (firms, regulators, users) and direct trade-offs between utility (helpfulness) and downside protection.
  • Deployment and regulation: Methods that provide distributional/risk guarantees strengthen arguments for safer deployment standards and could inform regulatory requirements (e.g., minimum stochastic-dominance safety baselines relative to vetted reference policies).
  • Market design and incentives: Firms may adopt dominance-based alignment to credibly signal lower tail-risk exposure, affecting competitive dynamics, liability exposure, and consumer trust. Conversely, conservative settings may reduce short-term product utility, affecting adoption and monetization.
  • Cost/benefit and calibration: Implementing RAD introduces additional modeling and computation (distribution estimation, OT computations), and requires choosing reference policies and quantile weights — representing economic design choices that trade increased safety (reduced downside risk) against development cost and possible foregone utility.
  • Research and policy priorities: From an economics perspective, quantifying how dominance-based safety reduces expected damages or tail-losses will be critical for cost–benefit analyses, insurance modeling, and setting regulatory thresholds that balance innovation with systemic risk mitigation.

Assessment

Paper Type: theoretical

Evidence Strength: medium. The paper provides formal theoretical guarantees (weighted FSD implies improvements in broad classes of Spectral Risk Measures) and empirical evaluations showing improved harmlessness and OOD robustness; however, empirical evidence is limited to simulated/benchmark RLHF experiments, lacks real-world deployment validation or economic outcome measurements, and depends on choices (reference policy, cost estimator) that could materially affect results.

Methods Rigor: high. The work develops a principled replacement for expectation-based constraints, proves formal connections between quantile-weighted stochastic dominance and classes of risk measures, and operationalizes the idea through entropically regularized Optimal Transport with differentiable Sinkhorn-based gradients; empirical comparisons to RLHF baselines are reported. Remaining rigor concerns are about finite-sample/statistical estimation error of cost distributions and sensitivity to reference-policy and weighting choices, which the paper acknowledges but does not fully characterize experimentally.

Sample: RLHF setting using policy rollouts to produce empirical cost distributions (safety/harm metrics) for the learned and reference policies; experiments compare RAD to baseline RLHF methods on helpfulness, harmlessness, and out-of-distribution harmlessness benchmarks. (Paper summary does not specify exact dataset names, model sizes, or scale of rollouts in this brief.)

Themes: governance, human_ai_collab

Generalizability:
  • Relies on availability and quality of a scalar cost signal (safety metric); not all harms are easily quantified.
  • Performance depends on the chosen reference policy and quantile-weighting function; different choices change conservatism and sample efficiency.
  • Empirical tests use benchmark/controlled settings; results may not transfer to complex, real-world deployments and rare catastrophic events.
  • Computational overhead from OT/Sinkhorn and distribution estimation may limit applicability to very large models or low-latency systems.
  • Provable SRM guarantees hinge on accurate estimation of cost distributions; finite-sample noise can weaken guarantees in practice.

Claims (11)

Claim | Direction | Confidence | Outcome | Details
Standard RLHF expected-cost constraints ignore distributional shape and can fail under heavy tails or rare catastrophic events. [Ai Safety And Ethics] | negative | high | safety cost distribution properties (tail probability of high-cost/unsafe rollouts) | 0.12
First-Order Stochastic Dominance (FSD) constraints compare whole cost distributions and directly constrain tails, offering stronger guarantees against high-cost (unsafe) outcomes than expected-cost constraints. [Ai Safety And Ethics] | positive | high | cost distribution (CDF/tails), probability mass in high-cost region | 0.12
RAD operationalizes FSD by comparing the learned policy’s empirical rollout cost distribution to a reference policy’s distribution using Optimal Transport (OT) with entropic regularization and Sinkhorn iterations. [Other] | positive | high | computable alignment loss (OT-based distance), differentiability of training objective | 0.12
Entropic regularization plus Sinkhorn iterations yields a differentiable, computationally tractable objective suitable for end-to-end optimization with policy gradient methods. [Other] | positive | medium | differentiability and computational tractability of the alignment objective (gradient availability, optimization compatibility) | 0.07
Introducing quantile-weighted FSD (weighted-FSD) provably controls broad classes of Spectral Risk Measures (SRMs): improving weighted-FSD implies guaranteed improvements in the associated SRM. [Ai Safety And Ethics] | positive | high | Spectral Risk Measures (SRMs) computed from cost distributions | 0.12
Weighted-FSD provides a tunable knob to encode risk aversion/preferences by selecting quantile-weighting functions. [Ai Safety And Ethics] | positive | high | risk profile as measured by SRMs or weighted quantile-based metrics | 0.12
Empirically, RAD improves harmlessness relative to baseline RLHF methods. [Ai Safety And Ethics] | positive | medium | harmlessness metric(s) (e.g., rate of safety violations / harmful outputs) | 0.07
Empirically, RAD improves out-of-distribution (OOD) robustness (OOD harmlessness) compared to baselines. [Ai Safety And Ethics] | positive | medium | OOD harmlessness / robustness (safety under OOD prompts or distribution shifts) | 0.07
RAD remains competitive on helpfulness, incurring only modest or no loss in helpfulness in the reported experiments. [Output Quality] | mixed | medium | helpfulness metric(s) (task performance, reward, human preference scores) | 0.07
RAD requires estimating cost distributions and choosing a reference policy and quantile-weighting function; these choices determine the method's conservatism and sample efficiency. [Training Effectiveness] | mixed | high | method conservatism (relative safety level) and sample efficiency (amount of data needed to estimate cost distributions) | 0.12
By better controlling tail risk and rare catastrophic harms, RAD can reduce expected social costs, liability exposure, and insurance premiums associated with high-impact AI failures. [Social Protection] | positive | speculative | expected social costs / liability exposure / insurance-related risk metrics (not directly measured in reported experiments) | 0.01
