A dominance-based RLHF method cuts rare harmful outputs without sacrificing helpfulness. Using entropically regularized optimal transport to enforce stochastic dominance gives tunable, provable control over tail risk that can be aligned with firm or regulator risk preferences.
Safe Reinforcement Learning from Human Feedback (RLHF) typically enforces safety through expected cost constraints, but the expectation captures only a single statistic of the cost distribution and fails to account for distributional uncertainty, particularly under heavy tails or rare catastrophic events. This limitation is problematic when robustness and risk sensitivity are critical. Stochastic dominance offers a principled alternative by comparing entire cost distributions rather than just their averages, enabling direct control over tail risks and potential out-of-distribution failures that expectation-based constraints may overlook. In this work, we propose Risk-sensitive Alignment via Dominance (RAD), a novel alignment framework that replaces scalar expected cost constraints with First-Order Stochastic Dominance (FSD) constraints. We operationalize this constraint by comparing the target policy's cost distribution to that of a reference policy within an Optimal Transport (OT) framework, using entropic regularization and Sinkhorn iterations to obtain a differentiable and computationally efficient objective for stable end-to-end optimization. Furthermore, we introduce quantile-weighted FSD constraints and show that weighted FSD universally controls a broad class of Spectral Risk Measures (SRMs), so that improvements under weighted dominance imply guaranteed improvements in the corresponding spectral risk. This provides a principled mechanism for tuning a model's risk profile via the quantile weighting function. Empirical results demonstrate that RAD improves harmlessness over baselines while remaining competitive in helpfulness, and exhibits greater robustness on out-of-distribution harmlessness evaluations.
Summary
Main Finding
Replacing expected-cost constraints in RLHF with First-Order Stochastic Dominance (FSD) constraints yields a differentiable, computationally tractable alignment objective that better controls tail and out-of-distribution (OOD) risks. The method, RAD, implements the constraint as an entropically regularized Optimal Transport (OT) comparison of cost distributions. A quantile-weighted FSD variant provably controls broad classes of Spectral Risk Measures (SRMs), letting practitioners tune risk profiles. Empirically, RAD improves harmlessness and OOD robustness while remaining competitive on helpfulness.
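For orientation, the link between dominance and spectral risk can be written out in the unweighted case. The block below uses the standard textbook definitions for cost random variables (lower is better). The symbols $X$, $Y$, $F$, $Q$, $\phi$ and the notation $\preceq_{\mathrm{FSD}}$ are assumptions chosen here for illustration, and the paper's quantile-weighted definition is not reproduced in this summary.

```latex
% X = cost under the learned policy, Y = cost under the reference policy.
% F is a CDF and Q = F^{-1} is the corresponding quantile function.
X \preceq_{\mathrm{FSD}} Y
  \iff F_X(c) \ge F_Y(c) \quad \forall c
  \iff Q_X(u) \le Q_Y(u) \quad \forall u \in (0,1).

% Spectral risk measure with spectrum \phi \ge 0, nondecreasing,
% and \int_0^1 \phi(u)\,du = 1:
\rho_\phi(X) = \int_0^1 \phi(u)\, Q_X(u)\, du .

% Dominance implies no worse spectral risk for every admissible \phi:
\rho_\phi(Y) - \rho_\phi(X)
  = \int_0^1 \phi(u)\,\bigl(Q_Y(u) - Q_X(u)\bigr)\, du \;\ge\; 0 .
```

The last line holds because the integrand is a product of two nonnegative terms. It also hints at why a quantile-weighted comparison can be enough to control a given spectral risk, since only the $\phi$-weighted quantile gap enters the integral.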
Key Points
- Problem with expectations: Standard RLHF enforces safety via expected-cost constraints, which ignore distributional shape and can fail under heavy tails or rare catastrophic events.
- Stochastic dominance alternative: FSD compares whole cost distributions, directly constraining tails and offering stronger guarantees against high-cost (unsafe) outcomes.
- Operationalization (RAD): FSD constraints are implemented by comparing the learned policy’s rollout cost distribution to a reference policy’s distribution using Optimal Transport. Entropic regularization plus Sinkhorn iterations provide a differentiable, efficient objective suitable for end-to-end optimization.
- Quantile weighting and spectral risks: Quantile-weighted FSD lets one control Spectral Risk Measures (SRMs): an improvement under weighted FSD implies a guaranteed improvement in the associated SRM. The weighting function therefore acts as a principled knob for encoding risk aversion.
- Empirical trade-offs: RAD improves harmlessness and OOD robustness over baselines, with only modest or no loss in helpfulness in the reported experiments.
- Practical considerations: RAD requires estimating cost distributions and choosing a reference policy and quantile-weighting function; these choices shape conservatism and sample efficiency.
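To make the contrast between expected-cost and dominance constraints concrete, here is a minimal sketch. The cost samples, the budget of 0.3, and the helper names `satisfies_expected_cost` and `satisfies_fsd` are illustrative assumptions and are not taken from the paper. With equal sample sizes, an empirical FSD check for costs reduces to comparing sorted samples element by element.

```python
import statistics


def satisfies_expected_cost(costs, budget):
    """Scalar constraint: only the mean cost is checked."""
    return statistics.fmean(costs) <= budget


def satisfies_fsd(policy_costs, reference_costs):
    """Empirical FSD check for costs (lower is better).

    Assumes equal sample sizes, so both empirical quantile functions share
    breakpoints and can be compared through the sorted samples.
    """
    if len(policy_costs) != len(reference_costs):
        raise ValueError("this sketch assumes equal sample sizes")
    pairs = zip(sorted(policy_costs), sorted(reference_costs))
    return all(p <= r for p, r in pairs)


# Hypothetical cost samples (illustrative only).
reference = [0.3] * 100                 # reference policy: uniformly mild cost
steady = [0.2] * 100                    # candidate with no tail
heavy_tail = [0.0] * 98 + [10.0] * 2    # same mean as `steady`, rare large cost

for name, costs in [("steady", steady), ("heavy_tail", heavy_tail)]:
    print(name,
          "| mean ok:", satisfies_expected_cost(costs, budget=0.3),
          "| FSD ok:", satisfies_fsd(costs, reference))
```

Both candidates pass the mean check, but only `steady` passes the dominance check, because the two largest `heavy_tail` costs exceed every reference cost.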
Data & Methods
- Setting: Reinforcement Learning from Human Feedback (RLHF) where policies produce a distribution of costs (safety violations, harmful outputs) across rollouts.
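- Constraint replacement: Swap scalar expected-cost constraints for FSD constraints that require the learned policy’s cost distribution to be no worse than a reference policy’s distribution at every cost level. That is, for any cost threshold, the learned policy must be at least as likely as the reference to stay at or below it.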
- OT-based comparison: Use Optimal Transport to compare empirical cost distributions; apply entropic regularization to the OT objective and compute gradients via Sinkhorn iterations for efficient, differentiable optimization compatible with policy gradient / end-to-end training.
- Quantile-weighted FSD: Define weighted dominance by reweighting quantiles; prove that weighted-FSD dominance implies improvement across the corresponding class of Spectral Risk Measures, giving formal risk guarantees.
- Evaluation (as described): Compare RAD to baseline RLHF methods on helpfulness and harmlessness metrics, including OOD harmlessness evaluations, to assess robustness. (The paper reports RAD improves harmlessness and OOD robustness while remaining competitive on helpfulness.)
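The OT-based comparison described above can be sketched in a few lines of plain Python. This is not the paper's implementation: the one-sided ground cost `max(0, x - y)`, the value of `epsilon`, the iteration count, and all function names are assumptions made for illustration, since the summary does not specify them. The sketch runs Sinkhorn scaling on two small empirical cost samples and reports the transport cost as a dominance-violation score.

```python
import math


def hinge_cost_matrix(policy_costs, reference_costs):
    """Ground cost that charges only when a policy cost exceeds a reference cost.

    Assumption for this sketch: c(x, y) = max(0, x - y).
    """
    return [[max(0.0, x - y) for y in reference_costs] for x in policy_costs]


def sinkhorn_plan(cost, a, b, epsilon=0.1, n_iters=200):
    """Entropic OT plan between histograms a and b via Sinkhorn scaling."""
    n, m = len(a), len(b)
    kernel = [[math.exp(-cost[i][j] / epsilon) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def dominance_violation(policy_costs, reference_costs, epsilon=0.1):
    """Transport cost of the entropic plan under the one-sided ground cost."""
    cost = hinge_cost_matrix(policy_costs, reference_costs)
    a = [1.0 / len(policy_costs)] * len(policy_costs)
    b = [1.0 / len(reference_costs)] * len(reference_costs)
    plan = sinkhorn_plan(cost, a, b, epsilon)
    return sum(plan[i][j] * cost[i][j]
               for i in range(len(a)) for j in range(len(b)))


# Hypothetical cost samples (illustrative only).
reference = [0.1, 0.2, 0.3, 0.4]
dominated = [0.05, 0.15, 0.25, 0.35]   # every quantile below the reference
risky = [0.0, 0.0, 0.1, 0.9]           # low typical cost, one large outlier

print("dominated:", dominance_violation(dominated, reference))
print("risky:    ", dominance_violation(risky, reference))
```

Without regularization, the minimal transport cost under this one-sided ground cost is zero exactly when the policy's costs can be coupled with the reference's so that the policy cost never exceeds its match, which is the FSD condition. The entropic term blurs the plan, so the score stays small but positive even for the dominated sample. In an automatic-differentiation framework, the same loop could be unrolled to obtain gradients with respect to the policy costs.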
Implications for AI Economics
- Risk management and externalities: RAD provides a tractable way to reduce tail risk and rare catastrophic harms from deployed AI systems. This could in turn lower expected social costs, potential liabilities, and insurance premiums associated with high-impact failures, although the reported experiments do not measure these effects directly.
- Tunable risk preferences: The quantile-weighted FSD framework maps directly to economic notions of risk aversion (through SRMs), enabling alignment of AI behavior with stakeholder risk preferences (firms, regulators, users) and direct trade-offs between utility (helpfulness) and downside protection.
- Deployment and regulation: Methods that provide distributional/risk guarantees strengthen arguments for safer deployment standards and could inform regulatory requirements (e.g., minimum stochastic-dominance safety baselines relative to vetted reference policies).
- Market design and incentives: Firms may adopt dominance-based alignment to credibly signal lower tail-risk exposure, affecting competitive dynamics, liability exposure, and consumer trust. Conversely, conservative settings may reduce short-term product utility, affecting adoption and monetization.
- Cost/benefit and calibration: Implementing RAD introduces additional modeling and computation (distribution estimation, OT computations) and requires choosing reference policies and quantile weights. These are economic design choices that trade increased safety (reduced downside risk) against development cost and possible foregone utility.
- Research and policy priorities: From an economics perspective, quantifying how dominance-based safety reduces expected damages or tail-losses will be critical for cost–benefit analyses, insurance modeling, and setting regulatory thresholds that balance innovation with systemic risk mitigation.
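The idea of a tunable risk preference can be illustrated with a small empirical Spectral Risk Measure calculator. This is a sketch under stated assumptions: the function names, the three example spectra, and the cost sample are hypothetical and are not the paper's choices. Each spectrum is passed as its cumulative integral, so the weight on the i-th sorted cost is an increment of that function.

```python
import math


def spectral_risk(costs, cumulative_spectrum):
    """Empirical SRM: sorted costs weighted by increments of the spectrum's integral.

    `cumulative_spectrum(u)` is the integral of the spectrum from 0 to u,
    with cumulative_spectrum(0) = 0 and cumulative_spectrum(1) = 1.
    """
    ordered = sorted(costs)
    n = len(ordered)
    total = 0.0
    for i, c in enumerate(ordered, start=1):
        weight = cumulative_spectrum(i / n) - cumulative_spectrum((i - 1) / n)
        total += weight * c
    return total


def uniform_spectrum(u):
    """Risk-neutral spectrum: recovers the plain mean."""
    return u


def cvar_spectrum(alpha):
    """Spectrum of CVaR at level alpha: only the top (1 - alpha) tail counts."""
    return lambda u: max(0.0, u - alpha) / (1.0 - alpha)


def exponential_spectrum(k):
    """Exponential spectrum; larger k puts more weight on high quantiles."""
    return lambda u: (math.exp(-k * (1.0 - u)) - math.exp(-k)) / (1.0 - math.exp(-k))


# Hypothetical cost sample: mostly harmless, two rare high-cost outputs.
costs = [0.0] * 98 + [10.0] * 2

print("risk-neutral (mean):", spectral_risk(costs, uniform_spectrum))
print("exponential, k=5:   ", spectral_risk(costs, exponential_spectrum(5.0)))
print("CVaR at 0.95:       ", spectral_risk(costs, cvar_spectrum(0.95)))
```

On this sample, the more tail-focused the spectrum, the larger the reported risk, even though the mean is unchanged. A firm or regulator with a stronger aversion to rare failures would correspond to a spectrum that concentrates more weight near the top quantiles.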
Assessment
Claims (11)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| Standard RLHF expected-cost constraints ignore distributional shape and can fail under heavy tails or rare catastrophic events. [AI Safety and Ethics] | negative | high | safety cost distribution properties (tail probability of high-cost/unsafe rollouts) | 0.12 |
| First-Order Stochastic Dominance (FSD) constraints compare whole cost distributions and directly constrain tails, offering stronger guarantees against high-cost (unsafe) outcomes than expected-cost constraints. [AI Safety and Ethics] | positive | high | cost distribution (CDF/tails), probability mass in high-cost region | 0.12 |
| RAD operationalizes FSD by comparing the learned policy’s empirical rollout cost distribution to a reference policy’s distribution using Optimal Transport (OT) with entropic regularization and Sinkhorn iterations. [Other] | positive | high | computable alignment loss (OT-based distance), differentiability of training objective | 0.12 |
| Entropic regularization plus Sinkhorn iterations yields a differentiable, computationally tractable objective suitable for end-to-end optimization with policy gradient methods. [Other] | positive | medium | differentiability and computational tractability of the alignment objective (gradient availability, optimization compatibility) | 0.07 |
| Introducing quantile-weighted FSD (weighted-FSD) provably controls broad classes of Spectral Risk Measures (SRMs): improving weighted-FSD implies guaranteed improvements in the associated SRM. [AI Safety and Ethics] | positive | high | Spectral Risk Measures (SRMs) computed from cost distributions | 0.12 |
| Weighted-FSD provides a tunable knob to encode risk aversion/preferences by selecting quantile-weighting functions. [AI Safety and Ethics] | positive | high | risk profile as measured by SRMs or weighted quantile-based metrics | 0.12 |
| Empirically, RAD improves harmlessness relative to baseline RLHF methods. [AI Safety and Ethics] | positive | medium | harmlessness metric(s) (e.g., rate of safety violations / harmful outputs) | 0.07 |
| Empirically, RAD improves out-of-distribution (OOD) robustness (OOD harmlessness) compared to baselines. [AI Safety and Ethics] | positive | medium | OOD harmlessness / robustness (safety under OOD prompts or distribution shifts) | 0.07 |
| RAD remains competitive on helpfulness, incurring only modest or no loss in helpfulness in the reported experiments. [Output Quality] | mixed | medium | helpfulness metric(s) (task performance, reward, human preference scores) | 0.07 |
| RAD requires estimating cost distributions and choosing a reference policy and quantile-weighting function; these choices determine the method's conservatism and sample efficiency. [Training Effectiveness] | mixed | high | method conservatism (relative safety level) and sample efficiency (amount of data needed to estimate cost distributions) | 0.12 |
| By better controlling tail risk and rare catastrophic harms, RAD can reduce expected social costs, liability exposure, and insurance premiums associated with high-impact AI failures. [Social Protection] | positive | speculative | expected social costs / liability exposure / insurance-related risk metrics (not directly measured in reported experiments) | 0.01 |