Optimizing for average reward can hide persistent failure: when environment dynamics are non-ergodic, a policy that looks optimal in expectation can lock a deployed agent into long-run low-reward regimes, so evaluators and regulators should prioritize time-average or distributional guarantees over ensemble means.
In reinforcement learning, we typically aim to optimize the expected value of the sum of rewards an agent collects over a trajectory. However, if the process generating these rewards is non-ergodic, the expected value, i.e., the average over infinitely many trajectories with a given policy, is uninformative for the average over a single, but infinitely long trajectory. Thus, if we care about how the individual agent performs during deployment, the expected value is not a good optimization objective. In this paper, we discuss the impact of non-ergodic reward processes on reinforcement learning agents through an instructive example, relate the notion of ergodic reward processes to more widely used notions of ergodic Markov chains, and present existing solutions that optimize long-term performance of individual trajectories under non-ergodic reward dynamics.
Summary
Main Finding
Optimizing the expected cumulative reward (ensemble average across trajectories) can be misleading when reward-generating dynamics are non-ergodic: the ensemble expectation does not generally equal the time-average experienced by a single deployed agent. The paper demonstrates this gap with an instructive example, formalizes the relation between non-ergodic reward processes and standard notions of ergodicity for Markov chains, and surveys solution approaches that instead target long-run, single-trajectory performance.
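The two averages being contrasted can be written side by side. The notation below is assumed here for illustration and is not necessarily the paper's: $r_t^{(i)}$ is the reward at step $t$ on trajectory $i$ under a fixed policy.

```latex
% Assumed notation: r_t^{(i)} is the reward at step t on trajectory i under a fixed policy.
\begin{align*}
\langle r_t \rangle &= \lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} r_t^{(i)}
  && \text{(ensemble average at time } t\text{)} \\
\bar{r} &= \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} r_t
  && \text{(time average along one trajectory)}
\end{align*}
```

For a stationary reward process, ergodicity in this sense means $\bar{r} = \langle r_t \rangle$ for almost every trajectory. When the process is non-ergodic, $\bar{r}$ can depend on the realized path, so the ensemble average says little about what a single deployed agent experiences.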
Key Points
- Problem statement
  - Standard RL objective: maximize expected sum (or discounted sum) of rewards over trajectories.
  - In non-ergodic environments, averaging over many trajectories (ensemble average) can differ dramatically from the time-average reward observed along a single, long trajectory.
  - If deployment value is the time-average for one agent, the usual expected-value objective can lead to poor real-world outcomes.
- Ergodicity and its consequences
  - Ergodic (reward) process: time averages along almost every long trajectory converge to the same value as the ensemble average.
  - Non-ergodic processes admit path-dependent long-run behavior (e.g., absorbing sets, multiple invariant measures, path-dependent reinforcement), so different runs with the same policy can have different long-run averages.
  - Standard Markov chain ergodicity conditions (irreducibility, positive recurrence, aperiodicity) imply ergodic reward processes when rewards depend only on the chain state; when those properties are missing, the reward process can be non-ergodic.
- Illustrative example (paper-provided)
  - The paper uses an instructive example to show how a policy that maximizes expected reward can produce trajectories that lock into high- or low-reward regimes, so an agent’s long-term realized reward is highly uncertain and not captured by the expectation.
- Existing solution approaches surveyed
  - Risk-sensitive and utility-based objectives: maximize expected utility (e.g., log-utility) or minimize downside risk so policies prefer more reliable time-average outcomes.
  - Distributional RL: model the whole return distribution, enabling objectives such as the median, lower quantiles, or CVaR, which better reflect single-run guarantees.
  - Almost-sure and probabilistic constraints: enforce that long-run performance exceeds thresholds with high probability (chance constraints, safe RL).
  - Ergodic control and sample-path optimality: frame control objectives in terms of time averages or sample-path criteria rather than ensemble expectations.
  - Robust/adversarial and model-uncertainty methods: hedge against trajectories that lead to poor long-run behavior.
  - Structural fixes: modify the environment or policy so the induced chain becomes ergodic (e.g., ensuring mixing/recurrence, avoiding absorbing bad states).
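The gap between the two averages, and why a log-utility objective closes it, can be sketched with a multiplicative coin-toss gamble. This gamble is a stand-in chosen for illustration, not confirmed to be the paper's example; the multipliers `UP` and `DOWN` and all function names are assumptions.

```python
import math
import random
import statistics

# Assumed gamble: wealth is multiplied by UP or DOWN with probability 1/2 each.
UP, DOWN = 1.5, 0.6


def simulate_final_wealth(steps, rng):
    """Final wealth of one trajectory that starts at 1.0 and always plays."""
    wealth = 1.0
    for _ in range(steps):
        wealth *= UP if rng.random() < 0.5 else DOWN
    return wealth


def ensemble_growth_factor():
    """Per-step growth factor of expected wealth (the ensemble average)."""
    return 0.5 * UP + 0.5 * DOWN


def time_average_growth_factor():
    """Per-step growth factor along almost every single long trajectory.

    This is exp(E[log multiplier]), the quantity a log-utility agent maximizes.
    """
    return math.exp(0.5 * math.log(UP) + 0.5 * math.log(DOWN))


if __name__ == "__main__":
    rng = random.Random(0)
    finals = [simulate_final_wealth(200, rng) for _ in range(2000)]
    print("ensemble growth factor:", ensemble_growth_factor())
    print("time-average growth factor:", time_average_growth_factor())
    print("median final wealth:", statistics.median(finals))
    print("fraction ending below start:", sum(w < 1.0 for w in finals) / len(finals))
```

Under these assumed multipliers the ensemble growth factor is 1.05, so an expected-value maximizer accepts the gamble every step, while the time-average growth factor is about 0.949, so a typical single trajectory shrinks. An agent maximizing expected log-wealth would decline the gamble, which is the sense in which utility-based objectives favor reliable time-average outcomes.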
Data & Methods
- Analysis type: theoretical exposition plus an illustrative example rather than large-scale empirical experiments.
- Methods used
  - Constructive example demonstrating divergence of ensemble vs. time averages under a simple non-ergodic reward process.
  - Formal discussion mapping reward-process ergodicity to ergodicity concepts for Markov chains (invariant measures, recurrence, irreducibility, aperiodicity).
  - Survey of existing algorithmic and theoretical approaches from the RL literature that address long-run, single-trajectory objectives (risk-sensitive RL, distributional RL, ergodic control, chance-constrained and robust methods).
- Evidence: reasoning and example-driven argumentation showing how standard objective mismatch can produce undesirable deployment outcomes; pointers to existing solution literature rather than new large-scale empirical validation.
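The link between Markov chain structure and reward ergodicity can be sketched with a small three-state chain. The chain, its rewards, and the `escape_prob` parameter are hypothetical and chosen only for illustration. With `escape_prob=0` the states `good` and `bad` are absorbing, so there are two recurrent classes; with `escape_prob>0` they form a single recurrent class.

```python
import random

# Hypothetical per-state rewards for the illustration.
REWARD = {"start": 0.0, "good": 1.0, "bad": 0.0}


def step(state, rng, escape_prob):
    """One transition. escape_prob=0 makes 'good' and 'bad' absorbing."""
    if state == "start":
        return "good" if rng.random() < 0.5 else "bad"
    if rng.random() < escape_prob:
        return "bad" if state == "good" else "good"
    return state


def time_average_reward(steps, rng, escape_prob):
    """Average reward collected along one trajectory of the given length."""
    state, total = "start", 0.0
    for _ in range(steps):
        state = step(state, rng, escape_prob)
        total += REWARD[state]
    return total / steps


def run_many(n_runs, steps, escape_prob, seed=0):
    """Time-average reward of n_runs independent trajectories."""
    rng = random.Random(seed)
    return [time_average_reward(steps, rng, escape_prob) for _ in range(n_runs)]


if __name__ == "__main__":
    locked = run_many(n_runs=100, steps=200, escape_prob=0.0)
    mixing = run_many(n_runs=100, steps=5000, escape_prob=0.1)
    print("absorbing chain, distinct time averages:", sorted(set(locked)))
    print("mixing chain, min/max time average:", min(mixing), max(mixing))
```

In the absorbing case every trajectory's time average is exactly 0 or 1 even though the ensemble average is near 0.5. In the mixing case all trajectories cluster around 0.5, so the ensemble average is informative about a single run.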
Implications for AI Economics
- Evaluation and deployment risk
  - Economic value in deployment is often the realized time-average for an individual agent (e.g., firm profits, user engagement over time). Using ensemble expectations for policy selection can misstate economic value and risk.
  - Policies optimized for expectation may hide high variance and tail risks that matter for stakeholders; regulators and decision-makers should prefer objectives that reflect single-run guarantees when relevant.
- Incentive design and contracts
  - Contracts, incentives, and compensation that rely on expected performance can incentivize strategies that deliver high expected returns but poor or unreliable time-average outcomes. Incentive design should account for path-dependent risks and prefer mechanisms that reward reliable, long-run performance.
- Investment and adoption decisions
  - Firms deciding whether to deploy an AI system should assess time-average performance distributions (e.g., median, worst-case long-run outcomes), not just expected returns, particularly when non-ergodic dynamics (locking-in, absorbing bad states) are possible.
- Policy and regulation
  - Regulators concerned with systemic harms or consumer protection may require guarantees on long-run performance (probabilistic or almost-sure bounds) rather than average-case metrics.
  - Ensuring environments/policies meet ergodicity-like conditions (or explicitly addressing non-ergodicity) can be a design target for safer, more predictable AI deployment.
- Research agenda for AI economics
  - Develop evaluation metrics and benchmarks oriented to time-average and sample-path guarantees.
  - Study market and strategic interactions when agents optimize different objectives (expectation vs. time-average) and how that affects welfare and systemic risk.
  - Incorporate non-ergodicity-aware objectives into economic models of AI adoption, investment under uncertainty, and regulatory design.
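One way an evaluator might report single-run performance alongside the mean is sketched below. It assumes the input is a list of per-trajectory time-average rewards, one per independent deployment run; the function names and the tail level `alpha` are hypothetical choices, and the quantile and CVaR use simple empirical definitions.

```python
import math
import statistics


def lower_quantile(values, alpha):
    """Empirical alpha-quantile: the ceil(alpha * n)-th smallest value."""
    ordered = sorted(values)
    k = max(1, math.ceil(alpha * len(ordered)))
    return ordered[k - 1]


def cvar_lower(values, alpha):
    """Lower-tail CVaR: mean of the worst ceil(alpha * n) outcomes."""
    ordered = sorted(values)
    k = max(1, math.ceil(alpha * len(ordered)))
    return sum(ordered[:k]) / k


def summarize(time_averages, alpha=0.1):
    """Mean plus distributional summaries of per-run time-average rewards."""
    return {
        "mean": statistics.fmean(time_averages),
        "median": statistics.median(time_averages),
        "quantile": lower_quantile(time_averages, alpha),
        "cvar": cvar_lower(time_averages, alpha),
    }


if __name__ == "__main__":
    # Made-up numbers: a policy that locks in high or low, and a steady one.
    risky = [1.0] * 5 + [0.0] * 5
    steady = [0.45] * 10
    print("risky :", summarize(risky))
    print("steady:", summarize(steady))
```

With these made-up inputs the lock-in policy has the higher mean (0.5 vs. 0.45) but a lower-tail CVaR of 0, so ranking by mean and ranking by tail performance disagree. That disagreement is the case the implications above are concerned with.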
Assessment
Claims (17)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| Optimizing the expected cumulative reward (ensemble average across trajectories) can be misleading when reward-generating dynamics are non-ergodic because the ensemble expectation does not generally equal the time-average experienced by a single deployed agent. (Decision Quality) | negative | high | expected cumulative reward (ensemble expectation) vs. time-average realized reward on a single long trajectory | 0.02 |
| If deployment value is the time-average for one agent, optimizing the usual expected-value objective can lead to poor real-world outcomes. (Decision Quality) | negative | high | realized long-run (time-average) reward of deployed agent | 0.02 |
| Ergodic reward processes are those where time averages along almost every long trajectory converge to the same value as the ensemble average. (Decision Quality) | mixed | high | convergence of time-average reward to ensemble average | 0.02 |
| Non-ergodic processes admit path-dependent long-run behavior (e.g., absorbing sets, multiple invariant measures, path-dependent reinforcement), so different runs with the same policy can have different long-run averages. (Decision Quality) | mixed | high | variance across realized long-run average rewards across trajectories under the same policy | 0.02 |
| Standard Markov chain ergodicity conditions (irreducibility, positive recurrence, aperiodicity) imply ergodic reward processes when rewards depend only on the chain state. (Decision Quality) | mixed | high | ergodicity of reward process (equivalence to chain ergodicity when rewards are state-dependent) | 0.02 |
| Absence of irreducibility, positive recurrence, or aperiodicity in the state dynamics can produce non-ergodic reward behavior. (Decision Quality) | mixed | high | presence of non-ergodic long-run reward behavior (e.g., multiple invariant measures, absorbing states) | 0.02 |
| The paper's illustrative example shows a policy that maximizes expected reward can produce trajectories that lock into high- or low-reward regimes so an agent’s long-term realized reward is highly uncertain and not captured by the expectation. (Decision Quality) | negative | medium | distribution (uncertainty) of long-term realized reward across individual trajectories for a policy that maximizes expected reward | 0.01 |
| Risk-sensitive and utility-based objectives (e.g., maximize expected utility such as log-utility or minimize downside risk) can produce policies that prefer more reliable time-average outcomes compared to raw expected-reward objectives. (Decision Quality) | positive | medium | time-average reliability or downside risk of realized reward under risk-sensitive/utility-based policies | 0.01 |
| Distributional reinforcement learning (optimizing the full return distribution) enables optimizing objectives such as median, lower quantiles, or CVaR which better reflect single-run guarantees. (Decision Quality) | positive | medium | statistics of the return distribution (median, quantiles, CVaR) relevant to single-run guarantees | 0.01 |
| Almost-sure and probabilistic constraint methods (chance constraints, safe RL) can enforce that long-run performance exceeds thresholds with high probability, addressing single-trajectory guarantees. (Decision Quality) | positive | medium | probability that long-run/time-average performance exceeds a threshold (chance constraint satisfaction) | 0.01 |
| Ergodic control and sample-path optimality formulations recast control objectives in terms of time averages or almost-sure sample-path criteria rather than ensemble expectations and are therefore appropriate for single-trajectory performance targets. (Decision Quality) | positive | medium | time-average/sample-path optimality of control policies | 0.01 |
| Robust/adversarial and model-uncertainty methods can hedge against trajectories that lead to poor long-run behavior and thus mitigate risks from non-ergodic dynamics. (Decision Quality) | positive | medium | worst-case or adversarial long-run reward under uncertainty | 0.01 |
| Structural fixes (altering environment design or policy class to ensure the induced Markov chain is ergodic, e.g., ensuring mixing/recurrence or preventing absorbing bad states) can eliminate the ensemble/time-average gap. (Decision Quality) | positive | medium | ergodicity of induced dynamics and resulting alignment of ensemble and time-average rewards | 0.01 |
| The paper does not present large-scale empirical validation; its evidence is primarily theoretical exposition, a constructed illustrative example, and a literature survey. (Research Productivity) | null_result | high | presence/absence of empirical experiments or sample-based validation | 0.02 |
| Economic evaluations and deployment decisions that rely on ensemble expectations can misstate economic value and risk because firms and users experience single time-averaged trajectories; regulators and decision-makers should therefore prefer objectives reflecting single-run guarantees when relevant. (Governance And Regulation) | negative | medium | accuracy of economic valuation and risk assessment when using ensemble expectation vs. time-average metrics | 0.01 |
| Contracts and incentives based on expected performance can incentivize strategies that deliver high expected returns but poor or unreliable time-average outcomes; incentive design should account for path-dependent risks. (Governance And Regulation) | negative | medium | alignment/misalignment of incentives with reliable long-run (time-average) performance | 0.01 |
| Research agenda recommendations: develop evaluation metrics and benchmarks oriented to time-average and sample-path guarantees; study market/strategic interactions when agents optimize different objectives; incorporate non-ergodicity-aware objectives into economic models of AI adoption and regulation. (Research Productivity) | positive | speculative | future research outputs (metrics, benchmarks, models) and their relevance to time-average guarantees | 0.0 |