Optimizing for average reward can hide persistent failure: when environment dynamics are non-ergodic, a policy that looks optimal in expectation can lock a deployed agent into long-run low-reward regimes, so evaluators and regulators should prioritize time-average or distributional guarantees over ensemble means.

Ergodicity in reinforcement learning

Dominik Baumann, Erfaun Noorani, Arsenii Mustafin, Xinyi Sheng, Bert Verbruggen, Arne Vanhoyweghen, Vincent Ginis, Thomas B. Schön · March 11, 2026

Source: arXiv · Type: theoretical · Evidence: n/a · Relevance: 8/10

Maximizing expected cumulative reward can mislead in non-ergodic environments because ensemble expectations can diverge from the time-average a single deployed agent experiences, so objectives and evaluations should target long-run, sample-path performance or distributional guarantees.

In reinforcement learning, we typically aim to optimize the expected value of the sum of rewards an agent collects over a trajectory. However, if the process generating these rewards is non-ergodic, the expected value, i.e., the average over infinitely many trajectories with a given policy, is uninformative for the average over a single, but infinitely long trajectory. Thus, if we care about how the individual agent performs during deployment, the expected value is not a good optimization objective. In this paper, we discuss the impact of non-ergodic reward processes on reinforcement learning agents through an instructive example, relate the notion of ergodic reward processes to more widely used notions of ergodic Markov chains, and present existing solutions that optimize long-term performance of individual trajectories under non-ergodic reward dynamics.

Summary

Main Finding

Optimizing the expected cumulative reward (ensemble average across trajectories) can be misleading when reward-generating dynamics are non-ergodic: the ensemble expectation does not generally equal the time-average experienced by a single deployed agent. The paper demonstrates this gap with an instructive example, formalizes the relation between non-ergodic reward processes and standard notions of ergodicity for Markov chains, and surveys solution approaches that instead target long-run, single-trajectory performance.

Key Points

Problem statement
- Standard RL objective: maximize expected sum (or discounted sum) of rewards over trajectories.
- In non-ergodic environments, averaging over many trajectories (ensemble average) can differ dramatically from the time-average reward observed along a single, long trajectory.
- If deployment value is the time-average for one agent, the usual expected-value objective can lead to poor real-world outcomes.
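
The two averages contrasted above can be written out explicitly. The notation below is an assumption of this sketch (the summary does not give the paper's own symbols): r_t(ω) is the reward at step t along trajectory ω under a fixed policy π, and the ω_i are independent trajectories.

```latex
% Notation is an assumption of this sketch, not taken from the paper.
% r_t(\omega): reward at step t along trajectory \omega under a fixed policy \pi.
\begin{align}
  \text{ensemble average at step } t:\quad
    & \mathbb{E}_\pi[r_t] = \lim_{N\to\infty} \frac{1}{N}\sum_{i=1}^{N} r_t(\omega_i), \\
  \text{time average of one trajectory}:\quad
    & \bar r(\omega) = \lim_{T\to\infty} \frac{1}{T}\sum_{t=0}^{T-1} r_t(\omega), \\
  \text{ergodic reward process}:\quad
    & \bar r(\omega) = \lim_{T\to\infty} \frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}_\pi[r_t]
      \quad \text{for almost every } \omega .
\end{align}
```

When the last equality fails, the expectation can still be computed, but it no longer describes what any single deployed agent experiences in the long run.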
Ergodicity and its consequences
- Ergodic (reward) process: time averages along almost every long trajectory converge to the same value as the ensemble average.
- Non-ergodic processes admit path-dependent long-run behavior (e.g., absorbing sets, multiple invariant measures, path-dependent reinforcement) so different runs with the same policy can have different long-run averages.
- Standard Markov chain ergodicity conditions (irreducibility, positive recurrence, aperiodicity) imply ergodic reward processes when rewards depend only on chain state; lack of those properties can produce non-ergodicity.
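
A small simulation can make the contrast between ergodic and non-ergodic chains concrete. The sketch below is not from the paper: both transition matrices, the reward vectors, and the helper `time_average_reward` are hypothetical. The first chain has two absorbing states (so it is not irreducible), the second is irreducible and aperiodic.

```python
import random

def time_average_reward(P, rewards, start, n_steps, rng):
    """Average reward along one simulated trajectory of a finite Markov chain."""
    state, total = start, 0.0
    states = range(len(P))
    for _ in range(n_steps):
        total += rewards[state]
        # Sample the next state from row P[state] of the transition matrix.
        state = rng.choices(states, weights=P[state])[0]
    return total / n_steps

# Hypothetical reducible chain: state 0 is transient, states 1 and 2 are absorbing.
P_LOCK = [[0.0, 0.5, 0.5],
          [0.0, 1.0, 0.0],
          [0.0, 0.0, 1.0]]
R_LOCK = [0.0, 1.0, 0.0]  # state 1 is the high-reward regime, state 2 the low one

# Hypothetical irreducible, aperiodic chain with stationary distribution (0.5, 0.5).
P_MIX = [[0.8, 0.2],
         [0.2, 0.8]]
R_MIX = [1.0, 0.0]

rng = random.Random(0)  # fixed seed so the run is reproducible
lock_avgs = [time_average_reward(P_LOCK, R_LOCK, 0, 2000, rng) for _ in range(100)]
mix_avgs = [time_average_reward(P_MIX, R_MIX, 0, 2000, rng) for _ in range(100)]

print("non-ergodic chain: min/max time average", min(lock_avgs), max(lock_avgs))
print("ergodic chain:     min/max time average", min(mix_avgs), max(mix_avgs))
print("ensemble mean, non-ergodic chain:", sum(lock_avgs) / len(lock_avgs))
```

In the reducible chain each trajectory's time average sits near 0 or near 1 depending on which absorbing state it reached, while the mean across trajectories lies in between and describes none of them. In the irreducible chain the per-trajectory averages cluster around the stationary expected reward.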
Illustrative example (paper-provided)
- The paper uses an instructive example to show how a policy that maximizes expected reward can produce trajectories that lock into high- or low-reward regimes, so an agent’s long-term realized reward is highly uncertain and not captured by the expectation.
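
The summary does not reproduce the paper's example, so the sketch below uses a stand-in that is common in the ergodicity-economics literature: a multiplicative coin toss. The multipliers `UP` and `DOWN`, the horizon, and the number of simulated agents are all assumptions chosen for illustration. The gamble has a positive expected return per round but a negative long-run growth rate along almost every individual trajectory.

```python
import math
import random
import statistics

UP, DOWN = 1.5, 0.6  # hypothetical multipliers, each applied with probability 0.5

def simulate_wealth(n_rounds, rng):
    """Final wealth of one agent that starts at 1.0 and plays n_rounds."""
    wealth = 1.0
    for _ in range(n_rounds):
        wealth *= UP if rng.random() < 0.5 else DOWN
    return wealth

# Ensemble view: expected multiplier per round (arithmetic mean of the outcomes).
ensemble_factor = 0.5 * UP + 0.5 * DOWN
# Single-trajectory view: long-run growth factor per round (geometric mean).
time_avg_factor = math.sqrt(UP * DOWN)

rng = random.Random(0)  # fixed seed so the run is reproducible
finals = [simulate_wealth(200, rng) for _ in range(2000)]
median_final = statistics.median(finals)
share_below_start = sum(w < 1.0 for w in finals) / len(finals)

print(f"expected factor per round:             {ensemble_factor:.4f}")
print(f"time-average factor per round:         {time_avg_factor:.4f}")
print(f"median final wealth after 200 rounds:  {median_final:.3e}")
print(f"share of agents below starting wealth: {share_below_start:.2%}")
```

The expectation is carried by a few very lucky trajectories; the typical agent loses wealth over time, which is the gap between ensemble and time averages that the example in the paper is said to demonstrate.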
Existing solution approaches surveyed
- Risk-sensitive and utility-based objectives: maximize expected utility (e.g., log-utility) or minimize downside risk so policies prefer more reliable time-average outcomes.
- Distributional RL: optimize over the whole return distribution, enabling objectives like median, lower quantiles, or CVaR, which better reflect single-run guarantees.
- Almost-sure and probabilistic constraints: enforce that long-run performance exceeds thresholds with high probability (chance constraints, safe RL).
- Ergodic control and sample-path optimality: frame control objectives in terms of time averages or sample-path criteria rather than ensemble expectations.
- Robust/adversarial and model-uncertainty methods: hedge against trajectories that lead to poor long-run behavior.
- Structural fixes: modify the environment or policy so the induced chain becomes ergodic (e.g., ensuring mixing/recurrence, avoiding absorbing bad states).
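
To illustrate the log-utility item in the list above, the sketch below compares two objectives on a hypothetical repeated gamble in which the agent stakes a fraction `f` of its wealth each round. The payoffs `WIN` and `LOSS`, the probability `P_WIN`, and the grid of fractions are assumptions, and the setup is not taken from the paper.

```python
import math

WIN, LOSS = 0.5, -0.4  # hypothetical per-round returns on the staked fraction
P_WIN = 0.5            # hypothetical probability of the winning outcome

def expected_factor(f):
    """Ensemble objective: expected per-round wealth multiplier."""
    return P_WIN * (1 + f * WIN) + (1 - P_WIN) * (1 + f * LOSS)

def expected_log_growth(f):
    """Log-utility objective: expected log of the per-round multiplier."""
    return P_WIN * math.log(1 + f * WIN) + (1 - P_WIN) * math.log(1 + f * LOSS)

# Search a coarse grid of staking fractions between 0 and 1.
fractions = [i / 100 for i in range(101)]
best_f_expectation = max(fractions, key=expected_factor)
best_f_log = max(fractions, key=expected_log_growth)

print("fraction maximizing expected wealth:", best_f_expectation)
print("fraction maximizing expected log growth:", best_f_log)
print("log growth at each choice:",
      expected_log_growth(best_f_expectation),
      expected_log_growth(best_f_log))
```

The expectation-maximizing choice stakes everything and has a negative expected log growth, so a single agent following it shrinks over time. The log-utility choice stakes only part of the wealth and grows along a typical trajectory.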

Data & Methods

Analysis type: theoretical exposition plus an illustrative example, not large-scale empirical experiments.
Methods used
- Constructive example demonstrating divergence of ensemble vs. time averages under a simple non-ergodic reward process.
- Formal discussion mapping reward-process ergodicity to ergodicity concepts for Markov chains (invariant measures, recurrence, irreducibility, aperiodicity).
- Survey of existing algorithmic and theoretical approaches from the RL literature that address long-run, single-trajectory objectives (risk-sensitive RL, distributional RL, ergodic control, chance-constrained and robust methods).
Evidence: reasoning and example-driven argumentation showing how a mismatch between the standard objective and deployment value can produce undesirable outcomes; pointers to existing solution literature rather than new large-scale empirical validation.

Implications for AI Economics

Evaluation and deployment risk
- Economic value in deployment is often the realized time-average for an individual agent (e.g., firm profits, user engagement over time). Using ensemble expectations for policy selection can misstate economic value and risk.
- Policies optimized for expectation may hide high variance and tail risks that matter for stakeholders; regulators and decision-makers should prefer objectives that reflect single-run guarantees when relevant.
Incentive design and contracts
- Contracts, incentives, and compensation that rely on expected performance can encourage strategies that deliver high expected returns but poor or unreliable time-average outcomes. Incentive design should account for path-dependent risks and prefer mechanisms that reward reliable, long-run performance.
Investment and adoption decisions
- Firms deciding whether to deploy an AI system should assess time-average performance distributions (e.g., median, worst-case long-run outcomes), not just expected returns, particularly when non-ergodic dynamics (locking-in, absorbing bad states) are possible.
Policy and regulation
- Regulators concerned with systemic harms or consumer protection may require guarantees on long-run performance (probabilistic or almost-sure bounds) rather than average-case metrics.
- Ensuring environments/policies meet ergodicity-like conditions (or explicitly addressing non-ergodicity) can be a design target for safer, more predictable AI deployment.
Research agenda for AI economics
- Develop evaluation metrics and benchmarks oriented to time-average and sample-path guarantees.
- Study market and strategic interactions when agents optimize different objectives (expectation vs. time-average) and how that affects welfare and systemic risk.
- Incorporate non-ergodicity-aware objectives into economic models of AI adoption, investment under uncertainty, and regulatory design.
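
As one possible shape for such evaluation metrics, the sketch below summarizes per-trajectory long-run outcomes with the median, a lower quantile, and lower-tail CVaR next to the mean. The helper functions, the nearest-rank quantile convention, and the two outcome lists are hypothetical and are not drawn from the paper.

```python
import math
import statistics

def lower_quantile(values, alpha):
    """Empirical alpha-quantile using the nearest-rank rule (an assumed convention)."""
    ordered = sorted(values)
    k = max(1, math.ceil(alpha * len(ordered)))
    return ordered[k - 1]

def cvar_lower(values, alpha):
    """Mean of the worst alpha-fraction of outcomes (lower-tail CVaR)."""
    ordered = sorted(values)
    k = max(1, math.ceil(alpha * len(ordered)))
    return sum(ordered[:k]) / k

# Hypothetical long-run average rewards, one number per simulated trajectory.
policy_a = [2.0] * 70 + [0.0] * 30  # locks into a high or a low regime
policy_b = [1.2] * 100              # same outcome on every trajectory

for name, outcomes in [("A", policy_a), ("B", policy_b)]:
    print(name,
          "mean:", statistics.mean(outcomes),
          "median:", statistics.median(outcomes),
          "10% quantile:", lower_quantile(outcomes, 0.10),
          "10% CVaR:", cvar_lower(outcomes, 0.10))
```

Ranking by the mean favors policy A, while ranking by the lower quantile or CVaR favors policy B, which is the kind of disagreement a single-run evaluation is meant to surface.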

Assessment

Paper Type: theoretical
Evidence Strength: n/a. The paper is theoretical: it provides formal arguments and a constructed illustrative example rather than empirical or experimental evidence, so there is no causal identification to rate.
Methods Rigor: high. Uses a constructive, transparent example together with a formal mapping between reward-process ergodicity and standard Markov-chain ergodicity concepts (invariant measures, recurrence, aperiodicity), and surveys relevant RL solution approaches; the arguments are mathematically grounded though not empirically validated.
Sample: No empirical sample; analysis is based on mathematical argumentation and a constructed illustrative Markov-chain / reward-process example that demonstrates divergence between ensemble expectations and single-trajectory time averages.
Themes: governance, adoption, productivity
Generalizability
- Constructive example is simple and may not capture the full complexity of real-world deployment environments (high-dimensional state spaces, partial observability, non-Markovian dynamics).
- Assumes reward dynamics can be modeled as Markov chains with state-dependent rewards; results may need extension for partially observed or non-Markovian settings.
- Provides conceptual and theoretical guidance but lacks empirical validation on real deployed systems or large-scale simulations.
- Recommendations (e.g., enforcing ergodicity) may be difficult or costly to implement in many practical systems or economic settings.

Claims (17)

Claim	Category	Direction	Confidence	Outcome	Details
Optimizing the expected cumulative reward (ensemble average across trajectories) can be misleading when reward-generating dynamics are non-ergodic because the ensemble expectation does not generally equal the time-average experienced by a single deployed agent.	Decision Quality	negative	high	expected cumulative reward (ensemble expectation) vs. time-average realized reward on a single long trajectory	0.02
If deployment value is the time-average for one agent, optimizing the usual expected-value objective can lead to poor real-world outcomes.	Decision Quality	negative	high	realized long-run (time-average) reward of deployed agent	0.02
Ergodic reward processes are those where time averages along almost every long trajectory converge to the same value as the ensemble average.	Decision Quality	mixed	high	convergence of time-average reward to ensemble average	0.02
Non-ergodic processes admit path-dependent long-run behavior (e.g., absorbing sets, multiple invariant measures, path-dependent reinforcement), so different runs with the same policy can have different long-run averages.	Decision Quality	mixed	high	variance across realized long-run average rewards across trajectories under the same policy	0.02
Standard Markov chain ergodicity conditions (irreducibility, positive recurrence, aperiodicity) imply ergodic reward processes when rewards depend only on the chain state.	Decision Quality	mixed	high	ergodicity of reward process (equivalence to chain ergodicity when rewards are state-dependent)	0.02
Absence of irreducibility, positive recurrence, or aperiodicity in the state dynamics can produce non-ergodic reward behavior.	Decision Quality	mixed	high	presence of non-ergodic long-run reward behavior (e.g., multiple invariant measures, absorbing states)	0.02
The paper's illustrative example shows a policy that maximizes expected reward can produce trajectories that lock into high- or low-reward regimes so an agent’s long-term realized reward is highly uncertain and not captured by the expectation.	Decision Quality	negative	medium	distribution (uncertainty) of long-term realized reward across individual trajectories for a policy that maximizes expected reward	0.01
Risk-sensitive and utility-based objectives (e.g., maximize expected utility such as log-utility or minimize downside risk) can produce policies that prefer more reliable time-average outcomes compared to raw expected-reward objectives.	Decision Quality	positive	medium	time-average reliability or downside risk of realized reward under risk-sensitive/utility-based policies	0.01
Distributional reinforcement learning (optimizing the full return distribution) enables optimizing objectives such as median, lower quantiles, or CVaR which better reflect single-run guarantees.	Decision Quality	positive	medium	statistics of the return distribution (median, quantiles, CVaR) relevant to single-run guarantees	0.01
Almost-sure and probabilistic constraint methods (chance constraints, safe RL) can enforce that long-run performance exceeds thresholds with high probability, addressing single-trajectory guarantees.	Decision Quality	positive	medium	probability that long-run/time-average performance exceeds a threshold (chance constraint satisfaction)	0.01
Ergodic control and sample-path optimality formulations recast control objectives in terms of time averages or almost-sure sample-path criteria rather than ensemble expectations and are therefore appropriate for single-trajectory performance targets.	Decision Quality	positive	medium	time-average/sample-path optimality of control policies	0.01
Robust/adversarial and model-uncertainty methods can hedge against trajectories that lead to poor long-run behavior and thus mitigate risks from non-ergodic dynamics.	Decision Quality	positive	medium	worst-case or adversarial long-run reward under uncertainty	0.01
Structural fixes, i.e., altering environment design or policy class to ensure the induced Markov chain is ergodic (e.g., ensuring mixing/recurrence or preventing absorbing bad states), can eliminate the ensemble/time-average gap.	Decision Quality	positive	medium	ergodicity of induced dynamics and resulting alignment of ensemble and time-average rewards	0.01
The paper does not present large-scale empirical validation; its evidence is primarily theoretical exposition, a constructed illustrative example, and a literature survey.	Research Productivity	null_result	high	presence/absence of empirical experiments or sample-based validation	0.02
Economic evaluations and deployment decisions that rely on ensemble expectations can misstate economic value and risk because firms and users experience single time-averaged trajectories; regulators and decision-makers should therefore prefer objectives reflecting single-run guarantees when relevant.	Governance And Regulation	negative	medium	accuracy of economic valuation and risk assessment when using ensemble expectation vs. time-average metrics	0.01
Contracts and incentives based on expected performance can incentivize strategies that deliver high expected returns but poor or unreliable time-average outcomes; incentive design should account for path-dependent risks.	Governance And Regulation	negative	medium	alignment/misalignment of incentives with reliable long-run (time-average) performance	0.01
Research agenda recommendations: develop evaluation metrics and benchmarks oriented to time-average and sample-path guarantees; study market/strategic interactions when agents optimize different objectives; incorporate non-ergodicity-aware objectives into economic models of AI adoption and regulation.	Research Productivity	positive	speculative	future research outputs (metrics, benchmarks, models) and their relevance to time-average guarantees	0.0