Deep reinforcement-learning price setters quickly learn to collude tacitly: in a simulated continuous-price oligopoly, agents converge to cooperative pricing within empirically realistic timeframes, sustained by reward-punishment schemes that deter deviations.

Convergence to collusion in algorithmic pricing

Kevin Michael Frick · April 17, 2026

arXiv · theoretical · low evidence · relevance 8/10

In simulation, modern deep reinforcement learning agents in a repeated continuous-price oligopoly quickly learn cooperative, collusive pricing supported by reward-punishment strategies, with convergence times that can match empirical observations under plausible time-step assumptions.

Artificial intelligence algorithms are increasingly used by firms to set prices. Previous research shows that they can exhibit collusive behaviour, but how quickly they can do so has so far remained an open question. I show that a modern deep reinforcement learning model deployed to price goods in a repeated oligopolistic competition game with continuous prices converges to a collusive outcome in an amount of time that matches empirical observations, under reasonable assumptions on the length of a time step. This model shows cooperative behaviour supported by reward-punishment schemes that discourage deviations.

Summary

Main Finding

A modern deep reinforcement-learning algorithm (average-reward soft actor-critic, SAC) deployed in a repeated oligopolistic pricing game with continuous prices can autonomously learn and sustain collusive (supra-competitive) prices much faster than tabular Q‑learning. Under the paper’s baseline assumptions the model converges to a collusive outcome in roughly 50,000 periods, about 20 times fewer than the ≈1,000,000 periods reported in Calvano et al. (2020). Learned cooperation is maintained by implicit reward–punishment schemes that discourage deviations. Convergence is not guaranteed (deep RL is unstable), but when it succeeds the timescale is comparable to empirical observations (e.g., Assad et al., 2023).

Key Points

Empirical motivation: Assad et al. (2023) found margin increases in retail gasoline markets over a timescale of years, suggesting algorithmic learning of supra-competitive pricing. Prior experimental/theoretical work (Calvano et al., 2020) implied much longer learning times (hundreds of thousands to a million periods), creating a timescale puzzle.
Why Q‑learning is slow:
- It needs discrete state and action grids, so the Q‑matrix is large and each cell must be visited many times.
- Exploration must be largely random, which hinders learning cooperative punishments: a randomly acting opponent gives no credible punishment signal.
- Discounting and the numerical issues it brings further complicate the tabular approach.
Why SAC (policy-gradient, function approximation) is faster:
- Uses neural networks to approximate critic q(s,a;w) and actor σ(a|s;θ) — works directly in continuous price space and exploits topology of actions/states.
- Batch updates and parameter changes affect many state-action values at once (better sample efficiency).
- Entropy/KL regularisation (soft objective) creates principled, state-dependent exploration (not uniform ε‑greedy), enabling focused exploration around promising policies (including exploring punishments).
- Uses average-reward formulation rather than discounted returns, avoiding ill-posed discount issues when time steps map to short real-world intervals.
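The contrast between uniform and focused exploration can be illustrated with a small sketch. It is not the paper's implementation: the price range, the 15-point grid, and the clipped Gaussian policy are hypothetical stand-ins (SAC implementations typically squash a Gaussian rather than clip it), chosen only to show that ε‑greedy exploration spreads over the whole grid while a stochastic policy concentrates its draws around a state-dependent mean.

```python
import numpy as np

rng = np.random.default_rng(0)
price_grid = np.linspace(1.0, 2.0, 15)  # hypothetical discrete price grid

def epsilon_greedy_price(greedy_index, epsilon):
    """Tabular-style exploration: with probability epsilon pick any grid price."""
    if rng.random() < epsilon:
        return rng.choice(price_grid)  # uniform over the whole grid
    return price_grid[greedy_index]

def gaussian_policy_price(mean, std, low=1.0, high=2.0):
    """Policy-style exploration: sample around a mean the actor would output.

    In an actor-critic method both mean and std depend on the state;
    here they are passed in directly as assumed values.
    """
    return float(np.clip(rng.normal(mean, std), low, high))

uniform_draws = np.array([epsilon_greedy_price(7, 1.0) for _ in range(2000)])
focused_draws = np.array([gaussian_policy_price(1.6, 0.05) for _ in range(2000)])
```

Under these assumptions the exploratory ε‑greedy draws cover the full price range, whereas the Gaussian draws stay close to 1.6, so the second agent keeps testing small variations around its current pricing rule.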
Quantitative result: Two or more SAC agents in a Bertrand-style game with logit (deterministic) demand, constant marginal costs, no capacity constraints/entry/exit, and symmetric information tend to reach, within ≈50,000 periods, a supra-competitive price that they sustain and that is robust to deviations. If one period lasts one hour, that corresponds to ≈5.7 years (50,000 / 8,760 hours per year), which is comparable to observed market evidence.
Caveats:
- Deep RL instability: success rates are imperfect (the literature documents many failed trials in continuous control).
- Results obtained under simplified, favorable assumptions (no stochastic demand/noise, no capacity constraints, symmetric firms, etc.).
- Mapping from abstract periods to real time is an assumption; conclusions depend on that calibration.
Replication: code is available (author’s GitHub link in paper).
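The mapping from periods to calendar time behind the quantitative result above is simple arithmetic, sketched below. The one-hour period is the assumption stated in the text; the helper name `periods_to_years` is introduced here purely for illustration.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours; leap years ignored

def periods_to_years(n_periods: float, hours_per_period: float) -> float:
    """Convert a number of pricing periods into calendar years."""
    return n_periods * hours_per_period / HOURS_PER_YEAR

# Convergence times quoted in the summary, with hourly repricing assumed.
sac_years = periods_to_years(50_000, hours_per_period=1.0)
tabular_years = periods_to_years(1_000_000, hours_per_period=1.0)

# Ratio of the two convergence times, independent of the period length.
speedup = 1_000_000 / 50_000
```

With hourly repricing, 50,000 periods come to roughly 5.7 years and 1,000,000 periods to roughly 114 years; the ratio between the two is 20 whatever period length is assumed.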

Data & Methods

Economic environment:
- Repeated oligopolistic pricing (Bertrand-style) with continuous price choices.
- Demand: logit, non-stochastic in the baseline.
- Firms: constant marginal cost, symmetric, no capacity constraints, no entry/exit, full observability of past prices (state includes past prices).
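As a rough illustration of this environment, the sketch below computes deterministic logit demand and per-period profit for a vector of prices. The functional form is the standard multinomial logit with an outside good; the parameter values (`quality`, `outside`, `mu`, `cost`) and the example prices are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def logit_demand(prices, quality=2.0, outside=0.0, mu=0.25):
    """Deterministic logit demand share of each firm (assumed parameters)."""
    prices = np.asarray(prices, dtype=float)
    weights = np.exp((quality - prices) / mu)
    # The outside good keeps total demand below one.
    return weights / (weights.sum() + np.exp(outside / mu))

def profits(prices, cost=1.0, **demand_kwargs):
    """Per-period profit with constant marginal cost and no capacity limit."""
    prices = np.asarray(prices, dtype=float)
    return (prices - cost) * logit_demand(prices, **demand_kwargs)

symmetric = profits([1.6, 1.6])  # both firms charge the same price
undercut = profits([1.5, 1.6])   # firm 0 undercuts firm 1
```

With these assumed parameters, undercutting raises firm 0's profit and lowers firm 1's, which is the short-run temptation that any reward-punishment scheme has to offset.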
Algorithmic method:
- Average-reward soft actor-critic (SAC) variant (Haarnoja et al., 2018; Adamczyk et al., 2025) adapted to average-reward/differential value functions.
- Actor–critic networks:
  - Critic q(s,a;w) trained to fit observed action values (with average-reward and entropy term).
  - Actor σ(a|s;θ) trained by policy gradient to maximize q − α log σ (entropy-regularized objective).
  - Temperature α (entropy weight) is iteratively adjusted to meet a target KL/entropy constraint (state-dependent exploration intensity).
- Exploration: implicit via entropy/KL regularisation (encourages randomness but allows lower exploration in states with clear high-value actions).
- Objective: maximize steady-state average profit ¯π(σ) (average-reward criterion) and estimate differential value functions vσ and qσ in that framework.
- Implementation details and technical derivations in paper appendices (A & B); code available at author’s repository.
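The critic and temperature updates described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's code: it uses a target entropy (rather than a KL constraint), plain gradient descent on the temperature, and scalar inputs; the function names are hypothetical.

```python
import numpy as np

def soft_differential_td_error(reward, avg_reward, q_sa, q_next, logp_next, alpha):
    """TD error for an entropy-regularised average-reward critic.

    No discount factor appears: the running estimate of the average
    reward is subtracted from each reward instead.
    """
    soft_value_next = q_next - alpha * logp_next  # q minus alpha * log-probability
    return reward - avg_reward + soft_value_next - q_sa

def update_temperature(alpha, logp_batch, target_entropy, lr=1e-3):
    """One gradient step on the usual SAC temperature loss
    J(alpha) = mean(-alpha * (log_prob + target_entropy))."""
    grad = -np.mean(np.asarray(logp_batch) + target_entropy)
    return max(alpha - lr * grad, 0.0)  # keep the temperature non-negative

delta = soft_differential_td_error(
    reward=0.30, avg_reward=0.25, q_sa=1.0, q_next=1.1, logp_next=-0.5, alpha=0.2
)
new_alpha = update_temperature(0.2, [-0.5, -0.5], target_entropy=1.0)
```

In the example the sampled entropy estimate (0.5) is below the assumed target (1.0), so the temperature rises slightly and the policy is pushed to explore more.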
Comparisons:
- Contrasted with tabular Q‑learning (Calvano et al., 2020): Q‑learning required discretisation (e.g., 3,375 state-action cells in the cited baseline) and very large sample sizes to visit each cell often enough, producing extremely long convergence times.
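The cell count quoted above follows from simple counting. A minimal sketch, assuming a grid of 15 prices, 2 firms, and one period of memory (values chosen here because they reproduce the 3,375 figure; they are not stated in this summary):

```python
def q_table_cells(n_prices: int, n_firms: int, memory: int) -> int:
    """State-action cells in one agent's Q-matrix.

    The state is the last `memory` price vectors of all firms, giving
    n_prices ** (n_firms * memory) states, each with n_prices actions.
    """
    n_states = n_prices ** (n_firms * memory)
    return n_states * n_prices

cells = q_table_cells(n_prices=15, n_firms=2, memory=1)  # 225 states x 15 actions
cells_three_firms = q_table_cells(n_prices=15, n_firms=3, memory=1)
```

Under the same assumptions a third firm raises the count from 3,375 to 50,625 cells, which illustrates how quickly the tabular representation grows.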
Experimental metrics:
- Convergence times measured in periods to reach and sustain supra-competitive prices and robustness to unilateral deviation.
- Sensitivity checks discussed qualitatively; main baseline uses deterministic logit demand and symmetric firms.

Implications for AI Economics

Antitrust & policy relevance:
- Modern multi-agent RL methods can learn collusive pricing at empirically relevant timescales without explicit coordination or communication among firms. This strengthens concerns about algorithmic pricing as a threat to competition.
- Enforcement/detection becomes harder: collusion arises from learning dynamics and implicit reward–punishment strategies, not from explicit agreements.
- Regulatory implications: consider rules on algorithmic pricing deployment (e.g., auditability, transparency, limits on dynamic pricing information sharing, constraints on the ability to condition on competitors’ prices), and investigate mandatory safeguards in pricing ML systems.
Research directions:
- Robustness needs to be tested against heterogeneous firms, stochastic/noisy demand, capacity constraints, entry/exit, partial observability, real data frequencies, and market complexity.
- Evaluate detection methods that can distinguish learned collusion from competitive pricing or price parallelism due to common shocks or cost changes.
- Study the distribution of outcomes across randomized trials and environments (success rates, failure modes) to quantify practical risks.
- Explore algorithmic design interventions that prevent collusion (e.g., limiting state information, restricting conditioning on rivals’ actions, regulatory constraints on learning objectives).
Theory:
- Reinforces that repeated‑game folk-theorem–like outcomes can emerge through learning algorithms implementing reward–punishment strategies even when agents are not explicitly “rational” in the classical sense.
- Suggests incorporating algorithmic learning dynamics and sample-efficiency considerations into antitrust theory and empirical models of market conduct.
Methodological:
- Experimental and theoretical work on algorithmic collusion should move beyond tabular Q‑learning as a benchmark; modern policy-gradient, function-approximation methods can yield qualitatively different (and faster) dynamics.
- Careful mapping from simulation periods to real time is crucial when drawing policy conclusions from lab/algorithmic experiments.

Overall, the paper shows that state-of-the-art deep RL (average-reward SAC) can narrow the gap between laboratory/theoretical convergence times and observed market patterns of margin increases following algorithm adoption, reinforcing policy concerns about algorithmic collusion while highlighting important limitations and directions for further robustness work.

Assessment

Paper Type: theoretical
Evidence Strength: low. Findings are based on simulations of deep reinforcement learning agents in stylized repeated-oligopoly games rather than on real-world firm-level or field experimental data; while the simulation matches empirical timing under assumptions, it cannot establish causal effects in markets without external validation.
Methods Rigor: medium. Uses contemporary deep reinforcement learning methods and analyzes emergent reward-punishment strategies in a continuous-price repeated game, which is appropriate for the research question; however, rigor is limited by reliance on model specification choices, sensitivity to hyperparameters, and absence of robustness checks or empirical validation reported in the abstract.
Sample: Simulated environment consisting of a repeated oligopolistic competition game with continuous prices populated by modern deep reinforcement learning agents; analysis focuses on convergence time to collusive outcomes under specified time-step length and reward structures (details such as number of firms, exact algorithm/hyperparameters, and demand specification are not provided in the abstract).
Themes: governance, adoption
Generalizability:
- Simulation results may not generalize to real-world firms with differing objectives, constraints, and informational frictions.
- Outcome depends on model assumptions (reward functions, observation structure, ability to punish) and chosen hyperparameters.
- Assumes a particular time-step length mapping to real-world decision intervals, which may not hold across industries.
- Simplified market environment (no demand shocks, heterogeneous products, capacity limits, entry/exit, or regulatory constraints).
- Small-number oligopoly setting may not extend to markets with many firms or frequent entry.

Claims (4)

Claim	Category	Direction	Confidence	Outcome	Details
Artificial intelligence algorithms are increasingly used by firms to set prices.	Adoption Rate	positive	medium	use/adoption of AI algorithms for pricing by firms	0.04
Previous research shows that [pricing] algorithms can exhibit collusive behaviour.	Market Structure	positive	high	occurrence of collusive behaviour by pricing algorithms	0.12
A modern deep reinforcement learning model deployed to price goods in a repeated oligopolistic competition game with continuous prices converges to a collusive outcome in an amount of time that matches empirical observations (under reasonable assumptions on the length of a time step).	Task Completion Time	positive	high	time to converge to a collusive pricing outcome	0.12
The model shows cooperative behaviour supported by reward-punishment schemes that discourage deviations.	Market Structure	positive	high	presence of cooperative behaviour and mechanisms (reward-punishment) that deter deviation	0.12