Deep reinforcement-learning price setters learn to tacitly collude fast: in a simulated continuous-price oligopoly, agents converge to cooperative pricing within empirically realistic timeframes, sustained by reward-punishment schemes that deter deviations.
Artificial intelligence algorithms are increasingly used by firms to set prices. Previous research shows that they can exhibit collusive behaviour, but how quickly they can do so has so far remained an open question. I show that a modern deep reinforcement learning model deployed to price goods in a repeated oligopolistic competition game with continuous prices converges to a collusive outcome in an amount of time that matches empirical observations, under reasonable assumptions on the length of a time step. This model shows cooperative behaviour supported by reward-punishment schemes that discourage deviations.
Summary
Main Finding
A modern deep reinforcement-learning algorithm (average-reward soft actor-critic, SAC) deployed in a repeated oligopolistic pricing game with continuous prices can autonomously learn and sustain collusive (supracompetitive) prices much faster than tabular Q‑learning. Under the paper’s baseline assumptions the model converges to a collusive outcome in roughly 50,000 periods, about 20 times fewer than the ≈1,000,000 periods reported in Calvano et al. (2020). Learned cooperation is maintained by implicit reward–punishment schemes that discourage deviations. Convergence is not guaranteed (deep RL training is unstable), but when it succeeds the timescale is comparable to empirical observations (e.g., Assad et al., 2023).
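The timescale comparison is simple arithmetic once a period length is fixed. The sketch below assumes one period equals one hour (the example used under Key Points) and a 365-day year; the helper name `periods_to_years` is hypothetical and not taken from the paper.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: non-leap year, firms reprice around the clock


def periods_to_years(periods, hours_per_period=1.0):
    """Convert simulation periods to calendar years for a given period length."""
    return periods * hours_per_period / HOURS_PER_YEAR


sac_periods = 50_000            # approximate SAC convergence time quoted in the summary
q_learning_periods = 1_000_000  # approximate tabular Q-learning figure quoted in the summary

sac_years = periods_to_years(sac_periods)                # about 5.7 years
q_learning_years = periods_to_years(q_learning_periods)  # about 114 years
speedup = q_learning_periods / sac_periods               # 20.0

print(f"SAC: {sac_years:.1f} years, Q-learning: {q_learning_years:.0f} years, ratio: {speedup:.0f}x")
```

Changing `hours_per_period` shows how strongly the real-time interpretation depends on the assumed period length.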
Key Points
- Empirical motivation: Assad et al. (2023) found margin increases in retail gasoline markets over a timescale of years, suggesting algorithmic learning of supra-competitive pricing. Prior experimental/theoretical work (Calvano et al., 2020) implied much longer learning times (hundreds of thousands to a million periods), creating a timescale puzzle.
- Why Q‑learning is slow:
  - It needs discrete state/action grids (a large Q‑matrix), so it must visit many cells repeatedly.
  - Exploration is largely random, which hinders learning cooperative punishments: a rival that behaves randomly sends no credible punishment signal.
  - Discounting and numerical issues further complicate the tabular approach.
- Why SAC (policy-gradient, function approximation) is faster:
  - Uses neural networks to approximate the critic q(s,a;w) and the actor σ(a|s;θ), so it works directly in continuous price space and exploits the topology of actions and states.
  - Batch updates and parameter changes affect many state-action values at once (better sample efficiency).
  - Entropy/KL regularisation (soft objective) creates principled, state-dependent exploration (not uniform ε‑greedy), enabling focused exploration around promising policies (including exploring punishments).
  - Uses an average-reward formulation rather than discounted returns, avoiding ill-posed discount issues when time steps map to short real-world intervals.
- Quantitative result: Two or more SAC agents in a Bertrand-style game with logit (deterministic) demand, constant marginal costs, no capacity constraints or entry/exit, and symmetric information tend to reach, within ≈50,000 periods, a supra-competitive price that is sustained and robust to deviations. If one period is one hour, that corresponds to ≈5.7 years, comparable to observed market evidence.
- Caveats:
  - Deep RL instability: success rates are imperfect (the literature documents many failed trials in continuous control).
  - Results were obtained under simplified, favorable assumptions (no stochastic demand/noise, no capacity constraints, symmetric firms, etc.).
  - The mapping from abstract periods to real time is an assumption; conclusions depend on that calibration.
- Replication: code is available (author’s GitHub link in paper).
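The pricing game described in the quantitative result (logit demand, constant marginal cost, symmetric firms, no capacity constraints) can be sketched in a few lines. This is a minimal illustration, not the paper's code: the parameter values and the names `logit_demand`, `profits` and `step` are assumptions chosen for the example.

```python
import numpy as np

# All parameter values are illustrative assumptions, not the paper's calibration.
QUALITY = 2.0        # product quality index, identical across firms (symmetry)
OUTSIDE_GOOD = 0.0   # quality index of the outside option
MU = 0.25            # degree of horizontal differentiation
MARGINAL_COST = 1.0  # constant marginal cost


def logit_demand(prices):
    """Deterministic logit demand shares for a vector of prices."""
    prices = np.asarray(prices, dtype=float)
    attraction = np.exp((QUALITY - prices) / MU)
    return attraction / (attraction.sum() + np.exp(OUTSIDE_GOOD / MU))


def profits(prices):
    """Per-firm profit: margin times demand share."""
    prices = np.asarray(prices, dtype=float)
    return (prices - MARGINAL_COST) * logit_demand(prices)


def step(prices):
    """One period: the next state is the observed price vector, rewards are profits."""
    prices = np.asarray(prices, dtype=float)
    return prices.copy(), profits(prices)


low = profits([1.5, 1.5])       # both firms price low
high = profits([1.9, 1.9])      # both firms price high
undercut = profits([1.7, 1.9])  # firm 0 undercuts a high-pricing rival
print(low, high, undercut)
```

With these assumed numbers, both firms earn more at the higher common price, yet a unilateral undercut raises the deviator's profit and lowers the rival's. That tension is why sustained high prices require a credible punishment for deviations.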
Data & Methods
- Economic environment:
  - Repeated oligopolistic pricing (Bertrand-style) with continuous price choices.
  - Demand: logit, non-stochastic in the baseline.
  - Firms: constant marginal cost, symmetric, no capacity constraints, no entry/exit, full observability of past prices (the state includes past prices).
- Algorithmic method:
  - Average-reward soft actor-critic (SAC) variant (Haarnoja et al., 2018; Adamczyk et al., 2025) adapted to average-reward/differential value functions.
  - Actor–critic networks:
    - Critic q(s,a;w) trained to fit observed action values (with average-reward and entropy terms).
    - Actor σ(a|s;θ) trained by policy gradient to maximize q − α log σ (entropy-regularized objective).
  - Temperature α (entropy weight) is iteratively adjusted to meet a target KL/entropy constraint (state-dependent exploration intensity).
  - Exploration: implicit via entropy/KL regularisation (encourages randomness but allows less exploration in states with clearly high-value actions).
  - Objective: maximize the steady-state average profit π̄(σ) (average-reward criterion) and estimate the differential value functions v_σ and q_σ in that framework.
  - Implementation details and technical derivations are in the paper's appendices (A & B); code is available at the author’s repository.
- Comparisons:
  - Contrasted with tabular Q‑learning (Calvano et al., 2020): Q‑learning requires discretisation (e.g., 3,375 = 15³ state-action cells in the cited baseline) and very large samples so that each cell is visited often enough, which produces extremely long convergence times.
- Experimental metrics:
  - Convergence time is measured as the number of periods needed to reach and sustain supra-competitive prices that are robust to unilateral deviation.
  - Sensitivity checks are discussed qualitatively; the main baseline uses deterministic logit demand and symmetric firms.
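The entropy-regularized actor objective and the temperature adjustment listed under the algorithmic method can be illustrated with a small NumPy sketch. It uses the temperature rule common in SAC implementations rather than the paper's exact average-reward version; the Gaussian policy, the function names, the learning rate and the target entropy are all assumptions made for the example.

```python
import numpy as np


def gaussian_log_prob(a, mean, std):
    """Elementwise log-density of a under N(mean, std^2)."""
    return -0.5 * ((a - mean) / std) ** 2 - np.log(std) - 0.5 * np.log(2.0 * np.pi)


def actor_objective(q_values, log_probs, alpha):
    """Monte Carlo estimate of E[q - alpha * log sigma] over sampled actions."""
    return float(np.mean(q_values - alpha * log_probs))


def update_temperature(log_alpha, log_probs, target_entropy, lr=0.01):
    """One gradient step on J(alpha) = E[-alpha * (log sigma + target_entropy)].

    alpha is parameterised as exp(log_alpha) so it stays positive.
    """
    entropy_estimate = -float(np.mean(log_probs))
    # dJ/d(alpha) = entropy_estimate - target_entropy; chain rule through exp.
    grad_log_alpha = np.exp(log_alpha) * (entropy_estimate - target_entropy)
    return log_alpha - lr * grad_log_alpha


rng = np.random.default_rng(0)
mean, std = 1.5, 0.05                 # hypothetical policy output for one state
actions = rng.normal(mean, std, size=1000)
log_probs = gaussian_log_prob(actions, mean, std)
q_values = np.zeros_like(actions)     # placeholder critic values (assumption)

log_alpha = np.log(0.2)
new_log_alpha = update_temperature(log_alpha, log_probs, target_entropy=-1.0)
print(actor_objective(q_values, log_probs, np.exp(log_alpha)), new_log_alpha)
```

In this example the policy is narrower than the target entropy allows, so the update raises α and the actor is pushed to explore more. When the policy is more random than the target, the same rule lowers α.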
Implications for AI Economics
- Antitrust & policy relevance:
  - Modern multi-agent RL methods can learn collusive pricing at empirically relevant timescales without explicit coordination or communication among firms. This strengthens concerns about algorithmic pricing as a threat to competition.
  - Enforcement/detection becomes harder: collusion arises from learning dynamics and implicit reward–punishment strategies, not from explicit agreements.
  - Regulatory implications: consider rules on algorithmic pricing deployment (e.g., auditability, transparency, limits on sharing dynamic-pricing information, constraints on the ability to condition on competitors’ prices), and investigate mandatory safeguards in ML pricing systems.
- Research directions:
  - Test robustness to heterogeneous firms, stochastic/noisy demand, capacity constraints, entry/exit, partial observability, real data frequencies, and market complexity.
  - Evaluate detection methods that can distinguish learned collusion from competitive pricing or price parallelism due to common shocks or cost changes.
  - Study the distribution of outcomes across randomized trials and environments (success rates, failure modes) to quantify practical risks.
  - Explore algorithmic design interventions that prevent collusion (e.g., limiting state information, restricting conditioning on rivals’ actions, regulatory constraints on learning objectives).
- Theory:
  - Reinforces that repeated‑game folk-theorem–like outcomes can emerge through learning algorithms implementing reward–punishment strategies even when agents are not explicitly “rational” in the classical sense.
  - Suggests incorporating algorithmic learning dynamics and sample-efficiency considerations into antitrust theory and empirical models of market conduct.
- Methodological:
  - Experimental and theoretical work on algorithmic collusion should move beyond tabular Q‑learning as a benchmark; modern policy-gradient, function-approximation methods can yield qualitatively different (and faster) dynamics.
  - Careful mapping from simulation periods to real time is crucial when drawing policy conclusions from lab/algorithmic experiments.
Overall, the paper shows that state-of-the-art deep RL (average-reward SAC) can narrow the gap between laboratory/theoretical convergence times and observed market patterns of margin increases following algorithm adoption, reinforcing policy concerns about algorithmic collusion while highlighting important limitations and directions for further robustness work.
Assessment
Claims (4)
| Claim | Category | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|---|
| Artificial intelligence algorithms are increasingly used by firms to set prices. | Adoption Rate | positive | medium | use/adoption of AI algorithms for pricing by firms | 0.04 |
| Previous research shows that [pricing] algorithms can exhibit collusive behaviour. | Market Structure | positive | high | occurrence of collusive behaviour by pricing algorithms | 0.12 |
| A modern deep reinforcement learning model deployed to price goods in a repeated oligopolistic competition game with continuous prices converges to a collusive outcome in an amount of time that matches empirical observations (under reasonable assumptions on the length of a time step). | Task Completion Time | positive | high | time to converge to a collusive pricing outcome | 0.12 |
| The model shows cooperative behaviour supported by reward-punishment schemes that discourage deviations. | Market Structure | positive | high | presence of cooperative behaviour and mechanisms (reward-punishment) that deter deviation | 0.12 |