Contracts and neutral mediators restore cooperation between powerful LLM agents, while repeated play and reputations often do not; cooperation that appears under repeated interaction collapses when partners change, but mechanisms that enforce contingent payments or delegate decisions sustain cooperative equilibria.

CoopEval: Benchmarking Cooperation-Sustaining Mechanisms and LLM Agents in Social Dilemmas

Emanuel Tewolde, Xiao Zhang, David Guzman Piedrahita, Vincent Conitzer, Zhijing Jin · April 16, 2026

Source: arXiv · Paper type: quasi_experimental · Evidence strength: medium · Relevance: 8/10

In simulated social-dilemma games, third-party mediation and enforceable contracts produce robust cooperation among capable LLM agents, whereas repetition and simple reputation schemes often fail—especially when co-players vary.

It is increasingly important that LLM agents interact effectively and safely with other goal-pursuing agents, yet, recent works report the opposite trend: LLMs with stronger reasoning capabilities behave _less_ cooperatively in mixed-motive games such as the prisoner's dilemma and public goods settings. Indeed, our experiments show that recent models -- with or without reasoning enabled -- consistently defect in single-shot social dilemmas. To tackle this safety concern, we present the first comparative study of game-theoretic mechanisms that are designed to enable cooperative outcomes between rational agents _in equilibrium_. Across four social dilemmas testing distinct components of robust cooperation, we evaluate the following mechanisms: (1) repeating the game for many rounds, (2) reputation systems, (3) third-party mediators to delegate decision making to, and (4) contract agreements for outcome-conditional payments between players. Among our findings, we establish that contracting and mediation are most effective in achieving cooperative outcomes between capable LLM models, and that repetition-induced cooperation deteriorates drastically when co-players vary. Moreover, we demonstrate that these cooperation mechanisms become _more effective_ under evolutionary pressures to maximize individual payoffs.

Summary

Main Finding

LLM agents, including recent reasoning-enabled models, consistently defect in single‑shot social dilemmas. However, game‑theoretically grounded cooperation mechanisms differ sharply in their ability to induce cooperation among heterogeneous LLMs: mediated decision‑making and enforceable contracts are the most effective, repetition‑based cooperation is fragile (especially under changing co‑players), and all mechanisms become more effective when societies are subject to evolutionary selection pressures that favor higher payoffs. The authors also prove (Theorem 1) that each mechanism can implement Pareto‑improvements over a base game’s Nash equilibrium as a subgame‑perfect equilibrium.

Key Points

Baseline behavior: Across a variety of social dilemmas, modern LLMs defect in single‑shot settings regardless of size or whether reasoning (chain‑of‑thought) is enabled.
Mechanisms compared: Repetition (iterated play), Reputation (history sharing under rematching), Mediation (delegation to a public third‑party mediator), and Contract (outcome‑conditional inter‑player transfers). A no‑op baseline and a coordination game are included.
Relative effectiveness:
- Contracting and Mediation: most effective at producing cooperative, Pareto‑improving outcomes among capable LLMs.
- Reputation: can help but effectiveness depends on the kind/amount of history available and on rematching structure.
- Repetition: can sustain direct reciprocity but cooperation deteriorates markedly when co‑players vary or population heterogeneity is high.
Evolutionary dynamics: Simulating replicator dynamics (selection for higher payoffs) increases the prevalence of cooperative strategies under the mechanisms—suggesting robustness to stronger, selfish models.
Agent reasoning: LLMs’ decisions and their chain‑of‑thought justifications are largely consistent with self‑interested utility maximization and strategic equilibrium reasoning; they understand when cooperation is instrumentally optimal under mechanisms.
Practical note: The Gemini 3 family performed best among the tested models.
Reproducibility: Benchmark and code released (GitHub: https://github.com/Xiao215/CoopEval).
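
The fragility of repetition under changing co-players can be illustrated with the textbook grim-trigger condition. The sketch below is not taken from the paper: the payoffs `T`, `R`, `P`, the continuation probability `delta`, and the partner-retention probability `q` are hypothetical, and it assumes a defection can only be punished by the same partner, with no reputation carried over to new partners. Under that assumption the effective continuation probability is `delta * q`.

```python
def cooperation_value(R: float, delta: float) -> float:
    """Expected total payoff from mutual cooperation when each round is
    followed by another round with probability delta."""
    return R / (1.0 - delta)


def deviation_value(T: float, P: float, delta: float) -> float:
    """Expected total payoff from defecting once against a grim-trigger
    partner: temptation payoff now, mutual defection in all later rounds."""
    return T + delta * P / (1.0 - delta)


def grim_trigger_threshold(T: float, R: float, P: float) -> float:
    """Smallest continuation probability at which cooperation is sustainable."""
    return (T - R) / (T - P)


def sustains_cooperation(T: float, R: float, P: float, delta: float) -> bool:
    """True if cooperating is at least as good as a one-shot deviation."""
    return cooperation_value(R, delta) >= deviation_value(T, P, delta)


# Hypothetical payoffs: temptation 5, reward 3, punishment 1.
T, R, P = 5.0, 3.0, 1.0
delta = 0.9  # assumed probability that the game continues

fixed_partner = sustains_cooperation(T, R, P, delta * 1.0)     # q = 1.0
frequent_rematch = sustains_cooperation(T, R, P, delta * 0.5)  # q = 0.5
```

With these illustrative numbers the threshold is 0.5, so cooperation is sustainable with a fixed partner (effective continuation 0.9) but not when the partner is retained only half the time (effective continuation 0.45).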

Data & Methods

Games: The four canonical social dilemmas are the Prisoner’s Dilemma, the Traveler’s Dilemma, a simultaneous-move variant of the Trust Game, and a 3‑player Public Goods game; a coordination/cooperation baseline is also included.
Mechanism implementations: Standard game‑theoretic formulations:
- Repetition: stochastic continuation (discounting) to allow direct reciprocity.
- Reputation: rematching with visibility into co‑players’ past interactions (first/higher‑order history variants considered).
- Mediation: public mediator whose plan is known; players can choose to delegate and the mediator acts conditional on who delegated.
- Contract: conditional zero‑sum transfers between players contingent on actions.
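
To make the contract and mediation definitions concrete, here is a minimal sketch on a toy two-player Prisoner's Dilemma. It is not the benchmark's implementation: the payoff values, the `fine` parameter, the action label `"M"` for delegation, and the mediator's rule (cooperate only if both players delegate) are all illustrative assumptions.

```python
from itertools import product

# Hypothetical Prisoner's Dilemma payoffs (not from the paper).
T, R, P, S = 5, 3, 1, 0
BASE = {("C", "C"): (R, R), ("C", "D"): (S, T),
        ("D", "C"): (T, S), ("D", "D"): (P, P)}


def with_contract(game, fine):
    """Assumed contract: every defector pays `fine` to the co-player.
    The transfer is zero-sum, so total payoff in each cell is unchanged."""
    out = {}
    for (a, b), (u, v) in game.items():
        # Net amount received by the row player (bools act as 0/1).
        transfer = fine * ((b == "D") - (a == "D"))
        out[(a, b)] = (u + transfer, v - transfer)
    return out


def with_mediator(game):
    """Add a delegate action 'M'. Assumed mediator plan: play C for a
    delegator if both players delegate, otherwise play D for that delegator."""
    actions = ["C", "D", "M"]

    def resolve(own, other):
        if own != "M":
            return own
        return "C" if other == "M" else "D"

    return {(a, b): game[(resolve(a, b), resolve(b, a))]
            for a, b in product(actions, repeat=2)}


def pure_nash(game):
    """Return all pure-strategy Nash equilibria of a two-player game."""
    rows = sorted({a for a, _ in game})
    cols = sorted({b for _, b in game})
    equilibria = []
    for a, b in product(rows, cols):
        u, v = game[(a, b)]
        row_ok = all(game[(a2, b)][0] <= u for a2 in rows)
        col_ok = all(game[(a, b2)][1] <= v for b2 in cols)
        if row_ok and col_ok:
            equilibria.append((a, b))
    return equilibria


base_eq = pure_nash(BASE)
contract_eq = pure_nash(with_contract(BASE, fine=3))
mediator_eq = pure_nash(with_mediator(BASE))
```

In this toy setup the base game has mutual defection as its only pure equilibrium, the contract with a fine of 3 makes mutual cooperation the only pure equilibrium, and the mediator adds mutual delegation as an equilibrium alongside mutual defection.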
LLM agents: Six diverse LLMs of varying capabilities were evaluated (including Gemini 3 variants). Models were tested in exhaustive cross‑play (all pairings) to form heterogeneous populations.
Evaluation metrics:
- Average payoffs per mechanism × game × population pairing.
- Evolutionary analysis via replicator dynamics to approximate how populations adapt under payoff‑maximizing selection.
- Deviation ratings / rankings (how attractive is unilateral deviation).
- Qualitative and quantitative analysis of model-generated chain‑of‑thought using an LLM judge to assess rationales for actions.
Scale: The benchmark comprises 20+ cooperation problems resulting from the factorized combinations of games and mechanisms (full code and experimental details in the repository).
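
The evolutionary analysis listed under the evaluation metrics can be sketched with a discrete-time replicator update. This is a minimal illustration under stated assumptions, not the paper's procedure: `payoff[i][j]` is taken to be the average payoff of strategy (or model) `i` against `j`, the two example matrices are hypothetical Prisoner's Dilemma payoffs with and without a defection fine, and payoffs are assumed non-negative so that shares stay valid.

```python
def replicator_step(shares, payoff):
    """One discrete-time replicator update.

    shares[i] is the population share of strategy i; payoff[i][j] is the
    payoff to strategy i against strategy j. Assumes non-negative payoffs
    and a strictly positive population-average payoff.
    """
    n = len(shares)
    fitness = [sum(payoff[i][j] * shares[j] for j in range(n)) for i in range(n)]
    mean_fitness = sum(s * f for s, f in zip(shares, fitness))
    # Strategies earning more than the population average grow in share.
    return [s * f / mean_fitness for s, f in zip(shares, fitness)]


def evolve(shares, payoff, steps=200):
    """Apply the replicator update repeatedly and return the final shares."""
    for _ in range(steps):
        shares = replicator_step(shares, payoff)
    return shares


# Rows/columns: index 0 = cooperate, index 1 = defect (hypothetical payoffs).
BASE_PD = [[3, 0], [5, 1]]       # no mechanism: defection dominates
CONTRACT_PD = [[3, 3], [2, 1]]   # assumed fine of 3 paid by a defector

base_final = evolve([0.5, 0.5], BASE_PD)
contract_final = evolve([0.5, 0.5], CONTRACT_PD)
```

Starting from an even split, the cooperating share shrinks toward zero under the base payoffs and grows toward one under the contract payoffs, which is the qualitative pattern the summary describes for selection under payoff-maximizing pressure.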

Implications for AI Economics

Mechanism design matters: In AI‑mediated markets, platforms, and multi‑agent systems, the institutional choice of cooperation mechanism (contract enforcement, mediated decision rules, reputation design, match/rematch policies) will strongly shape aggregate outcomes. Relying on agents’ intrinsic prosociality or prompting alone is fragile.
Contracts and trusted mediators are high‑value tools: If enforceable contracts and trustworthy mediators can be implemented (with low frictions and verifiable commitments), they are likely the most effective levers to elicit cooperative outcomes among strategic LLM agents operating for self‑interest (e.g., marketplaces, trading platforms, shared infrastructure provisioning).
Reputation systems require careful design: Reputation can work but needs appropriate visibility and stability of partner pools. In high turnover settings, reputation/repetition effects degrade—so marketplaces with frequent rematching should prioritize stronger mechanisms (contracts/mediation) or hybrid approaches.
Evolutionary pressures favor cooperative institutions: When economic environments select for higher‑payoff agents/strategies, cooperation mechanisms scale better—suggesting platform designers can combine selection incentives (e.g., reward/penalty structures) with institutional mechanisms for durable cooperation.
Practical policy considerations:
- Enforceability and transaction costs: Theoretical gains from contracts/mediation assume enforceability; real‑world frictions, monitoring costs, and legal/policy constraints must be addressed.
- Strategic manipulation and information design: Agents understand and exploit game structure—mechanism designs must anticipate strategic misreporting, gaming of reputation, and mediator capture.
- Mixed human‑AI systems: Extending these results to human–LLM mixes requires accounting for human bounded rationality, differing preferences, and legal liability.
Research directions for AI economics:
- Quantify enforcement/friction costs for contracts and mediators and their effect on cooperation.
- Study scalability (many‑player public goods, marketplaces), richer preference heterogeneity, and dynamic entry/exit.
- Evaluate hybrid mechanisms (e.g., reputation + contracts) and robustness under adversarial/manipulative agents.
- Incorporate transaction costs, information asymmetries, and institutional constraints to better predict real‑world outcomes.

Reference: CoopEval: Benchmarking Cooperation‑Sustaining Mechanisms and LLM Agents in Social Dilemmas (Tewolde et al., 2026). Code and experimental suite: https://github.com/Xiao215/CoopEval.

Assessment

Paper Type: quasi_experimental
Evidence Strength: medium. The paper provides clear, repeatable experimental contrasts in a controlled environment and evaluates multiple mechanisms across distinct social dilemmas, giving credible evidence about how those mechanisms affect behavior of current LLMs in simulation; however, causal claims about real-world economic outcomes are limited because agents are model simulations without real monetary incentives, the set of models and prompts may be narrow, and results may depend on implementation details and training-data artifacts, reducing external validity.
Methods Rigor: medium. Design uses multiple game types, multiple cooperation mechanisms, and evolutionary robustness checks which show methodological care; but the rigor is constrained by likely limited model variety, unspecified randomization/pre-registration, potential sensitivity to prompt design, unclear sample sizes / seed variation, and lack of real-world incentive alignment or human-in-the-loop validation.
Sample: Simulated agents implemented with several contemporary LLMs (models with varying reasoning capabilities), evaluated across four canonical social-dilemma games (single-shot and repeated settings) under four mechanism treatments (repetition, reputation systems, third-party mediation, and contract agreements), plus evolutionary selection simulations that iteratively favor payoff-maximizing strategies; experiments report cooperation/defection rates and welfare outcomes across these conditions.
Themes: human_ai_collab, governance, org_design
Identification: Controlled simulation experiments that compare cooperation outcomes across alternative game-theoretic mechanism 'treatments' (repetition, reputation, mediation, contracting) applied to the same set of LLM agents and game environments; causal inference rests on within-environment contrasts and randomized or counterbalanced assignment of mechanisms and co-player types, plus evolutionary-selection simulations to test robustness under payoff-maximizing pressures. No instrumental variables or natural experiments; identification therefore depends on the internal controls of the simulated lab (same games, same prompts, varied mechanism).
Generalizability:
- LLM agents in simulation do not fully represent human behavior or organization-level actors.
- No real monetary or legal incentives: payoffs are simulated rather than enforced.
- Results may depend on prompt engineering, model checkpoint selection, and training-data artifacts.
- Limited set of models and game types may not capture broader strategic environments.
- Equilibrium results in toy games may not scale to complex, real-world multi-stage interactions.

Claims (8)

Claim	Category	Direction	Confidence	Outcome	Details
Recent works report that LLMs with stronger reasoning capabilities behave less cooperatively in mixed-motive games such as the prisoner's dilemma and public goods settings.	Decision Quality	negative	high	cooperative behavior in mixed-motive games (e.g., prisoner's dilemma, public goods)	0.48
Our experiments show that recent models -- with or without reasoning enabled -- consistently defect in single-shot social dilemmas.	Decision Quality	negative	high	rate of defection (vs cooperation) in single-shot social dilemmas	0.48
This paper presents the first comparative study of game-theoretic mechanisms designed to enable cooperative outcomes between rational agents in equilibrium.	Other	positive	medium	existence of a comparative study of equilibrium-enabling mechanisms	0.05
We evaluate four mechanisms to enable cooperation: (1) repeating the game for many rounds, (2) reputation systems, (3) third-party mediators to delegate decision making to, and (4) contract agreements for outcome-conditional payments between players.	Adoption Rate	neutral	high	comparative effectiveness of four cooperation mechanisms	0.24
Contracting and mediation are most effective in achieving cooperative outcomes between capable LLM models.	Decision Quality	positive	high	effectiveness of mechanisms at producing cooperative outcomes	0.48
Repetition-induced cooperation deteriorates drastically when co-players vary.	Decision Quality	negative	high	cooperation level under repeated interactions when co-players vary	0.48
These cooperation mechanisms become more effective under evolutionary pressures to maximize individual payoffs.	Decision Quality	positive	high	mechanism effectiveness (cooperation outcomes) under evolutionary pressure	0.48
Stronger reasoning capabilities do not prevent LLMs from defecting in single-shot social dilemmas (i.e., models defect with or without reasoning enabled).	Decision Quality	negative	high	cooperation/defection rates conditional on reasoning capability being enabled	0.48