The Commonplace

Language models negotiate more reliably than humans in a simulated multi-player game and accept offers more frequently, and simple prompt-based behavior tweaks lift agent win rates from 22% to 33%.

Cooperate to Compete: Strategic Coordination in Multi-Agent Conquest
Abigail O'Neill, Alan Zhu, Mihran Miroyan, Narges Norouzi, Joseph E. Gonzalez · April 28, 2026
arXiv · descriptive · medium evidence · 7/10 relevance
In a new multi-party negotiation environment, LM-based agents form more reliable, higher-complexity deals and accept offers more often than human players, and behaviorally-targeted prompting raised agent win rates from 22.2% to 32.7%.

Language Model (LM)-based agents remain largely untested in mixed-motive settings where agents must leverage short-term cooperation for long-term competitive goals (e.g., multi-party politics). We introduce Cooperate to Compete (C2C), a multi-agent environment where players can engage in private negotiations while competing to be the first to achieve their secret objective. Players have asymmetric objectives and negotiations are non-binding, allowing alliances to form and break as players' short-term interests align and diverge. We run AI-only games and conduct a user study pitting human players against AI opponents. We identify significant differences between human and AI negotiation behaviors, finding that humans favor lower-complexity deals and are significantly less reliable partners compared to LM-based agents. We also find that humans are more aggressive negotiators, accepting deals without a counteroffer only 56.3% of the time compared to 67.6% for LM-based agents. Through targeted prompting inspired by these findings, we modify agents' negotiation behavior and improve win rates from 22.2% to 32.7%. We run over 1,100 games with over 16,000 private conversations totaling 15.2 million tokens and over 150,000 player actions. Our results establish C2C as a testbed for studying and building LM-based agents that can navigate the sophisticated coordination required for real-world deployments. The game, code, and dataset may be found at https://negotiationgame.io/c2c.

Summary

Main Finding

The paper introduces C2C (Cooperate to Compete), a long-horizon mixed-motive multi-agent environment that tests how language-model (LM) agents negotiate and form transient alliances under partial information. Human players and frontier LMs (e.g., Gemini 3.1 Pro) perform comparably on win rate, but humans systematically differ from LM-based agents in negotiation style: humans make simpler deals, negotiate with more opponents, are more willing to counteroffer, and are less likely to promise support. Targeted prompt interventions substantially improve LM agent win rates (from 22.2% to 32.7%), showing that negotiation behavior is both malleable and important for competitive success.

Key Points

  • Environment design

    • C2C is a 4-player territorial conquest game inspired by Risk with fog-of-war, secret asymmetric win objectives (control two non-adjacent regions), chokepoints that drive negotiations, a tangible Support action (transfer troops), and private natural-language negotiation channels. Agreements are non-binding.
    • The game emphasizes evolving relationships and long-horizon coordination rather than pure spatial reasoning.
  • Experimental scope and scale

    • >1,100 games, >150k player actions, >16k private conversations totaling 15.2M tokens.

    • Human user study: 82 games (one human + three AI opponents), plus matched AI-only games on the same starting positions and larger intervention experiments on 162 starting positions.
    • Agent pool: Gemini 3.1 Pro & Flash Lite, Grok 4.1 (reasoning & non-reasoning), GPT-5.2, GPT-4.1 Mini.
  • Performance

    • Humans beat the reference LM agents: win rates 41.5% (humans) vs 22.0% (reference agents), and are statistically indistinguishable from Gemini 3.1 Pro (44.6%).
    • Coordination matters: disabling negotiation drops the reference-agent win rate from 22.2% → 12.3%; restricting agents to one negotiation partner drops it to 16.7%.
  • Behavioral differences (humans vs LM agents)

    • Deal closure: humans close deals less often (≈83.5%) vs reference agents (≈94.0%) and top LM (≈96.0%).
    • Direct accept (no counteroffer): humans 56.3% vs reference agents 67.6% and Gemini 79.8% — humans make more counteroffers.
    • Deal complexity / support promises: humans promise support far less (0.063 support promises per closed deal) than reference agents (0.382) and Gemini (0.519). Total agreements per deal: humans 1.52 vs ref 2.25 vs Gemini 1.97.
    • Relationship patterns: humans negotiate with more distinct opponents (≈1.94) than reference agents (≈1.60); humans also more cleanly separate negotiation partners from attack targets (the negotiation–attack separation metric is lower for humans and Gemini than for the reference pool).
    • Reliability and deception: the paper reports LM agents engage in deception at non-trivial rates and finds humans are, on average, less reliable partners (follow-through on promises is lower for humans than for LMs).
  • Behavior shaping via prompts

    • Three prompt-based interventions (encouraging more aggressive negotiation, soliciting more support, and encouraging deceptive strategies) all increased LM win rates vs the reference agent baseline:
      • Baseline reference-agents win rate ~22.2%.
      • "More aggressive negotiation" and "obtain more support" prompts increased win rate to ≈30.9%.
      • Adding deceptive prompting increased win rate further to ≈32.7%.
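
The win-rate figures quoted above can be restated as absolute changes (percentage points) and relative changes against the 22.2% reference baseline. The following is a minimal arithmetic sketch, not code from the paper; the function name `lift` and the condition labels are illustrative assumptions, and only the percentages come from the summary.

```python
# Win rates (percent) as quoted in the summary above.
BASELINE = 22.2

# Labels are illustrative; the summary does not name the conditions this way.
conditions = {
    "negotiation disabled": 12.3,
    "single negotiation partner": 16.7,
    "aggressive / support prompts": 30.9,
    "with deceptive prompting added": 32.7,
}


def lift(baseline, rate):
    """Return (absolute change in percentage points, relative change in percent)."""
    absolute = rate - baseline
    relative = 100.0 * absolute / baseline
    return round(absolute, 1), round(relative, 1)


for name, rate in conditions.items():
    abs_pp, rel_pct = lift(BASELINE, rate)
    print(f"{name}: {abs_pp:+.1f} pp ({rel_pct:+.1f}% relative)")
```

Read this way, the 32.7% condition is a gain of 10.5 percentage points, or roughly 47% relative to the baseline, while disabling negotiation costs 9.9 points, or roughly 45% of the baseline win rate.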

Data & Methods

  • Environment mechanics

    • 12 territories partitioned into 4 regions and two chokepoints; reinforcement, attack (dice-based combat), support (transfer), transport, and negotiation actions; fog-of-war restricts visibility to owned/bordering territories.
    • Negotiations are private, limited to 8 messages per exchange, and can include lies or withheld information.
  • Experimental design

    • Human study: 40 recruited participants played 82 games (1–6 games each), blind to opponent model identities; provided rules but no strategy coaching.
    • Matched AI-only games reused human starting positions to enable paired comparisons.
    • Intervention experiments used an expanded set of starting positions (162).
    • Behavioral metrics extracted from game logs and negotiation transcripts: win rate, deal close rate, direct-accept rate, counts/types of agreements per deal (support promises, non-aggression, intel sharing), follow-through on promises, deception indicators, unique negotiation targets, and separation between negotiation and attack targets.
    • Statistical tests: paired two-sample tests (e.g., Wilcoxon, McNemar) on matched starting positions.
  • Agents & prompting

    • LM agents run in a prompt-driven agentic framework. The paper evaluates heterogeneous backbones (Gemini, Grok, GPT families) to cover capability spectrum.
    • Interventions implemented via targeted prompt modifications (e.g., instructing the model to be more aggressive, request more support, or be deceptive).
  • Data release

    • Authors plan to release code and AI-only game data to support reproducibility and further research.
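
To illustrate how a paired test on matched starting positions works, here is a minimal sketch of an exact two-sided McNemar test on binary win/loss outcomes. It is not the authors' analysis code: the function names and the example counts are hypothetical, and the summary does not say which variant of the test the paper uses.

```python
from math import comb


def discordant_counts(pairs):
    """Count pairs where exactly one of the two conditions won.

    Each pair is (won_under_a, won_under_b) for one matched starting position.
    """
    b = sum(1 for a_win, b_win in pairs if a_win and not b_win)
    c = sum(1 for a_win, b_win in pairs if b_win and not a_win)
    return b, c


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant counts.

    Under the null hypothesis, each discordant pair is equally likely to
    favor either condition, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical outcomes for 27 matched positions (made-up data, not from the paper).
pairs = (
    [(True, False)] * 3      # only condition A won
    + [(False, True)] * 9    # only condition B won
    + [(True, True)] * 5     # both won
    + [(False, False)] * 10  # neither won
)

b, c = discordant_counts(pairs)
print(f"discordant pairs: {b} vs {c}, p = {mcnemar_exact(b, c):.4f}")
```

Only the discordant positions carry information in this test; positions where both conditions won or both lost do not affect the p-value, which is why matching on starting positions matters.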

Implications for AI Economics

  • Modeling bargaining and coalitions with LMs

    • C2C reveals that LM agents’ negotiation tendencies (higher propensity to accept and to promise support) materially affect market-like outcomes in multi-agent bargaining games. Economic models of bargaining that assume rational actors should be updated to account for LM-specific heuristics (e.g., instruction-tuned helpfulness, propensity to accept offers).
  • Mechanism design and platform policy

    • Platforms where automated agents negotiate or form coalitions (ad auctions, supply-chain bargaining, algorithmic intermediaries) must anticipate LM-driven alliance dynamics, including easier formation of tacit cooperation, higher promise rates, and susceptibility to prompt-driven behavioral shifts. Mechanism design may need to harden markets against collusion-like outcomes that arise from cheap talk plus instruction-tuned LMs.
  • Reputation, enforcement, and contracts

    • Because C2C shows agreements are non-binding and reputation matters, real-world deployments should integrate verifiable commitments or credible enforcement to mitigate strategic misrepresentation and manipulation. Market mechanisms that rely on reputation signals must account for systematic differences in follow-through between human and automated agents.
  • Policy and safety

    • The fact that targeted prompts can increase deception and win rates highlights an alignment risk: operators can tune agents toward manipulative strategies that harm competitive fairness. Regulators and platform designers may need to audit and constrain negotiation behavior in deployed agents, particularly where deception could cause consumer or systemic harm.
  • Empirical economics of AI-mediated negotiation

    • C2C provides a controlled lab for quantifying how LM features (model family, prompting) change bargaining outcomes, offering a data-generating process for empirical work on algorithmic bargaining, entry of automated intermediaries in markets, and welfare analyses across different agent populations.
  • Training and market strategy

    • Firms deploying LM negotiators can improve performance by designing prompts and training objectives that target specific bargaining behaviors (aggression, support extraction, controlled deception). However, optimizing for win rate may produce socially undesirable equilibria (inefficient trust, increased manipulation), so strategic design must consider broader welfare trade-offs.

Overall, C2C offers a compact, reproducible testbed for studying the economic consequences of LM-driven negotiation and coalition formation and demonstrates that agent design (model choice + prompting) materially changes strategic outcomes—insights directly relevant to mechanism design, market regulation, and the deployment of bargaining agents.

Assessment

Paper Type: descriptive

Evidence Strength: medium. Large-scale dataset (1,100+ games, ~16k private conversations, 150k actions) and direct measurement of negotiation behavior give credible descriptive evidence and support intervention effects, but the environment is an artificial lab game with limited external validity, participant sampling details are not reported here, and causal identification is not as strong as a fully randomized field experiment.

Methods Rigor: medium. The study uses a purpose-built testbed, extensive logged interactions, and targeted prompting as an intervention, indicating careful experimental design and sizable data; however, methods appear to lack or do not report full randomization details, demographic controls, robustness checks, or pre-registered hypotheses in the provided summary, limiting rigor compared with top-tier causal inference standards.

Sample: Over 1,100 games comprising more than 16,000 private conversations (15.2 million tokens) and 150,000 player actions; includes AI-only matches and human-vs-AI matches where humans negotiated against LM-based agents; details on human participant recruitment, demographics, compensation, and the exact LM model(s)/versions used are not specified in the summary.

Themes: human_ai_collab, governance

Identification: Comparative experimental design within a custom multi-agent game: behaviors and outcomes are compared across AI-only games, human-vs-AI games, and before/after targeted prompting interventions for agents; causal claims about prompt effects rely on controlled intervention in gameplay, but there is no formal randomized controlled trial or instrumental-variable strategy reported.

Generalizability:

  • Artificial game environment with simplified rules and payoffs may not map to real-world political or market negotiations.
  • Human sample composition and recruitment method not reported, raising selection-bias concerns.
  • Results depend on specific LM models and prompt designs; different models or prompts may behave differently.
  • Negotiation limited to text-based, non-binding deals and a particular multi-player structure; findings may not generalize to high-stakes, repeated, or institutional settings.
  • Cultural, linguistic, and domain-specific factors (e.g., legal or monetary stakes) are not represented.

Claims (8)

Claim | Direction | Confidence | Outcome | Details
We introduce Cooperate to Compete (C2C), a multi-agent environment where players can engage in private negotiations while competing to be the first to achieve their secret objective. | null_result | high | Other: environmental features (private negotiations, secret objectives) | 0.3
Players have asymmetric objectives and negotiations are non-binding, allowing alliances to form and break as players' short-term interests align and diverge. | null_result | high | Other: game mechanic: objective asymmetry and non-binding negotiation | 0.3
We run AI-only games and conduct a user study pitting human players against AI opponents. | null_result | high | Other: experimental setup (AI-only games and user study) | n=1100; 0.3
We identify significant differences between human and AI negotiation behaviors, finding that humans favor lower-complexity deals and are significantly less reliable partners compared to LM-based agents. | mixed | high | Team Performance: deal complexity preference and partner reliability in negotiations | 0.18
Humans are more aggressive negotiators, accepting deals without a counteroffer only 56.3% of the time compared to 67.6% for LM-based agents. | negative | high | Decision Quality: rate of accepting deals without a counteroffer | 56.3% vs 67.6%; 0.18
Through targeted prompting inspired by these findings, we modify agents' negotiation behavior and improve win rates from 22.2% to 32.7%. | positive | high | Team Performance: agent win rate | from 22.2% to 32.7%; 0.18
We run over 1,100 games with over 16,000 private conversations totaling 15.2 million tokens and over 150,000 player actions. | null_result | high | Other: dataset size metrics (games, conversations, tokens, actions) | n=1100; 0.3
Our results establish C2C as a testbed for studying and building LM-based agents that can navigate the sophisticated coordination required for real-world deployments. | positive | medium | Other: suitability of C2C as a research testbed | 0.02

Notes