Large language models struggle to reach jointly optimal agreements in multi-turn negotiations not for lack of reasoning but because they fail to establish and repair shared meaning; giving them more information or talk does not fully close the coordination gap.
Grounding is the collaborative process of establishing mutual belief sufficient for the current communicative purpose. While static grounding maps language to a shared, externally observable context, dynamic grounding is a joint activity where meaning is negotiated through interaction. Current multi-agent Large Language Model (LLM) benchmarks focus on static, one-shot tasks, overlooking the ability to repair grounding breakdowns across turns. We introduce an iterated, multi-turn negotiation game in which two agents allocate shared resources toward private projects with verifiable jointly optimal outcomes. While individual agents can identify Pareto-optimal allocations in isolation, agent dyads consistently fail to reach them across open- and closed-source models. Our investigation reveals four failure modes: (1) coordination degrades when shared interaction history is absent; (2) yet accumulated context can itself become a liability through stubborn anchoring, where initial proposals are treated as axiomatic rather than negotiable; (3) a reliance on perfunctory fairness (equal resource splits) over reward-maximizing coordination; and (4) failures in referential binding, where agents lose track of commitments across turns. These results highlight dynamic grounding as a critical and understudied axis of multi-agent coordination. Our framework decomposes the coordination gap into measurable components: the oracle baseline establishes that the gap is not attributable to individual reasoning limitations; the no-talk baseline establishes that communication is necessary; and a full-transparency intervention establishes that information exchange alone is insufficient: the bottleneck lies in the interactive processes of joint plan formation, commitment, and execution that constitute dynamic grounding.
Summary
Main Finding
Multi-turn, iterated negotiation reveals a large coordination gap for LLM-based agents that cannot be explained by single-agent reasoning limits or mere information asymmetry. Cheap talk (multi-turn natural-language exchange) is necessary and substantially improves outcomes, but it is not sufficient: the primary bottleneck is dynamic grounding — the interactive processes of joint plan formation, commitment, repair, and referential tracking. The authors identify four behavioral failure modes (lack of shared history, stubborn anchoring, perfunctory fairness, and referential binding failures) that systematically prevent dyads of models from reaching verifiable Pareto-optimal allocations.
Key Points
- Task: An iterated, multi-turn negotiation game where two agents share a limited resource pool and allocate resources to privately held combinatorial projects. Overdraw (combined demand > supply) annuls the round (zero reward).
- Baselines:
  - Oracle baseline: single agents with full information can solve the scenarios, which rules out individual reasoning limits.
  - No-talk baseline: shows that communication is necessary.
  - Full-transparency intervention: shows that information exchange alone is insufficient (the problem is the interactive grounding process).
- Four failure modes:
  - Lack of shared interaction history: resetting an agent’s context degrades coordination (stable partners often do better), but not uniformly; history can also entrench bad decisions.
  - Stubborn anchoring: initial proposals are treated as immutable, blocking renegotiation even when better allocations exist.
  - Perfunctory fairness: agents default to equal splits or fairness heuristics over reward-maximizing coordination.
  - Referential binding failures: agents fail to track or honor previously stated commitments and project references across turns and rounds.
- Cheap talk is powerful but insufficient:
  - Under competitive scenarios (compatibility ratio M/C = 0.5), cheap talk increases joint efficiency from ~18.0% (no-talk) to ~63.2%.
  - Overdraw rates fall dramatically with communication; optimum rates also rise substantially.
  - Cross-play (heterogeneous model pairings) often outperforms self-play under competition, suggesting behavioral diversity can aid coordination; under lower conflict, stronger models’ self-play can dominate.
- Aggregate outcomes:
  - Overall overdraw rate: 15.7%, ranging from 24.7% in competitive settings (M/C = 0.5) to 6.9% at M/C = 1.0.
  - 81.7% of games contained at least one jointly optimal round; stable dyads reached their first optimum after ~1.6 rounds on average.
  - Models show different propensities (e.g., Sonnet 4.5 tends to engage in cheap talk for longer, while Qwen 3.5 is more prone to repeating prior allocations).
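The cheap-talk figures above can be turned into a rough sizing of the coordination gap. The sketch below uses only the two efficiencies reported for M/C = 0.5 and assumes the oracle optimum corresponds to an efficiency of 1.0, since efficiency is defined relative to it. The function name `decompose_gap` and the dictionary keys are illustrative, not from the paper. The arithmetic only sizes the remaining gap; the attribution of that gap to dynamic grounding comes from the paper's baselines, not from this calculation.

```python
# Efficiency figures reported in the summary for competitive scenarios (M/C = 0.5).
NO_TALK_EFFICIENCY = 0.180     # no-talk baseline
CHEAP_TALK_EFFICIENCY = 0.632  # with multi-turn cheap talk


def decompose_gap(no_talk: float, talk: float, oracle: float = 1.0) -> dict[str, float]:
    """Split the distance to the oracle optimum into the part closed by
    communication and the part that remains once agents can talk.

    `oracle` defaults to 1.0 on the assumption that efficiency is
    normalized by the oracle optimum.
    """
    return {
        "total_gap_without_talk": oracle - no_talk,
        "closed_by_cheap_talk": talk - no_talk,
        "residual_gap_with_talk": oracle - talk,
    }


gap = decompose_gap(NO_TALK_EFFICIENCY, CHEAP_TALK_EFFICIENCY)
for name, value in gap.items():
    # Differences of percentages are reported in percentage points.
    print(f"{name}: {value * 100:.1f} percentage points")
```

On these inputs, cheap talk closes about 45.2 of the 82.0 percentage points separating the no-talk baseline from the optimum, leaving roughly 36.8 points unexplained by communication alone.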
Data & Methods
- Environment:
  - Shared multi-resource pool (e.g., three resource types) with fixed per-round supply and unit costs.
  - Each agent has private combinatorial projects (resource recipes → reward per run). Unspent budget has zero utility.
  - Agents exchange up to 5 alternating natural-language “cheap talk” messages each round, maintain private scratchpads, then simultaneously submit structured JSON purchase decisions.
  - If demand for any resource exceeds supply, the round is annulled (no rewards).
  - Games are iterated for 4 rounds; partners may be stable or shifting; projects may rotate or be fixed.
- Scenario generation:
  - Scenarios are parameterized by the compatibility ratio M/C (M = joint optimum; C = sum of individual maxima) at {0.5, 0.8, 1.0}.
  - Scenarios are generated via simulated annealing to meet constraints (equal individual maxima, swap fairness, multiple joint optima, individual affordability).
  - Oracle validation ensures single-agent solvability under full information.
- Models & experimental design:
  - Models: Anthropic Claude Sonnet 4.5, OpenAI GPT-5 Mini, Qwen 3.5 Flash (via OpenRouter).
  - Conditions: 3 (M/C) × 2 (partner stability) × 2 (project rotation) = 12 cells.
  - For each cell: N = 10 games, all-to-all cross-play (6 pairings), swapped first-speaker roles, for a total of 720 game traces (12 cells × 6 pairings × 10 games) and 2,880 individual decision rounds (4 rounds per game).
  - All transcripts, thinking logs, allocations, rewards, and metadata are released for reproducibility.
- Metrics:
  - Outcome metrics: overdraw rate, allocation efficiency (achieved joint reward / oracle optimum), optimum rate (fraction of rounds reaching the joint optimum).
  - Process metrics: utterance taxonomy (proposal, fairness appeal, threat, payoff alteration, win-stay/lose-shift), first-proposal deference, allocation anchoring (round-to-round allocation similarity), stated-vs-actual coherence (commitment tracking).
- Key quantitative findings (highlights):
  - Cheap talk effect (aggregated): at M/C = 0.5, joint efficiency rises from ~18.0% (no-talk) to ~63.2% (cheap talk).
  - Overdraw is concentrated in competitive scenarios; cheap talk reduces overdraw rates dramatically.
  - Cross-play Sonnet 4.5 × GPT-5 Mini at M/C = 0.5 achieved 79.8% efficiency vs. 68.2% for GPT-5 Mini self-play and 63.7% for Sonnet 4.5 self-play.
  - Stable vs. shifting partners: sharing interaction history generally helps (e.g., GPT-5 Mini and Qwen 3.5 show meaningful gains), but it can also lock agents into suboptimal anchored allocations.
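The round mechanics and outcome metrics described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Project` shape, the function names, and the resource names `x` and `y` are hypothetical, budgets and unit costs are omitted, and a decision is represented as a number of runs per project rather than the JSON purchase format the paper uses.

```python
from dataclasses import dataclass


@dataclass
class Project:
    recipe: dict[str, int]  # units of each resource consumed per run (assumed shape)
    reward: float           # reward earned per completed run


def demand(projects: list[Project], runs: list[int]) -> dict[str, int]:
    """Total units of each resource needed to run project i `runs[i]` times."""
    total: dict[str, int] = {}
    for project, n in zip(projects, runs):
        for resource, units in project.recipe.items():
            total[resource] = total.get(resource, 0) + units * n
    return total


def payoff(projects: list[Project], runs: list[int]) -> float:
    """Reward an agent earns if the round is not annulled."""
    return sum(p.reward * n for p, n in zip(projects, runs))


def resolve_round(supply, projects_a, runs_a, projects_b, runs_b) -> tuple[float, float]:
    """Apply the overdraw rule: if combined demand for any resource exceeds
    supply, the round is annulled and both agents receive zero."""
    need_a = demand(projects_a, runs_a)
    need_b = demand(projects_b, runs_b)
    for resource in set(need_a) | set(need_b):
        if need_a.get(resource, 0) + need_b.get(resource, 0) > supply.get(resource, 0):
            return 0.0, 0.0
    return payoff(projects_a, runs_a), payoff(projects_b, runs_b)


def allocation_efficiency(joint_reward: float, oracle_optimum: float) -> float:
    """Achieved joint reward divided by the oracle optimum."""
    return joint_reward / oracle_optimum


def compatibility_ratio(joint_optimum: float, individual_maxima: list[float]) -> float:
    """M/C, where M is the joint optimum and C the sum of individual maxima."""
    return joint_optimum / sum(individual_maxima)


# Hypothetical two-resource scenario; all values are made up for illustration.
supply = {"x": 4, "y": 4}
agent_a = [Project({"x": 2, "y": 1}, reward=10.0)]
agent_b = [Project({"x": 1, "y": 2}, reward=10.0)]

compatible = resolve_round(supply, agent_a, [1], agent_b, [1])  # demand x=3, y=3
overdrawn = resolve_round(supply, agent_a, [2], agent_b, [1])   # demand x=5 > 4
print(compatible, overdrawn)
```

In this toy scenario each agent could earn 20 alone but the pair can earn at most 20 together, so `compatibility_ratio(20.0, [20.0, 20.0])` gives 0.5, the competitive setting in the paper's design.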
Implications for AI Economics
- Dynamic grounding is a fundamental axis missing from many economic/market simulations with LLM agents:
  - Economic coordination tasks (e.g., bilateral bargaining, supply allocation, decentralized resource planning) commonly require incremental mutual understanding, commitment formation, and repair. Benchmarks and agent training should evaluate and optimize for these processes, not just one-shot outcome metrics.
- Mechanism and market design:
  - Protocols that only provide information transparency (full disclosure) may not fix coordination failures; designers must address the interactive commitment and signaling processes (e.g., explicit commitment channels, verifiable promises, penalty structures for reneging).
  - Behavioral diversity can improve coordination under conflict: markets or platforms with heterogeneous agent strategies might achieve better matchings than homogeneous populations. Conversely, under low conflict, weaker agents can drag down outcomes; platform designers should consider matching heuristics or adaptive role assignment.
- Repeated interaction and institutions:
  - Repetition and shared history typically improve coordination (supporting the formation of ad-hoc conventions), but they can also produce harmful anchoring. Institutional design (e.g., structured renegotiation steps, mandatory resets, or explicit re-evaluation prompts) can mitigate anchoring.
- Agent architecture and training:
  - To operate effectively in economic environments, agents need mechanisms for:
    - Robust referential binding and stateful commitment tracking (memory/forgetting management, explicit references to prior promises).
    - Grounding repair strategies (asking clarifying questions, confirming receipt of proposed plans, offering alternative proposals).
    - Avoiding perfunctory fairness heuristics when coordination for surplus maximization is feasible (e.g., evaluating and proposing payoff splits that reflect marginal contributions).
  - Collecting and training on iterated negotiation traces (process-level supervision) is likely necessary; outcome supervision alone is insufficient.
- Evaluation recommendations for AI economics research:
  - Include iterated, multi-turn negotiation benchmarks with private, combinatorial preferences and verifiable joint optima.
  - Report both outcome and process metrics (e.g., anchoring rates, commitment coherence, proposal dynamics) to diagnose coordination failure modes.
  - Use baselines (oracle, no-talk, transparency) to decompose whether failures arise from reasoning, information asymmetry, or interactive grounding.
- Policy and deployment considerations:
  - Autonomous economic agents deployed in real markets or organizational settings should be audited for grounding robustness: otherwise, they may systematically underperform, lock into unfair splits, or violate commitments.
  - Designing agent-mediated markets should include safeguards (e.g., explicit commitment logging, human-in-the-loop renegotiation triggers) to prevent costly overdraws or coordination breakdowns.
Suggested near-term interventions (from the paper and implications):
- Structured interaction scaffolds (templates for proposals, explicit confirmation prompts).
- Memory/commitment primitives that force explicit references back to past proposals when submitting allocations.
- Training on offline game traces emphasizing repair and reference-tracking.
- Protocol-level choice points (mandatory renegotiation after an overdraw; limited anchoring by prohibiting exact repetition of prior allocations unless reconfirmed).
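Two of the interventions above, a commitment primitive that forces submissions to cite a past proposal and a rule against silently repeating the previous allocation, could be prototyped roughly as follows. This is a hypothetical sketch: the class `CommitmentLog`, its methods, and the resource names are invented for illustration and do not describe any mechanism implemented in the paper.

```python
from dataclasses import dataclass, field
from typing import Optional

Allocation = dict[str, int]  # resource name -> units requested (assumed shape)


@dataclass
class CommitmentLog:
    """Records proposals made during cheap talk and checks submissions against them."""
    proposals: dict[int, Allocation] = field(default_factory=dict)
    last_submitted: Optional[Allocation] = None
    _next_id: int = 1

    def propose(self, allocation: Allocation) -> int:
        """Log a proposal and return an id that later messages can cite."""
        proposal_id = self._next_id
        self.proposals[proposal_id] = dict(allocation)
        self._next_id += 1
        return proposal_id

    def submit(self, allocation: Allocation, cites: int,
               reconfirmed: bool = False) -> Allocation:
        """Accept a submission only if it matches the cited proposal and does
        not repeat the previous submission without explicit reconfirmation."""
        if cites not in self.proposals:
            raise ValueError("submission must cite a logged proposal")
        if self.proposals[cites] != allocation:
            raise ValueError("submitted allocation differs from the cited proposal")
        if allocation == self.last_submitted and not reconfirmed:
            raise ValueError("repeating the previous allocation requires reconfirmation")
        self.last_submitted = dict(allocation)
        return self.last_submitted


# Hypothetical usage: one proposal, submitted in two consecutive rounds.
log = CommitmentLog()
pid = log.propose({"wood": 2, "iron": 1})
first = log.submit({"wood": 2, "iron": 1}, cites=pid)
second = log.submit({"wood": 2, "iron": 1}, cites=pid, reconfirmed=True)
```

The citation check targets referential binding failures, while the reconfirmation flag is one possible way to interrupt anchoring without forbidding agents from keeping an allocation that still works.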
Overall, the work argues that economic models and multi-agent systems must treat grounding as an operational object of design and evaluation — not an incidental property — if automated agents are to coordinate efficiently in realistic, repeated economic settings.
Assessment
Claims (12)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| Grounding is the collaborative process of establishing mutual belief sufficient for the current communicative purpose. [Other] | positive | high | definition of grounding | 0.03 |
| Current multi-agent LLM benchmarks focus on static, one-shot tasks, overlooking the ability to repair grounding breakdowns across turns. [Research Productivity] | negative | high | coverage of dynamic grounding in benchmarks | 0.18 |
| We introduce an iterated, multi-turn negotiation game in which two agents allocate shared resources toward private projects with verifiable jointly optimal outcomes. [Task Allocation] | positive | high | existence of a multi-turn negotiation benchmark with verifiable optimal outcomes | 0.18 |
| While individual agents can identify Pareto-optimal allocations in isolation, agent dyads consistently fail to reach them across open- and closed-source models. [Task Allocation] | negative | high | achievement of Pareto-optimal allocations in dyadic negotiation | 0.18 |
| Coordination degrades when shared interaction history is absent. [Task Allocation] | negative | high | coordination performance as a function of shared interaction history | 0.18 |
| Accumulated context can itself become a liability through stubborn anchoring, where initial proposals are treated as axiomatic rather than negotiable. [Task Allocation] | negative | high | propensity to revise initial proposals / anchoring behavior | 0.18 |
| Agents rely on perfunctory fairness (equal resource splits) over reward-maximizing coordination. [Task Allocation] | negative | high | allocation strategy preference (equal split vs reward-maximizing) | 0.18 |
| Failures in referential binding occur, where agents lose track of commitments across turns. [Task Allocation] | negative | high | referential binding / tracking of commitments across turns | 0.18 |
| These results highlight dynamic grounding as a critical and understudied axis of multi-agent coordination. [Research Productivity] | positive | high | importance of dynamic grounding for multi-agent coordination | 0.03 |
| The oracle baseline establishes that the coordination gap is not attributable to individual reasoning limitations. [Task Allocation] | negative | high | attribution of coordination gap to individual reasoning limitations | 0.18 |
| The no-talk baseline establishes that communication is necessary. [Task Allocation] | positive | high | coordination performance with vs without communication | 0.18 |
| A full-transparency intervention establishes that information exchange alone is insufficient: the bottleneck lies in the interactive processes of joint plan formation, commitment, and execution that constitute dynamic grounding. [Task Allocation] | negative | high | coordination performance under full information transparency | 0.18 |