Talk is Cheap, Communication is Hard: Dynamic Grounding Failures and Repair in Multi-Agent Negotiation

Grounding is the collaborative process of establishing mutual belief sufficient for the current communicative purpose. While static grounding maps language to a shared, externally observable context, dynamic grounding is a joint activity where meaning is negotiated through interaction. Current multi-agent Large Language Model (LLM) benchmarks focus on static, one-shot tasks, overlooking the ability to repair grounding breakdowns across turns. We introduce an iterated, multi-turn negotiation game in which two agents allocate shared resources toward private projects with verifiable jointly optimal outcomes. While individual agents can identify Pareto-optimal allocations in isolation, agent dyads consistently fail to reach them across open- and closed-source models. Our investigation reveals four failure modes: (1) coordination degrades when shared interaction history is absent; (2) yet accumulated context can itself become a liability through stubborn anchoring, where initial proposals are treated as axiomatic rather than negotiable; (3) a reliance on perfunctory fairness (equal resource splits) over reward-maximizing coordination; and (4) failures in referential binding, where agents lose track of commitments across turns. These results highlight dynamic grounding as a critical and understudied axis of multi-agent coordination. Our framework decomposes the coordination gap into measurable components: the oracle baseline establishes that the gap is not attributable to individual reasoning limitations; the no-talk baseline establishes that communication is necessary; and a full-transparency intervention establishes that information exchange alone is insufficient: the bottleneck lies in the interactive processes of joint plan formation, commitment, and execution that constitute dynamic grounding.

Summary

Main Finding

Multi-turn, iterated negotiation reveals a large coordination gap for LLM-based agents that cannot be explained by single-agent reasoning limits or mere information asymmetry. Cheap talk (multi-turn natural-language exchange) is necessary and substantially improves outcomes, but it is not sufficient: the primary bottleneck is dynamic grounding — the interactive processes of joint plan formation, commitment, repair, and referential tracking. The authors identify four behavioral failure modes (lack of shared history, stubborn anchoring, perfunctory fairness, and referential binding failures) that systematically prevent dyads of models from reaching verifiable Pareto-optimal allocations.

Key Points

Task: An iterated, multi-turn negotiation game where two agents share a limited resource pool and allocate resources to privately held combinatorial projects. Overdraw (combined demand > supply) annuls the round (zero reward).
Baselines:
- Oracle baseline: single agents with full information can solve scenarios (rules out individual reasoning limits).
- No-talk baseline: shows communication is necessary.
- Full-transparency intervention: shows information exchange alone is insufficient (the problem is the interactive grounding process).
Four failure modes:
- Lack of shared interaction history: resetting an agent’s context degrades coordination (stable partners often do better), but not uniformly, since history can also entrench bad decisions.
- Stubborn anchoring: initial proposals become treated as immutable, blocking renegotiation even when better allocations exist.
- Perfunctory fairness: agents default to equal splits or fairness heuristics over reward-maximizing coordination.
- Referential binding failures: agents fail to track or honor previously stated commitments and project references across turns/rounds.
Cheap talk is powerful but insufficient:
- Under competitive scenarios (compatibility ratio M/C = 0.5), cheap talk increases joint efficiency from ~18.0% (no-talk) to ~63.2%.
- Overdraw rates fall dramatically with communication; optimum rates also rise substantially.
Cross-play (heterogeneous model pairings) often outperforms self-play under competition, suggesting behavioral diversity can aid coordination; under lower conflict, stronger models’ self-play can dominate.
Aggregate outcomes:
- Overall overdraw rate: 15.7% (higher in competitive settings: 24.7% at M/C=0.5; low at M/C=1.0: 6.9%).
- 81.7% of games contained at least one jointly optimal round; stable dyads reached their first optimum after ~1.6 rounds on average.
- Models differ in their propensities (e.g., Sonnet 4.5 tends to engage longer in cheap talk, while Qwen 3.5 is more prone to repeating prior allocations).

Data & Methods

Environment:
- Shared multi-resource pool (e.g., three resource types), fixed per-round supply and unit costs.
- Each agent has private combinatorial projects (resource recipes → reward per run). Unspent budget has zero utility.
- Agents exchange up to 5 alternating natural-language “cheap talk” messages each round, maintain private scratchpads, then simultaneously submit structured JSON purchase decisions.
- If any resource demand exceeds supply, the round is annulled (no rewards).
- Games are iterated for 4 rounds; partners may be stable or shifting; projects may rotate or be fixed.
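The round-resolution rule above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Project` class, the `resolve_round` function, and the resource names are hypothetical, and unit costs and budgets are left out for brevity.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Project:
    """Hypothetical project: a resource recipe and a reward per run."""
    recipe: Dict[str, int]  # units of each resource consumed per run
    reward: int             # reward earned per run


def demand_and_reward(projects: Dict[str, Project],
                      runs: Dict[str, int]) -> Tuple[Dict[str, int], int]:
    """Translate a decision (runs per project) into resource demand and reward."""
    demand: Dict[str, int] = {}
    reward = 0
    for name, n in runs.items():
        project = projects[name]
        reward += n * project.reward
        for resource, units in project.recipe.items():
            demand[resource] = demand.get(resource, 0) + n * units
    return demand, reward


def resolve_round(supply: Dict[str, int],
                  projects_a: Dict[str, Project], runs_a: Dict[str, int],
                  projects_b: Dict[str, Project], runs_b: Dict[str, int],
                  ) -> Tuple[int, int, bool]:
    """Return (reward_a, reward_b, overdraw) for simultaneous decisions."""
    demand_a, reward_a = demand_and_reward(projects_a, runs_a)
    demand_b, reward_b = demand_and_reward(projects_b, runs_b)
    for resource in set(demand_a) | set(demand_b):
        combined = demand_a.get(resource, 0) + demand_b.get(resource, 0)
        if combined > supply.get(resource, 0):
            return 0, 0, True  # overdraw on any resource annuls the round
    return reward_a, reward_b, False


# Illustrative values only; none of these numbers come from the paper.
supply = {"wood": 4, "iron": 3, "stone": 2}
projects_a = {"hut": Project({"wood": 2, "stone": 1}, reward=5)}
projects_b = {"tool": Project({"wood": 1, "iron": 1}, reward=4)}

ok = resolve_round(supply, projects_a, {"hut": 1}, projects_b, {"tool": 2})
annulled = resolve_round(supply, projects_a, {"hut": 2}, projects_b, {"tool": 2})
```

In the second call the combined demand for wood is 6 against a supply of 4, so both agents receive zero even though each decision is feasible on its own.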
Scenario generation:
- Scenarios parameterized by compatibility ratio M/C (M = joint optimum; C = sum of individual maxima) at {0.5, 0.8, 1.0}.
- Generated via simulated annealing to meet constraints (equal individual maxima, swap fairness, multiple joint optima, individual affordability).
- Oracle validation ensures single-agent solvability under full information.
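To make the compatibility ratio concrete, the sketch below computes M/C by brute force for a tiny scenario. It assumes M is the best total reward achievable under the shared supply and C is the sum of what each agent could earn with the whole pool to itself; the project encoding, the `max_runs` bound, and the function names are hypothetical, and budgets and unit costs are ignored.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple

# Hypothetical encoding: (recipe, reward per run)
Project = Tuple[Dict[str, int], int]


def usage(projects: Sequence[Project], runs: Sequence[int]) -> Dict[str, int]:
    """Total units of each resource consumed by the given run counts."""
    used: Dict[str, int] = {}
    for (recipe, _), n in zip(projects, runs):
        for resource, units in recipe.items():
            used[resource] = used.get(resource, 0) + n * units
    return used


def best_reward(supply: Dict[str, int], projects: List[Project],
                max_runs: int = 4) -> int:
    """Exhaustively search run counts 0..max_runs for the best feasible reward."""
    best = 0
    for runs in product(range(max_runs + 1), repeat=len(projects)):
        used = usage(projects, runs)
        if all(q <= supply.get(r, 0) for r, q in used.items()):
            best = max(best, sum(n * rew for (_, rew), n in zip(projects, runs)))
    return best


def compatibility_ratio(supply: Dict[str, int], projects_a: List[Project],
                        projects_b: List[Project], max_runs: int = 4) -> float:
    """M/C: joint optimum over the sum of individual maxima."""
    joint = best_reward(supply, projects_a + projects_b, max_runs)   # M
    separate = (best_reward(supply, projects_a, max_runs)
                + best_reward(supply, projects_b, max_runs))         # C
    if separate == 0:
        raise ValueError("C is zero; the ratio is undefined")
    return joint / separate


# Both agents need the same scarce resource: fully competitive.
competitive = compatibility_ratio({"wood": 4}, [({"wood": 2}, 5)], [({"wood": 2}, 5)])
# Agents need disjoint resources: no conflict.
compatible = compatibility_ratio({"wood": 4, "iron": 4},
                                 [({"wood": 2}, 5)], [({"iron": 2}, 5)])
```

Under these toy inputs the first scenario gives M/C = 0.5 (each agent could earn 10 alone, but only 10 is available jointly) and the second gives M/C = 1.0.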
Models & experimental design:
- Models: Anthropic Claude Sonnet 4.5, OpenAI GPT-5 Mini, Qwen 3.5 Flash (via OpenRouter).
- Conditions: 3 (M/C) × 2 (partner stability) × 2 (project rotation) = 12 cells.
- For each cell: N = 10 games, all-to-all cross-play (6 pairings), swapped first-speaker roles, for a total of 720 game traces and 2,880 individual decision rounds.
- All transcripts, thinking logs, allocations, rewards, and metadata released for reproducibility.
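The reported totals can be reproduced with a short enumeration. This is one reading that is consistent with the numbers above, not a confirmed description of the design: it assumes the 6 pairings are the unordered pairs of the three models including self-play, and that swapping first-speaker roles happens within the 10 games per cell rather than doubling them.

```python
from itertools import combinations_with_replacement, product

MODELS = ["Claude Sonnet 4.5", "GPT-5 Mini", "Qwen 3.5 Flash"]
RATIOS = [0.5, 0.8, 1.0]
PARTNERS = ["stable", "shifting"]
PROJECTS = ["fixed", "rotating"]
GAMES_PER_CELL = 10
ROUNDS_PER_GAME = 4


def design_counts() -> dict:
    """Count cells, pairings, games, and rounds under the stated assumptions."""
    cells = list(product(RATIOS, PARTNERS, PROJECTS))
    # Assumption: unordered model pairs, self-play included (3 self + 3 cross).
    pairings = list(combinations_with_replacement(MODELS, 2))
    games = len(cells) * len(pairings) * GAMES_PER_CELL
    return {
        "cells": len(cells),
        "pairings": len(pairings),
        "games": games,
        "rounds": games * ROUNDS_PER_GAME,
    }


counts = design_counts()
```

Under these assumptions the enumeration yields 12 cells, 6 pairings, 720 games, and 2,880 rounds, matching the figures reported above.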
Metrics:
- Outcome metrics: overdraw rate, allocation efficiency (achieved joint reward / oracle optimum), optimum rate (fraction of rounds reaching the joint optimum).
- Process metrics: utterance taxonomy (proposal, fairness appeal, threat, payoff alteration, win-stay/lose-shift), first-proposal deference, allocation anchoring (round-to-round allocation similarity), stated-vs-actual coherence (commitment tracking).
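The three outcome metrics can be computed from per-round records roughly as follows. This is a hedged sketch: the `RoundResult` record is hypothetical, and averaging per-round efficiency ratios is an assumption, since the summary does not say how the paper aggregates.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RoundResult:
    """Hypothetical per-round record."""
    joint_reward: float    # combined reward actually achieved (0 if annulled)
    oracle_optimum: float  # best joint reward found by the oracle
    overdraw: bool         # True if the round was annulled


def outcome_metrics(rounds: List[RoundResult]) -> Dict[str, float]:
    """Overdraw rate, mean allocation efficiency, and optimum rate."""
    if not rounds:
        raise ValueError("need at least one round")
    n = len(rounds)
    overdraws = sum(r.overdraw for r in rounds)
    # Assumption: efficiency is the mean of per-round reward/optimum ratios.
    efficiency = sum(r.joint_reward / r.oracle_optimum for r in rounds) / n
    optimal = sum((not r.overdraw) and r.joint_reward >= r.oracle_optimum
                  for r in rounds)
    return {
        "overdraw_rate": overdraws / n,
        "efficiency": efficiency,
        "optimum_rate": optimal / n,
    }


# Illustrative data only, not results from the paper.
example = [
    RoundResult(10, 10, False),
    RoundResult(5, 10, False),
    RoundResult(0, 10, True),
    RoundResult(10, 10, False),
]
metrics = outcome_metrics(example)
```

For the four illustrative rounds this gives an overdraw rate of 0.25, an efficiency of 0.625, and an optimum rate of 0.5.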
Key quantitative findings (highlights):
- Cheap talk effect (aggregated): at M/C=0.5 joint efficiency ~18.0% (no-talk) → ~63.2% (cheap talk).
- Overdraw concentrated in competitive scenarios; cheap talk reduces overdraw rates dramatically.
- Cross-play Sonnet 4.5 × GPT-5 Mini at M/C=0.5 achieved 79.8% efficiency vs. GPT-5 Mini self-play 68.2% and Sonnet self-play 63.7%.
- Stable vs shifting: sharing interaction history generally helps (e.g., GPT-5 Mini & Qwen 3.5 show meaningful gains), but it can also lock agents into suboptimal anchored allocations.

Implications for AI Economics

Dynamic grounding is a fundamental axis missing from many economic/market simulations with LLM agents:
- Economic coordination tasks (e.g., bilateral bargaining, supply allocation, decentralized resource planning) commonly require incremental mutual understanding, commitment formation, and repair. Benchmarks and agent training should evaluate and optimize for these processes, not just one-shot outcome metrics.
Mechanism and market design:
- Protocols that only provide information transparency (full disclosure) may not fix coordination failures; designers must address the interactive commitments and signaling processes (e.g., explicit commitment channels, verifiable promises, penalty structures for reneging).
- Behavioral diversity can improve coordination under conflict — markets or platforms with heterogeneous agent strategies might achieve better matchings than homogeneous populations. Conversely, under low conflict, weaker agents can drag down outcomes; platform designers should consider matching heuristics or adaptive role assignment.
Repeated interaction and institutions:
- Repetition and shared history typically improve coordination (support formation of ad-hoc conventions), but they also can produce harmful anchoring. Institutional design (e.g., structured renegotiation steps, mandatory resets, or explicit re-evaluation prompts) can mitigate anchoring.
Agent architecture and training:
- To operate effectively in economic environments, agents need mechanisms for:
  - Robust referential binding and stateful commitment tracking (memory/forgetting management, explicit references to prior promises).
  - Grounding repair strategies (asking clarifying questions, confirming receipt of proposed plans, offering alternative proposals).
  - Avoiding perfunctory fairness heuristics when coordination for surplus maximization is feasible (e.g., evaluate and propose payoff splits that reflect marginal contributions).
- Collecting and training on iterated negotiation traces (process-level supervision) is likely necessary — outcome supervision alone is insufficient.
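One way to picture stateful commitment tracking is a small ledger that records what an agent last said it would buy and compares it with what it actually submits. The `CommitmentLog` class and `coherence_rate` function below are hypothetical illustrations, not a mechanism described in the paper, and the stated-vs-actual coherence metric may be defined differently there.

```python
from dataclasses import dataclass, field
from typing import Dict

Allocation = Dict[str, int]  # resource name -> units


@dataclass
class CommitmentLog:
    """Hypothetical ledger of the last allocation an agent stated per round."""
    stated: Dict[int, Allocation] = field(default_factory=dict)

    def record(self, round_id: int, allocation: Allocation) -> None:
        """Store (or overwrite) the allocation the agent committed to in talk."""
        self.stated[round_id] = dict(allocation)

    def is_coherent(self, round_id: int, submitted: Allocation) -> bool:
        """True if the submission matches the stated commitment, or none exists."""
        promised = self.stated.get(round_id)
        if promised is None:
            return True
        keys = set(promised) | set(submitted)
        return all(promised.get(k, 0) == submitted.get(k, 0) for k in keys)


def coherence_rate(log: CommitmentLog, submissions: Dict[int, Allocation]) -> float:
    """Fraction of rounds with a stated commitment that the submission honored."""
    checked = [r for r in submissions if r in log.stated]
    if not checked:
        raise ValueError("no rounds with a stated commitment")
    return sum(log.is_coherent(r, submissions[r]) for r in checked) / len(checked)


log = CommitmentLog()
log.record(1, {"wood": 2})
log.record(2, {"wood": 1, "iron": 1})
rate = coherence_rate(log, {1: {"wood": 2, "iron": 0}, 2: {"wood": 3}})
```

Here round 1 honors the commitment (a missing resource counts as zero units) and round 2 does not, so the coherence rate is 0.5.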
Evaluation recommendations for AI economics research:
- Include iterated, multi-turn negotiation benchmarks with private, combinatorial preferences and verifiable joint optima.
- Report both outcome and process metrics (e.g., anchoring rates, commitment coherence, proposal dynamics) to diagnose coordination failure modes.
- Use baselines (oracle, no-talk, transparency) to decompose whether failures arise from reasoning, information asymmetry, or interactive grounding.
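The baseline decomposition can be expressed as simple differences between efficiency levels. The sketch below is an assumption-laden illustration: the function name is hypothetical, the oracle is taken as 100% because efficiency is measured against the oracle optimum, and the optional transparency term assumes that intervention is applied on top of cheap talk. The summary reports no efficiency figure for the transparency condition, so the example leaves it out.

```python
from typing import Dict, Optional


def decompose_gap(no_talk: float, cheap_talk: float,
                  transparency: Optional[float] = None,
                  oracle: float = 100.0) -> Dict[str, float]:
    """Split the distance to the oracle into parts (all values in percent)."""
    parts = {
        "gain_from_communication": cheap_talk - no_talk,
        "residual_gap_with_talk": oracle - cheap_talk,
    }
    if transparency is not None:
        # Assumption: transparency is layered on top of cheap talk.
        parts["gain_from_transparency"] = transparency - cheap_talk
        parts["residual_gap_with_transparency"] = oracle - transparency
    return parts


# Efficiencies reported above for M/C = 0.5: no-talk ~18.0%, cheap talk ~63.2%.
parts = decompose_gap(no_talk=18.0, cheap_talk=63.2)
rounded = {name: round(value, 1) for name, value in parts.items()}
```

With the reported figures, communication accounts for about 45.2 percentage points, and about 36.8 points remain between cheap talk and the oracle optimum.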
Policy and deployment considerations:
- Autonomous economic agents deployed in real markets or organizational settings should be audited for grounding robustness: otherwise, they may systematically underperform, lock into unfair splits, or violate commitments.
- Designing agent-mediated markets should include safeguards (e.g., explicit commitment logging, human-in-the-loop renegotiation triggers) to prevent costly overdraws or coordination breakdowns.

Suggested near-term interventions (from the paper and implications):
- Structured interaction scaffolds (templates for proposals, explicit confirmation prompts).
- Memory/commitment primitives that force explicit references back to past proposals when submitting allocations.
- Training on offline game traces emphasizing repair and reference-tracking.
- Protocol-level choice points (mandatory renegotiation after an overdraw; limited anchoring by prohibiting exact repetition of prior allocations unless reconfirmed).
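The protocol-level choice points could look something like the guard below. It is only a sketch: the function names, the flags, and the similarity measure (a weighted Jaccard overlap, used here as a stand-in for the paper's allocation-anchoring metric, whose exact definition is not given in this summary) are all assumptions.

```python
from typing import Dict, Optional

Allocation = Dict[str, int]  # resource name -> units


def allocation_similarity(prev: Allocation, curr: Allocation) -> float:
    """Weighted Jaccard overlap between two allocations (1.0 means identical)."""
    keys = set(prev) | set(curr)
    overlap = sum(min(prev.get(k, 0), curr.get(k, 0)) for k in keys)
    total = sum(max(prev.get(k, 0), curr.get(k, 0)) for k in keys)
    return 1.0 if total == 0 else overlap / total


def check_submission(prev: Optional[Allocation], curr: Allocation,
                     prev_overdraw: bool, renegotiated: bool,
                     reconfirmed: bool) -> Optional[str]:
    """Return a rejection reason, or None if the submission may proceed."""
    if prev_overdraw and not renegotiated:
        return "renegotiation required after overdraw"
    if (prev is not None and not reconfirmed
            and allocation_similarity(prev, curr) == 1.0):
        return "exact repetition requires reconfirmation"
    return None


same = allocation_similarity({"wood": 2, "iron": 1}, {"wood": 2, "iron": 1})
blocked = check_submission({"wood": 2}, {"wood": 2}, prev_overdraw=False,
                           renegotiated=False, reconfirmed=False)
allowed = check_submission({"wood": 2}, {"wood": 2}, prev_overdraw=False,
                           renegotiated=False, reconfirmed=True)
```

A guard of this kind does not stop agents from repeating an allocation; it only forces the repetition to be an explicit, reconfirmed choice rather than a silent default.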

Overall, the work argues that economic models and multi-agent systems must treat grounding as an operational object of design and evaluation — not an incidental property — if automated agents are to coordinate efficiently in realistic, repeated economic settings.

Assessment

Paper Type: descriptive
Evidence Strength: medium. The paper provides internally consistent, controlled simulations across multiple open- and closed-source LLMs and uses well-chosen baselines to attribute coordination failures to interactive grounding rather than individual reasoning; however, evidence is limited to synthetic LLM dyads, a single task class (iterated resource-allocation negotiation), and lacks human-subject or field validation, limiting external validity.
Methods Rigor: medium. The experimental design systematically decomposes the coordination gap with clear baselines and interventions and identifies plausible failure modes, and it tests multiple model families; but the work appears focused on a narrow task, likely lacks extensive statistical robustness checks, ablation breadth (e.g., prompt/system-message sensitivity), and real-world validation, leaving open alternative explanations tied to prompt engineering or model-specific behaviors.
Sample: Simulated dyads of large language models (both open- and closed-source) playing an iterated multi-turn negotiation game where two agents allocate shared resources toward private projects with jointly verifiable Pareto-optimal outcomes; experiments include conditions varying shared interaction history, communication allowed/forbidden (no-talk), and a full-transparency intervention. The paper does not report large-scale human or field data.
Themes: human_ai_collab, productivity
Identification: Comparative experimental baselines in a controlled simulated negotiation game: (1) oracle baseline showing individually-identifiable Pareto optima, (2) no-talk baseline showing need for communication, and (3) transparency/full-information intervention to separate information availability from interactive grounding; manipulations of shared history and communication protocols isolate failure modes.
Generalizability:
- Results are from LLM-to-LLM simulations and may not generalize to human–AI or human–human teams.
- Task is a specific iterated resource-allocation negotiation and may not transfer to other coordination tasks or richer environments.
- Model selection and prompt/system-message choices can strongly affect behavior; findings may vary with other models, model sizes, or firmware/weights.
- Absence of real-world stakes or incentives may change agent behavior compared with deployed or human-in-the-loop settings.
- Potential lack of extensive hyperparameter, temperature, and RL fine-tuning sweeps limits applicability to production deployments.

Claims (12)

Claim	Category	Direction	Confidence	Outcome	Details
Grounding is the collaborative process of establishing mutual belief sufficient for the current communicative purpose.	Other	positive	high	definition of grounding	0.03
Current multi-agent LLM benchmarks focus on static, one-shot tasks, overlooking the ability to repair grounding breakdowns across turns.	Research Productivity	negative	high	coverage of dynamic grounding in benchmarks	0.18
We introduce an iterated, multi-turn negotiation game in which two agents allocate shared resources toward private projects with verifiable jointly optimal outcomes.	Task Allocation	positive	high	existence of a multi-turn negotiation benchmark with verifiable optimal outcomes	0.18
While individual agents can identify Pareto-optimal allocations in isolation, agent dyads consistently fail to reach them across open- and closed-source models.	Task Allocation	negative	high	achievement of Pareto-optimal allocations in dyadic negotiation	0.18
Coordination degrades when shared interaction history is absent.	Task Allocation	negative	high	coordination performance as a function of shared interaction history	0.18
Accumulated context can itself become a liability through stubborn anchoring, where initial proposals are treated as axiomatic rather than negotiable.	Task Allocation	negative	high	propensity to revise initial proposals / anchoring behavior	0.18
Agents rely on perfunctory fairness (equal resource splits) over reward-maximizing coordination.	Task Allocation	negative	high	allocation strategy preference (equal split vs reward-maximizing)	0.18
Failures in referential binding occur, where agents lose track of commitments across turns.	Task Allocation	negative	high	referential binding / tracking of commitments across turns	0.18
These results highlight dynamic grounding as a critical and understudied axis of multi-agent coordination.	Research Productivity	positive	high	importance of dynamic grounding for multi-agent coordination	0.03
The oracle baseline establishes that the coordination gap is not attributable to individual reasoning limitations.	Task Allocation	negative	high	attribution of coordination gap to individual reasoning limitations	0.18
The no-talk baseline establishes that communication is necessary.	Task Allocation	positive	high	coordination performance with vs without communication	0.18
A full-transparency intervention establishes that information exchange alone is insufficient: the bottleneck lies in the interactive processes of joint plan formation, commitment, and execution that constitute dynamic grounding.	Task Allocation	negative	high	coordination performance under full information transparency	0.18

Large language models struggle to reach jointly optimal agreements in multi-turn negotiations not for lack of reasoning but because they fail to establish and repair shared meaning; giving them more information or talk does not fully close the coordination gap.