A dynamic coordination protocol for LLM teams cuts token use, runtime and coordination failures compared with static or hierarchical approaches, while maintaining or improving task accuracy across multiple tasks and base models.
Large language models (LLMs) are increasingly deployed in teams, yet existing coordination approaches often occupy two extremes. Highly structured methods rely on fixed roles, pipelines, or task decompositions assigned a priori. In contrast, fully unstructured teams enable adaptability and exploration but suffer from inefficiencies such as error propagation, inter-agent conflicts, and wasted resources (measured in time, tokens, or file operations). We introduce Language Agent Teams for Task Evolution (LATTE), a framework for coordinating LLM teams inspired by distributed systems, where processors must operate under partial observability and communication constraints. In LATTE, a team of agents collaboratively construct and maintain a shared, evolving coordination graph which encodes sub-task dependencies, individual agent assignment, and the current state of sub-task progress. This protocol maintains consistency while empowering agents to dynamically allocate work, adapt coordination, and discover new tasks. Across multiple collaborative tasks and a variety of base models, we demonstrate how LATTE reduces token usage, wall-clock time, communication, and coordination failures (e.g. file conflicts and redundant outputs) while matching or exceeding the accuracy of standard designs including MetaGPT, decentralized teams, top-down Leader-Worker hierarchies, and static decompositions.
Summary
Main Finding
LATTE (Language Agent Teams for Task Evolution) is a hybrid coordination framework for multi-LLM teams that maintains a shared, evolving task graph. By letting workers propose local graph changes while a lead serializes and validates graph mutations, LATTE preserves global consistency and enables opportunistic parallelism. Across three collaborative domains and two large base models, LATTE Pareto-dominates common team architectures (Leader-Worker, static pipelines like MetaGPT, fully decentralized teams) — substantially lowering token usage and wall-clock time, reducing coordination failures, and matching or improving task accuracy.
Key Points
- Core mechanism: a dynamic coordination DAG (nodes = subtasks; edges = dependencies; node state includes assigned agent and status). The frontier set defines immediately executable subtasks and determines maximal safe parallelism.
- Graph mutation operators (with pre/postconditions and invariants): DISCOVER, ASSIGN, CLAIM, COMPLETE, RELEASE, CLOSE, VERIFY. Workers may propose local DISCOVER/COMPLETE/CLAIM actions; the Lead controls graph-wide operators (ASSIGN, RELEASE, VERIFY, CLOSE).
- Execution protocol: initial planning (Lead seeds G0), iterative rounds with heartbeat monitoring (straggler detection), frontier computation, dispatch, parallel execution, and termination. Context scoping: workers see only local subtask + predecessors; Lead sees the graph (not full traces).
- Desiderata addressed: hybrid coordination (consistency + adaptability), adaptive scaling (dispatch = min(|frontier|, |workers|)), fault tolerance/monitoring (heartbeat + RELEASE/VERIFY), and bounded context to limit token/context growth.
- Empirical outcomes (aggregated):
  - Accuracy: LATTE ≈ 80% (higher than static graphs, MetaGPT, and Leader-Worker; modestly higher than decentralized).
  - Token usage: LATTE mean token cost ≈ 47.5% (normalized), roughly half that of the next-best baseline (static graph ≈ 86.9%), with large reductions vs MetaGPT, Leader-Worker, and decentralized teams.
  - Wall-clock time: LATTE is faster on average (aggregated ≈ 3.5 minutes, versus for example ≈ 11.5 minutes for MetaGPT).
  - Coordination metrics: fewer file overwrites, fewer concurrent conflicts, fewer wasted characters, fewer idle rounds, and lower straggler latencies.
- Tasks used to stress different coordination needs: exploratory data analysis (high discovery/parallelism), debugging (parallel diagnosis + sequential dependencies), and library extension (known modules with sequential/integration steps).
- Hybrid proposal-evaluate pattern has a probabilistic/Metropolis-Hastings inspiration: workers sample proposals; the Lead evaluates/accepts to maintain global invariants.
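The graph, frontier, and propose-then-validate pattern from the key points can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The names `CoordinationGraph`, `apply`, `frontier`, and `dispatch_width`, the three-value `Status` enum, and the specific preconditions are all hypothetical. Only the operator names, the worker/Lead permission split, and the min(|frontier|, |workers|) dispatch rule come from the summary. VERIFY, CLOSE, and heartbeat monitoring are not modeled, and acceptance here is a deterministic check rather than a Metropolis-Hastings acceptance step.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Iterable, List, Optional, Set


class Status(Enum):
    # Hypothetical status values; the summary only says node state includes a status.
    OPEN = "open"
    CLAIMED = "claimed"
    DONE = "done"


@dataclass
class Node:
    deps: Set[str] = field(default_factory=set)  # ids of predecessor subtasks
    status: Status = Status.OPEN
    agent: Optional[str] = None                  # agent currently holding the subtask


# Operators that workers may propose, per the summary; all others are Lead-only.
WORKER_OPS = {"DISCOVER", "CLAIM", "COMPLETE"}


class CoordinationGraph:
    def __init__(self) -> None:
        self.nodes: Dict[str, Node] = {}

    def frontier(self) -> List[str]:
        """Subtasks that are open and whose predecessors are all done."""
        return sorted(
            task for task, node in self.nodes.items()
            if node.status is Status.OPEN
            and all(self.nodes[d].status is Status.DONE for d in node.deps)
        )

    def dispatch_width(self, n_workers: int) -> int:
        """Adaptive scaling rule: min(|frontier|, |workers|)."""
        return min(len(self.frontier()), n_workers)

    def apply(self, role: str, op: str, task: str,
              agent: Optional[str] = None, deps: Iterable[str] = ()) -> bool:
        """Single serialization point: accept a proposed mutation only if the
        role may issue it and its (assumed) preconditions hold."""
        if role == "worker" and op not in WORKER_OPS:
            return False
        node = self.nodes.get(task)
        if op == "DISCOVER":
            dep_set = set(deps)
            # A new node may only depend on existing nodes, so no cycle can form.
            if node is not None or not dep_set.issubset(self.nodes):
                return False
            self.nodes[task] = Node(deps=dep_set)
            return True
        if node is None:
            return False
        if op in ("CLAIM", "ASSIGN"):
            if agent is None or task not in self.frontier():
                return False
            node.status, node.agent = Status.CLAIMED, agent
            return True
        if op == "COMPLETE":
            if node.status is not Status.CLAIMED or node.agent != agent:
                return False
            node.status = Status.DONE
            return True
        if op == "RELEASE":
            if node.status is not Status.CLAIMED:
                return False
            node.status, node.agent = Status.OPEN, None
            return True
        return False  # VERIFY and CLOSE are not modeled in this sketch


# Usage: the Lead seeds a small graph, then one worker claims and completes "load".
g = CoordinationGraph()
g.apply("lead", "DISCOVER", "load")
g.apply("lead", "DISCOVER", "clean", deps=["load"])
g.apply("lead", "DISCOVER", "plot", deps=["load"])
g.apply("worker", "CLAIM", "load", agent="w1")
g.apply("worker", "COMPLETE", "load", agent="w1")
print(g.frontier(), g.dispatch_width(4))  # ['clean', 'plot'] 2
```

Routing every mutation through one `apply` call mirrors the idea that the Lead serializes graph changes. Workers can propose concurrently, but the graph only ever moves between states that satisfy the invariants.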
Data & Methods
- Experimental design:
  - Tasks: 3 domains: exploratory data analysis (opaque dataset with planted properties), debugging (repo with inserted bugs + test suite), and library extension (complete classes & modules from stubs; test suite).
  - Baselines: Leader-Worker (single Lead + 4 Workers), MetaGPT (role-based pipeline), decentralized peer team (5 agents), and a static graph ablation (Lead initializes G0; no graph updates).
  - Base models: Claude Sonnet 4-6 and GPT-5.2.
  - Team size: N = 5 in all conditions to match MetaGPT.
  - Trials: 10 runs per condition → total 300 trials (5 team structures × 2 models × 3 tasks × 10).
- Metrics:
  - Accuracy/success rate (task-specific test suites).
  - Efficiency: tokens consumed, wall-clock time (seconds/minutes), expected cost (tokens or time weighted by completion rate).
  - Coordination-specific: overwrite rate, concurrent conflicts, wasted characters, idle rounds, straggler tail latency.
- Key quantitative highlights from reported results:
  - Aggregated tokens (mean ± SEM): LATTE ≈ 148K ± 14K vs Leader-Worker 379K ± 51K, Decentralized 419K ± 47K, Static 297K ± 40K, MetaGPT 397K ± 59K.
  - Aggregated wall-clock (minutes): LATTE ≈ 3.5 ± 0.3 vs baselines higher (e.g., MetaGPT ≈ 11.5 ± 1.2).
  - Per-task accuracy examples: Data analysis ~96% (LATTE), Debug ~100% (LATTE), Library extension ~40% (LATTE), highlighting the domain dependence of gains.
- Statistical testing: Mann–Whitney U tests on pooled normalized costs show LATTE’s reductions are significant (p < 0.01 versus most baselines).
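The arithmetic behind the trial count and the token comparison can be checked directly from the reported means. The sketch below uses only the figures listed above. The `expected_cost` function is an assumption: the summary says cost is "weighted by completion rate" without giving a formula, so mean cost divided by completion rate is used here as one plausible reading.

```python
# Mean token counts in thousands, copied from the aggregated results above.
MEAN_TOKENS_K = {
    "LATTE": 148,
    "Leader-Worker": 379,
    "Decentralized": 419,
    "Static": 297,
    "MetaGPT": 397,
}


def trial_count(structures: int, models: int, tasks: int, runs: int) -> int:
    """Total trials in a fully crossed design."""
    return structures * models * tasks * runs


def token_reduction(baseline: str) -> float:
    """Fractional reduction in mean tokens of LATTE relative to a baseline."""
    return 1 - MEAN_TOKENS_K["LATTE"] / MEAN_TOKENS_K[baseline]


def expected_cost(mean_cost: float, completion_rate: float) -> float:
    """Assumed definition: cost per successful completion."""
    if not 0 < completion_rate <= 1:
        raise ValueError("completion_rate must be in (0, 1]")
    return mean_cost / completion_rate


print(trial_count(structures=5, models=2, tasks=3, runs=10))  # 300
for name in MEAN_TOKENS_K:
    if name != "LATTE":
        print(f"{name}: {token_reduction(name):.0%} fewer tokens with LATTE")
```

On raw means, LATTE uses about 50% fewer tokens than the static graph and a little over 60% fewer than the other three baselines. Under the assumed definition, `expected_cost(148, 0.80)` gives roughly 185K tokens per successful LATTE run.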
Implications for AI Economics
- Direct cost reductions for LLM workflows: Fewer tokens and lower wall-clock times translate to lower API and compute bills per completed task. A ~50% drop in token consumption (normalized) implies substantial per-project savings when teams use paid LLMs.
- Better utilization and lower marginal cost of scale: LATTE’s frontier-driven dispatch and self-scheduling (CLAIM) reduce idle compute and straggler-induced waits, improving throughput with fixed agent pools — enabling more tasks per dollar and more favorable utilization curves for platform buyers.
- Reduced coordination overhead as an accelerant of labor substitution: By lowering coordination failure rates (overwrites, redundant outputs, conflicts that need resolution), LATTE raises the effective productivity of LLM teams relative to single-expert setups or rigid pipelines. This could shift where human labor adds value (e.g., oversight and verification rather than low-level integration), affecting labor demand and task composition in software and data workflows.
- Productization and market opportunities: Platforms and toolmakers can capture value by exposing task-graph primitives (CLAIM, DISCOVER, VERIFY) or managed LATTE-style orchestration as a service (SaaS), creating new monetizable layers above base LLM providers — similar to workflow engines for human teams.
- Incentive and pricing implications for LLM providers: If orchestration frameworks materially lower token consumption, providers may see slower revenue growth per downstream workflow unless pricing adapts (e.g., charging for orchestration features, higher per-token prices for integrated solutions, or shifting to subscription/SLA models). Conversely, providers supporting efficient orchestration could gain adoption advantages.
- Risk mitigation and compliance value: LATTE’s explicit, auditable coordination graph and selective verification lower error propagation and make post-hoc inspection easier. For regulated or high-stakes applications, this reduces expected compliance costs and liability exposure, which has economic value (lower insurance, higher willingness to deploy).
- Aggregate market efficiency & externalities: As multi-agent orchestration becomes more efficient, the marginal cost of deploying LLM-based teams declines, increasing demand for automation across industries. That expansion could produce positive network effects (more tooling, specialized agents) but may also call for labor-market adjustment and regulatory attention.
- Limitations & caveats that affect economic interpretation:
  - Experimental scale: team size fixed at five; gains may vary with much larger teams or heterogeneous agent pools.
  - Task selection: empirical tasks emphasize software/data workflows; other domains (creative, policy, adversarial) may show different cost/accuracy tradeoffs.
  - Model assumptions: results reported for two high-end LLMs; efficiency gains may interact with model pricing and performance characteristics (e.g., cheaper smaller models might change the comparative advantage).
  - Implementation and human oversight costs: adopting LATTE requires engineering (graph management, heartbeat, verification policies) and possibly human monitors; these adoption costs reduce short-run savings.
- Policy and market implications: regulators and enterprise purchasers should value inspectability and audit trails (LATTE offers explicit artifacts), and procurement should consider orchestration-level SLAs and pricing models rather than only per-token costs.
Overall, LATTE demonstrates that architecture-level coordination design materially alters the economic tradeoffs of multi-LLM workflows: it reduces variable costs (tokens, compute time), improves reliability (reducing rework and supervision), and opens new product and pricing opportunities around orchestration — while the size of those effects will depend on task mix, team scale, and base-model pricing.
Assessment
Claims (11)
| Claim | Category | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|---|
| We introduce Language Agent Teams for Task Evolution (LATTE), a framework for coordinating LLM teams inspired by distributed systems. | Other | positive | high | framework introduction / coordination protocol | 0.03 |
| In LATTE, a team of agents collaboratively construct and maintain a shared, evolving coordination graph which encodes sub-task dependencies, individual agent assignment, and the current state of sub-task progress. | Task Allocation | positive | high | task_allocation and coordination state (coordination graph) | 0.03 |
| LATTE reduces token usage compared to standard designs (including MetaGPT, decentralized teams, top-down Leader-Worker hierarchies, and static decompositions). | Organizational Efficiency | positive | high | token usage | 0.18 |
| LATTE reduces wall-clock time compared to standard designs. | Task Completion Time | positive | high | wall-clock time (task completion time) | 0.18 |
| LATTE reduces communication (and communication overhead) compared to standard designs. | Organizational Efficiency | positive | high | communication / communication overhead | 0.18 |
| LATTE reduces coordination failures such as file conflicts and redundant outputs. | Error Rate | positive | high | coordination failures (file conflicts, redundant outputs) | 0.18 |
| LATTE matches or exceeds the accuracy of standard designs including MetaGPT, decentralized teams, top-down Leader-Worker hierarchies, and static decompositions. | Output Quality | positive | high | accuracy (output quality) | 0.18 |
| LATTE empowers agents to dynamically allocate work, adapt coordination, and discover new tasks. | Task Allocation | positive | high | dynamic task allocation / discovery | 0.18 |
| Existing coordination approaches often occupy two extremes: highly structured methods that rely on fixed roles/pipelines assigned a priori, and fully unstructured teams that enable adaptability but suffer inefficiencies like error propagation, inter-agent conflicts, and wasted resources. | Organizational Efficiency | negative | high | coordination efficiency / error propagation / resource waste | 0.03 |
| The LATTE protocol maintains consistency under partial observability and communication constraints while enabling dynamic allocation and adaptation. | Organizational Efficiency | positive | high | consistency of coordination under partial observability/communication constraints | 0.18 |
| The evaluation covers multiple collaborative tasks and a variety of base LLM models. | Other | positive | high | evaluation breadth (number/types of tasks and models) | 0.18 |