Improving the Efficiency of Language Agent Teams with Adaptive Task Graphs

Large language models (LLMs) are increasingly deployed in teams, yet existing coordination approaches often occupy two extremes. Highly structured methods rely on fixed roles, pipelines, or task decompositions assigned a priori. In contrast, fully unstructured teams enable adaptability and exploration but suffer from inefficiencies such as error propagation, inter-agent conflicts, and wasted resources (measured in time, tokens, or file operations). We introduce Language Agent Teams for Task Evolution (LATTE), a framework for coordinating LLM teams inspired by distributed systems, where processors must operate under partial observability and communication constraints. In LATTE, a team of agents collaboratively constructs and maintains a shared, evolving coordination graph which encodes sub-task dependencies, individual agent assignment, and the current state of sub-task progress. This protocol maintains consistency while empowering agents to dynamically allocate work, adapt coordination, and discover new tasks. Across multiple collaborative tasks and a variety of base models, we demonstrate how LATTE reduces token usage, wall-clock time, communication, and coordination failures (e.g., file conflicts and redundant outputs) while matching or exceeding the accuracy of standard designs including MetaGPT, decentralized teams, top-down Leader-Worker hierarchies, and static decompositions.

Summary

Main Finding

LATTE (Language Agent Teams for Task Evolution) is a hybrid coordination framework for multi-LLM teams that maintains a shared, evolving task graph. By letting workers propose local graph changes while a lead serializes and validates graph mutations, LATTE preserves global consistency and enables opportunistic parallelism. Across three collaborative domains and two large base models, LATTE Pareto-dominates common team architectures (Leader-Worker, static pipelines like MetaGPT, fully decentralized teams) — substantially lowering token usage and wall-clock time, reducing coordination failures, and matching or improving task accuracy.

Key Points

Core mechanism: a dynamic coordination DAG (nodes = subtasks; edges = dependencies; node state includes assigned agent and status). The frontier set defines immediately executable subtasks and determines maximal safe parallelism.
Graph mutation operators (with pre/postconditions and invariants): DISCOVER, ASSIGN, CLAIM, COMPLETE, RELEASE, CLOSE, VERIFY. Workers may propose local DISCOVER/COMPLETE/CLAIM actions; the Lead controls graph-wide operators (ASSIGN, RELEASE, VERIFY, CLOSE).
Execution protocol: initial planning (Lead seeds G0), iterative rounds with heartbeat monitoring (straggler detection), frontier computation, dispatch, parallel execution, and termination. Context scoping: workers see only local subtask + predecessors; Lead sees the graph (not full traces).
Desiderata addressed: hybrid coordination (consistency + adaptability), adaptive scaling (dispatch = min(|frontier|, |workers|)), fault tolerance/monitoring (heartbeat + RELEASE/VERIFY), and bounded context to limit token/context growth.
Empirical outcomes (aggregated):
- Accuracy: LATTE ≈ 80% (higher than static graphs, MetaGPT, Leader-Worker; modestly higher than decentralized).
- Token usage: LATTE mean token cost ≈ 47.5% (normalized), roughly half the next-best baseline (static graph ≈ 86.9%); large reductions vs MetaGPT, Leader-Worker, decentralized.
- Wall-clock time: LATTE is fastest on average (about 3.5 minutes aggregated, versus, for example, about 11.5 minutes for MetaGPT).
- Coordination metrics improved: fewer file overwrites, fewer concurrent conflicts, reduced wasted characters, fewer idle rounds and lower straggler latencies.
Tasks used to stress different coordination needs: exploratory data analysis (high discovery/parallelism), debugging (parallel diagnosis + sequential dependencies), and library extension (known modules with sequential/integration steps).
Hybrid proposal-evaluate pattern has a probabilistic/Metropolis-Hastings inspiration: workers sample proposals; the Lead evaluates/accepts to maintain global invariants.
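
The graph, frontier, and dispatch rules listed in the Key Points can be made concrete with a small sketch. The code below is an illustration only, not the authors' implementation: the class name TaskGraph, the method names, and the three-value status field are assumptions chosen for readability, and only a subset of the operators (DISCOVER, a Lead-driven assignment step, and COMPLETE) is modeled. It shows how a frontier can be computed from dependency state and how the dispatch size min(|frontier|, |workers|) falls out of it.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    deps: set[str] = field(default_factory=set)  # predecessor subtask ids
    status: str = "open"                         # open | claimed | done (assumed states)
    agent: Optional[str] = None                  # worker currently assigned, if any


class TaskGraph:
    """Hypothetical coordination DAG; names are illustrative assumptions."""

    def __init__(self) -> None:
        self.nodes: dict[str, Node] = {}

    def discover(self, task_id: str, deps=()) -> bool:
        """Lead-side check of a worker's DISCOVER proposal.

        A new node may only depend on existing nodes, so accepting it
        can never introduce a cycle.
        """
        deps = set(deps)
        if task_id in self.nodes or not all(d in self.nodes for d in deps):
            return False  # proposal rejected, graph unchanged
        self.nodes[task_id] = Node(deps=deps)
        return True

    def frontier(self) -> list[str]:
        """Open subtasks whose predecessors are all done."""
        return sorted(
            tid for tid, node in self.nodes.items()
            if node.status == "open"
            and all(self.nodes[d].status == "done" for d in node.deps)
        )

    def dispatch(self, idle_workers: list[str]) -> dict[str, str]:
        """Assign at most min(|frontier|, |workers|) subtasks this round."""
        ready = self.frontier()
        k = min(len(ready), len(idle_workers))
        assignment = {}
        for worker, tid in zip(idle_workers[:k], ready[:k]):
            self.nodes[tid].status = "claimed"
            self.nodes[tid].agent = worker
            assignment[worker] = tid
        return assignment

    def complete(self, task_id: str, worker: str) -> bool:
        """COMPLETE succeeds only for the worker holding the subtask."""
        node = self.nodes.get(task_id)
        if node is None or node.status != "claimed" or node.agent != worker:
            return False
        node.status = "done"
        return True


# Usage: a four-node analysis graph processed over two rounds.
g = TaskGraph()
g.discover("load")
g.discover("clean", ["load"])
g.discover("plot", ["load"])
g.discover("report", ["clean", "plot"])
round_one = g.dispatch(["w1", "w2", "w3", "w4"])  # only "load" is ready
g.complete("load", "w1")
round_two = g.dispatch(["w1", "w2", "w3", "w4"])  # "clean" and "plot" run in parallel
```

In this sketch the first round dispatches a single worker because the frontier holds one subtask, and the second round dispatches two. Keeping validation in one place (here, the discover and complete checks) mirrors the idea that workers propose changes while a single party serializes and accepts them.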

Data & Methods

Experimental design:
- Tasks: 3 domains — exploratory data analysis (opaque dataset with planted properties), debugging (repo with inserted bugs + test suite), and library extension (complete classes & modules from stubs; test suite).
- Baselines: Leader-Worker (single Lead + 4 Workers), MetaGPT (role-based pipeline), decentralized peer team (5 agents), and a static graph ablation (Lead initializes G0; no graph updates).
- Base models: Claude Sonnet 4-6 and GPT-5.2.
- Team size: N = 5 in all conditions to match MetaGPT.
- Trials: 10 runs per condition → total 300 trials (5 team structures × 2 models × 3 tasks × 10).
Metrics:
- Accuracy/success rate (task-specific test suites).
- Efficiency: tokens consumed, wall-clock time (seconds/minutes), expected cost (tokens or time weighted by completion rate).
- Coordination-specific: overwrite rate, concurrent conflicts, wasted characters, idle rounds, straggler tail latency.
Key quantitative highlights from reported results:
- Aggregated tokens (mean ± SEM): LATTE ≈ 148K ± 14K vs Leader-Worker 379K ± 51K, Decentralized 419K ± 47K, Static 297K ± 40K, MetaGPT 397K ± 59K.
- Aggregated wall-clock (minutes): LATTE ≈ 3.5 ± 0.3 vs baselines higher (e.g., MetaGPT ≈ 11.5 ± 1.2).
- Per-task accuracy examples: Data analysis ~96% (LATTE), Debug ~100% (LATTE), Library extension ~40% (LATTE) — highlighting domain dependence of gains.
Statistical testing: Mann–Whitney U tests on pooled normalized costs show LATTE’s reductions are significant (p < 0.01 versus most baselines).
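
As a quick arithmetic check on the figures above, the relative savings can be recomputed directly from the reported means. The snippet below is a sketch that uses only the aggregated values quoted in this section; the helper name percent_reduction is an assumption, and the unweighted mean over tasks is one plausible reading of how the aggregate accuracy was formed, not something the source states.

```python
# Aggregated means copied from the results reported above (tokens in thousands).
TOKENS_K = {
    "LATTE": 148,
    "Leader-Worker": 379,
    "Decentralized": 419,
    "Static": 297,
    "MetaGPT": 397,
}
MINUTES = {"LATTE": 3.5, "MetaGPT": 11.5}
LATTE_ACCURACY_PCT = {"data analysis": 96, "debugging": 100, "library extension": 40}


def percent_reduction(latte: float, baseline: float) -> float:
    """Relative saving of LATTE against a baseline, in percent."""
    return 100.0 * (1.0 - latte / baseline)


token_savings = {
    name: percent_reduction(TOKENS_K["LATTE"], value)
    for name, value in TOKENS_K.items()
    if name != "LATTE"
}
time_saving_vs_metagpt = percent_reduction(MINUTES["LATTE"], MINUTES["MetaGPT"])

# Unweighted mean over the three tasks (an assumption, see the lead-in).
mean_accuracy = sum(LATTE_ACCURACY_PCT.values()) / len(LATTE_ACCURACY_PCT)

for name, saving in token_savings.items():
    print(f"tokens vs {name}: {saving:.0f}% lower")
print(f"time vs MetaGPT: {time_saving_vs_metagpt:.0f}% lower")
print(f"mean LATTE accuracy: {mean_accuracy:.1f}%")
```

On these inputs the token savings come out at roughly 61% (Leader-Worker), 65% (Decentralized), 50% (Static), and 63% (MetaGPT), the time saving against MetaGPT at roughly 70%, and the unweighted mean accuracy at about 78.7%, which is consistent with the approximate 80% quoted earlier.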

Implications for AI Economics

Direct cost reductions for LLM workflows: Fewer tokens and lower wall-clock times translate to lower API and compute bills per completed task. A ~50% drop in token consumption (normalized) implies substantial per-project savings when teams use paid LLMs.
Better utilization and lower marginal cost of scale: LATTE’s frontier-driven dispatch and self-scheduling (CLAIM) reduce idle compute and straggler-induced waits, improving throughput with fixed agent pools — enabling more tasks per dollar and more favorable utilization curves for platform buyers.
Reduced coordination overhead as an accelerant of labor substitution: By lowering coordination failure rates (overwrites, redundant outputs, conflict resolution), LATTE raises the effective productivity of LLM teams relative to single-expert or rigid pipelines. This could shift where human labor adds value (e.g., oversight and verification rather than low-level integration), affecting labor demand and task composition in software and data workflows.
Productization and market opportunities: Platforms and toolmakers can capture value by exposing task-graph primitives (CLAIM, DISCOVER, VERIFY) or managed LATTE-style orchestration as a service (SaaS), creating new monetizable layers above base LLM providers — similar to workflow engines for human teams.
Incentive and pricing implications for LLM providers: If orchestration frameworks materially lower token consumption, providers may see slower revenue growth per downstream workflow unless pricing adapts (e.g., charging for orchestration features, higher per-token prices for integrated solutions, or shifting to subscription/SLA models). Conversely, providers supporting efficient orchestration could gain adoption advantages.
Risk mitigation and compliance value: LATTE’s explicit, auditable coordination graph and selective verification lower error propagation and make post-hoc inspection easier. For regulated or high-stakes applications, this reduces expected compliance costs and liability exposure, which has economic value (lower insurance, higher willingness to deploy).
Aggregate market efficiency & externalities: As multi-agent orchestration becomes more efficient, the marginal cost of deploying LLM-based teams declines, increasing demand for automation across industries. That expansion could produce positive network effects (more tooling, specialized agents) but may also call for labor-market adjustments and regulatory attention.
Limitations & caveats that affect economic interpretation:
- Experimental scale: team size fixed at five; gains may vary with much larger teams or heterogeneous agent pools.
- Task selection: empirical tasks emphasize software/data workflows; other domains (creative, policy, adversarial) may show different cost/accuracy tradeoffs.
- Model assumptions: results reported for two high-end LLMs; efficiency gains may interact with model pricing and performance characteristics (e.g., cheaper smaller models might change the comparative advantage).
- Implementation and human oversight costs: adopting LATTE requires engineering (graph management, heartbeat, verification policies) and possibly human monitors — these adoption costs reduce short-run savings.
Policy and market implications: regulators and enterprise purchasers should value inspectability and audit trails (LATTE offers explicit artifacts), and procurement should consider orchestration-level SLAs and pricing models rather than only per-token costs.
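
To make the per-task savings discussed under direct cost reductions tangible, the token means can be converted into money under an assumed price. Everything about pricing here is hypothetical: the constant PRICE_PER_MILLION_TOKENS_USD is an illustrative placeholder, not a real provider rate, and dividing by the success rate is just one reading of "cost weighted by completion rate". Only the token means (148K for LATTE, 297K for the static graph) and the approximate 80% LATTE accuracy come from the summary.

```python
# Hypothetical blended price; an assumption for illustration, not from the source.
PRICE_PER_MILLION_TOKENS_USD = 5.00


def cost_per_run(tokens: float, price: float = PRICE_PER_MILLION_TOKENS_USD) -> float:
    """Dollar cost of one run, given its token count."""
    return tokens / 1_000_000 * price


def cost_per_success(tokens: float, success_rate: float,
                     price: float = PRICE_PER_MILLION_TOKENS_USD) -> float:
    """Expected dollar cost per successful run (assumed definition)."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_run(tokens, price) / success_rate


latte_run = cost_per_run(148_000)    # mean LATTE tokens from the summary
static_run = cost_per_run(297_000)   # mean static-graph tokens from the summary
savings_per_1000_runs = 1000 * (static_run - latte_run)
latte_per_success = cost_per_success(148_000, 0.80)

print(f"LATTE per run:  ${latte_run:.3f}")
print(f"Static per run: ${static_run:.3f}")
print(f"Savings per 1000 runs vs static: ${savings_per_1000_runs:.0f}")
print(f"LATTE per successful run: ${latte_per_success:.3f}")
```

At the assumed price, a LATTE run costs about $0.74 against about $1.49 for the static graph, or roughly $745 saved per thousand runs. The absolute figures scale linearly with whatever price is substituted, so only the ratio is informative.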

Overall, LATTE demonstrates that architecture-level coordination design materially alters the economic tradeoffs of multi-LLM workflows: it reduces variable costs (tokens, compute time), improves reliability (reducing rework and supervision), and opens new product and pricing opportunities around orchestration — while the size of those effects will depend on task mix, team scale, and base-model pricing.

Assessment

Paper Type: descriptive
Evidence Strength: n/a. This is a systems/engineering paper evaluating a coordination protocol for LLM agent teams rather than a causal study of economic outcomes; it reports performance improvements (tokens, time, failures, accuracy) but does not claim causal identification in an economic sense.
Methods Rigor: medium. Authors compare LATTE to multiple existing coordination designs (MetaGPT, decentralized teams, leader-worker, static decompositions) across several collaborative tasks and base LLMs and measure multiple metrics (token use, wall-clock time, communication, coordination failures, accuracy). However, experiments appear limited to simulated agent teams and benchmark tasks with no real-world organizational deployments or human-in-the-loop evaluations reported; details on statistical significance, sensitivity analyses, and full reproducibility (hyperparameters, seeds, full task set) are not described here.
Sample: Experimental evaluations across multiple collaborative tasks (unspecified task suite of multi-agent coordination problems) and a variety of base LLMs; simulated teams of language agents implemented under different coordination protocols; comparisons include MetaGPT, decentralized teams, leader-worker hierarchies, and static decompositions; outcomes measured: token usage, wall-clock time, communication, coordination failures (file conflicts, redundant outputs), and task accuracy.
Themes: productivity, org_design
Generalizability:
- Results from simulated LLM-agent teams may not generalize to human-AI teams or real organizational settings.
- Task suite may be narrow or synthetic; performance on other task domains (e.g., open-ended creative work, high-stakes decision making) is unknown.
- Evaluations depend on the particular base LLMs tested; gains may change with model scale, architecture, or future models.
- Scalability to much larger teams, varied latency/network conditions, or heterogeneous agent capabilities is untested.
- Metrics like token usage and file conflicts map imperfectly to real economic productivity or labor outcomes.

Claims (11)

Claim	Category	Direction	Confidence	Outcome	Details
We introduce Language Agent Teams for Task Evolution (LATTE), a framework for coordinating LLM teams inspired by distributed systems.	Other	positive	high	framework introduction / coordination protocol	0.03
In LATTE, a team of agents collaboratively constructs and maintains a shared, evolving coordination graph which encodes sub-task dependencies, individual agent assignment, and the current state of sub-task progress.	Task Allocation	positive	high	task_allocation and coordination state (coordination graph)	0.03
LATTE reduces token usage compared to standard designs (including MetaGPT, decentralized teams, top-down Leader-Worker hierarchies, and static decompositions).	Organizational Efficiency	positive	high	token usage	0.18
LATTE reduces wall-clock time compared to standard designs.	Task Completion Time	positive	high	wall-clock time (task completion time)	0.18
LATTE reduces communication (and communication overhead) compared to standard designs.	Organizational Efficiency	positive	high	communication / communication overhead	0.18
LATTE reduces coordination failures such as file conflicts and redundant outputs.	Error Rate	positive	high	coordination failures (file conflicts, redundant outputs)	0.18
LATTE matches or exceeds the accuracy of standard designs including MetaGPT, decentralized teams, top-down Leader-Worker hierarchies, and static decompositions.	Output Quality	positive	high	accuracy (output quality)	0.18
LATTE empowers agents to dynamically allocate work, adapt coordination, and discover new tasks.	Task Allocation	positive	high	dynamic task allocation / discovery	0.18
Existing coordination approaches often occupy two extremes: highly structured methods that rely on fixed roles/pipelines assigned a priori, and fully unstructured teams that enable adaptability but suffer inefficiencies like error propagation, inter-agent conflicts, and wasted resources.	Organizational Efficiency	negative	high	coordination efficiency / error propagation / resource waste	0.03
The LATTE protocol maintains consistency under partial observability and communication constraints while enabling dynamic allocation and adaptation.	Organizational Efficiency	positive	high	consistency of coordination under partial observability/communication constraints	0.18
The evaluation covers multiple collaborative tasks and a variety of base LLM models.	Other	positive	high	evaluation breadth (number/types of tasks and models)	0.18

A dynamic coordination protocol for LLM teams cuts token use, runtime and coordination failures compared with static or hierarchical approaches, while maintaining or improving task accuracy across multiple tasks and base models.