An expensive text-only ‘manager’ can steer a cheaper code-executing ‘worker’ to match a strong single model’s performance on software-engineering tasks while cutting expensive-token use substantially; the benefit depends on a genuine capability gap and structured, active direction rather than passive review.

Can AI Models Direct Each Other? Organizational Structure as a Probe into Training Limitations

Rui Liu · March 27, 2026

arXiv · descriptive · medium evidence · relevance 7/10

A powerful text-only manager can direct a cheaper code-capable worker to match a strong single model's performance on software-engineering tasks while using far fewer expensive tokens, but gains require a real capability gap and active directed exploration rather than mere review.

Can an expensive AI model effectively direct a cheap one to solve software engineering tasks? We study this question by introducing ManagerWorker, a two-agent pipeline where an expensive "manager" model (text-only, no code execution) analyzes issues, dispatches exploration tasks, and reviews implementations, while a cheap "worker" model (with full repo access) executes code changes. We evaluate on 200 instances from SWE-bench Lite across five configurations that vary the manager-worker relationship, pipeline complexity, and model pairing. Our findings reveal both the promise and the limits of multi-agent direction: (1) a strong manager directing a weak worker (62%) matches a strong single agent (60%) at a fraction of the strong-model token usage, showing that expensive reasoning can substitute for expensive execution; (2) a weak manager directing a weak worker (42%) performs worse than the weak agent alone (44%), demonstrating that the directing relationship requires a genuine capability gap--structure without substance is pure overhead; (3) the manager's value lies in directing, not merely reviewing--a minimal review-only loop adds just 2pp over the baseline, while structured exploration and planning add 11pp, showing that active direction is what makes the capability gap productive; and (4) these behaviors trace to a single root cause: current models are trained as monolithic agents, and splitting them into director/worker roles fights their training distribution. The pipeline succeeds by designing around this mismatch--keeping each model close to its trained mode (text generation for the manager, tool use for the worker) and externalizing organizational structure to code. This diagnosis points to concrete training gaps: delegation, scoped execution, and mode switching are skills absent from current training data.

Summary

Main Finding

An expensive, text-only “manager” LLM can effectively direct a cheap, tool-enabled “worker” LLM on software-engineering repair tasks so that the two-agent MANAGERWORKER pipeline (62% resolve rate) matches a single expensive agent (60%) while using far fewer expensive-model tokens. However, this benefit requires a genuine capability gap: pairing a weak manager with a weak worker performs worse than the weak worker alone (42% vs 44%), and a minimal review-only manager yields only modest gains (Simple Loop 53% vs full pipeline 62%). The root cause is a training-distribution mismatch: current models are trained as monolithic agents and lack delegation/role-switching skills.

Key Points

MANAGERWORKER architecture:
- Manager: expensive, text-only, no repo access. Responsible for analysis, structured exploration tasks, planning, and review.
- Worker: cheap, full repo access. Executes exploration tasks, summarizes findings, implements patches.
- Iterative loops: exploration (up to 3 rounds) and implementation (guided first attempt, strict corrections thereafter, up to 3 rounds).
Performance (on 200 SWE-bench Lite instances unless noted):
- MANAGERWORKER (Sonnet 4.6 manager + GPT-5-mini worker): 124/200 = 62%.
- Strong Direct (single Sonnet 4.6): 120/200 = 60%.
- Weak Direct (single GPT-5-mini): 101/200 = 50.5% (≈ 51%).
- Simple Loop (minimal review manager + GPT-5-mini): 106/199 ≈ 53%.
- On a 50-instance subset (used for the weak-manager ablation): Weak Direct 44%; Weak→Weak (manager and worker both GPT-5-mini) 42%; MANAGERWORKER 64%.
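The percentages above follow directly from the reported counts. The sketch below recomputes them, along with the percentage-point gaps relative to Weak Direct. The dictionary and function names are illustrative and do not come from the paper.

```python
# Resolved/total counts on SWE-bench Lite as reported in the list above.
RESULTS = {
    "ManagerWorker": (124, 200),
    "Strong Direct": (120, 200),
    "Weak Direct": (101, 200),
    "Simple Loop": (106, 199),
}


def resolve_rate(resolved: int, total: int) -> float:
    """Resolve rate in percent."""
    return 100.0 * resolved / total


def gain_over(baseline: str, config: str) -> float:
    """Percentage-point gap between a configuration and a baseline."""
    return resolve_rate(*RESULTS[config]) - resolve_rate(*RESULTS[baseline])


if __name__ == "__main__":
    for name, (resolved, total) in RESULTS.items():
        print(f"{name}: {resolve_rate(resolved, total):.1f}%")
    for config in ("Simple Loop", "ManagerWorker"):
        print(f"{config} vs Weak Direct: {gain_over('Weak Direct', config):+.1f} pp")
```

Computed from unrounded rates, the gaps come out near 2.8 and 11.5 points. The +2pp and +11pp figures quoted elsewhere correspond to differences between the rounded percentages (53 − 51 and 62 − 51).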
Cost-profile / token usage (reported examples):
- Strong Direct: ~30k strong-model tokens per instance.
- MANAGERWORKER: ~3–7 lightweight text-only manager calls (~6.6k strong-model tokens total) + 4–12 worker agentic sessions (~60k weak-model tokens).
- Simple Loop: fewer manager calls (~3k strong tokens) and fewer worker sessions.
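To make the "fraction of strong-model tokens" concrete, here is a rough calculation from the approximate per-instance figures above. It assumes a single blended per-token price for each model, ignoring input/output price differences and caching. That simplification, and the function names, are assumptions of this sketch rather than anything reported in the paper.

```python
# Approximate per-instance token counts from the reported examples.
STRONG_DIRECT_STRONG_TOKENS = 30_000  # Strong Direct: strong-model tokens
MW_STRONG_TOKENS = 6_600              # ManagerWorker: manager (strong-model) tokens
MW_WEAK_TOKENS = 60_000               # ManagerWorker: worker (weak-model) tokens


def strong_token_share() -> float:
    """ManagerWorker's strong-model tokens as a fraction of Strong Direct's."""
    return MW_STRONG_TOKENS / STRONG_DIRECT_STRONG_TOKENS


def break_even_price_ratio() -> float:
    """Largest weak/strong per-token price ratio at which ManagerWorker
    costs no more per instance than Strong Direct.

    Derived from: MW_STRONG * p_s + MW_WEAK * p_w <= DIRECT * p_s
    """
    return (STRONG_DIRECT_STRONG_TOKENS - MW_STRONG_TOKENS) / MW_WEAK_TOKENS


if __name__ == "__main__":
    print(f"strong-token share: {strong_token_share():.0%}")
    print(f"break-even weak/strong price ratio: {break_even_price_ratio():.2f}")
```

With these figures the manager uses about 22% of the strong-model tokens of the single strong agent, and the pipeline is cheaper per instance as long as the worker's per-token price is below roughly 39% of the strong model's.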
What drives gains:
- Structured exploration and planning (explicit directed tasks + evidence-synthesizing plans) account for most of the improvement (≈+11 percentage points over weak-alone), while mere review provides little (≈+2pp).
- Keeping each model in a mode close to its training (manager: text reasoning; worker: tool use) is critical: in ablations, giving the manager repo access reduced performance.
Failure modes / limits:
- If manager lacks sufficient analytical capability, the pipeline amplifies bad plans (coordination overhead).
- Structure can hurt when there is no capability asymmetry.
- Models lack explicit training in delegation, scoped execution, and switching between “reasoning-only” and “tooling” modes.

Data & Methods

Benchmark: 200-instance subset of SWE-bench Lite (real GitHub issues across Django, Flask, Matplotlib, Scikit-learn, Sphinx, Sympy, etc.). Patches are evaluated by running each project's test suite in Docker.
Models:
- Strong: Claude Sonnet 4.6 (manager in asymmetric configs and single-agent strong baseline).
- Weak: GPT-5-mini (worker and weak single-agent baseline).
- Access via GitHub Copilot CLI / Claude Code SDK to provide consistent agentic tool interfaces.
Pipeline details:
- Phase 1: Manager analysis → generate ≤3 exploration tasks.
- Phases 2–3: Iterative exploration (workers run tasks in parallel and return natural-language reports). Max 3 rounds; on the final round the manager must either produce an implementation plan or proceed.
- Phases 4–5: Iterative implementation. Round 1: guided autonomy (worker adapts); rounds 2+: strict corrective prompts from the manager. Max 3 rounds.
- Simple Loop baseline: ~50-line pipeline where worker freely executes and manager only reviews diffs/reports; no explicit exploration/planning.
- Design choices: tasks capped at 3; workers return summaries rather than raw files; manager kept text-only.
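The control flow described in the phases above can be sketched as follows. This is a minimal illustration, not the authors' implementation. The manager and worker are passed in as plain callables, and every name here (`ManagerDecision`, `Review`, `run_pipeline`, the callable signatures) is hypothetical. Exploration tasks run sequentially for simplicity, although the summary says workers run them in parallel. Falling back to the issue text when no plan is produced, and returning the last diff even when it was never approved, are assumptions of this sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

MAX_TASKS = 3           # cap on exploration tasks per round
MAX_EXPLORE_ROUNDS = 3
MAX_IMPL_ROUNDS = 3


@dataclass
class ManagerDecision:
    tasks: List[str] = field(default_factory=list)  # further exploration tasks
    plan: Optional[str] = None                      # implementation plan, once ready


@dataclass
class Review:
    approved: bool
    corrections: str = ""


def run_pipeline(
    issue: str,
    analyze: Callable[[str, List[str], bool], ManagerDecision],  # manager: text only
    explore: Callable[[str], str],           # worker: task -> natural-language report
    implement: Callable[[str, bool], str],   # worker: (instructions, strict) -> diff
    review: Callable[[str, str], Review],    # manager: (plan, diff) -> verdict
) -> Optional[str]:
    reports: List[str] = []

    # Phase 1: manager analyzes the issue and proposes exploration tasks.
    decision = analyze(issue, reports, False)

    # Phases 2-3: iterative exploration until the manager produces a plan.
    for round_no in range(1, MAX_EXPLORE_ROUNDS + 1):
        if decision.plan is not None:
            break
        reports += [explore(task) for task in decision.tasks[:MAX_TASKS]]
        decision = analyze(issue, reports, round_no == MAX_EXPLORE_ROUNDS)

    # Assumption: with no plan after the final round, work from the issue text.
    plan = decision.plan if decision.plan is not None else issue

    # Phases 4-5: guided first attempt, strict corrections afterwards.
    instructions, diff = plan, None
    for round_no in range(1, MAX_IMPL_ROUNDS + 1):
        diff = implement(instructions, round_no > 1)
        verdict = review(plan, diff)
        if verdict.approved:
            break
        instructions = verdict.corrections
    return diff
```

Note that the manager-side callables only ever see text (issue, reports, diff), while only the worker-side callables would touch the repository. The organizational structure lives in ordinary code rather than in either model's prompt.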
Metrics:
- Resolve rate: percentage of tasks whose tests pass after the patch is applied.
- Empty patch rate and evaluation error rate tracked for failure modes.

Implications for AI Economics

Cost–quality tradeoffs and product design:
- Substituting expensive execution calls with cheaper execution plus expensive-but-lightweight reasoning calls can deliver the same quality at a lower price per task. This suggests new procurement/product strategies: buy/host small, cheap tool-enabled models for heavy I/O and a smaller amount of expensive reasoning-model capacity for planning and review.
- Cloud/token pricing and orchestration architectures should be optimized for mixed-role workflows (many cheap tool calls + a few expensive text-only calls).
Market segmentation and specialization:
- There is commercial value in offering distinct "manager" and "worker" LLM tiers: a text-reasoning tier priced for low-volume high-value calls, and a tool-enabled execution tier priced for high-volume I/O. Vendors could monetize orchestration primitives and manager templates.
Training and R&D priorities:
- Current models lack delegation, scoped execution, and mode-switching skills. Investing in training data and objectives that expose models to explicit delegation examples, supervisory feedback loops, and role-specific dialogues could raise the ceiling for multi-agent systems and reduce coordination overhead.
- Benchmarks should include organizational/role-based tasks (director vs executor) to capture these skills; SWE-bench–style evaluation with multi-agent pipelines is a useful direction.
Labor and organizational impacts:
- Technically, a manager-worker AI architecture mirrors human cost allocation: expensive judgment concentrated in a few calls, execution distributed cheaply. This could amplify substitution effects in which lower-cost AI agents handle execution while higher-cost models (or humans) retain oversight, changing the relative value of different human roles.
Risks and caveats for deployment economics:
- Gains depend on a real capability gap. If cheaper models improve in reasoning or expensive models become cheap, the economic case shifts.
- Coordination overhead and failure amplification when capability gaps are small can increase total cost (wasted worker cycles), so empirical ROI must measure end-to-end costs including failed iterations and debugging.
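One way to fold failed attempts into an end-to-end figure is expected spend per resolved instance: average per-instance cost divided by resolve rate. The sketch below uses the approximate token counts and resolve rates reported in this summary. The per-token prices are purely hypothetical placeholders, not real vendor pricing, and the function name is illustrative.

```python
def cost_per_resolved(
    strong_tokens: float,
    weak_tokens: float,
    resolve_rate: float,
    strong_price: float,
    weak_price: float,
) -> float:
    """Expected spend per resolved instance.

    Token counts are per-instance averages, prices are per token, and
    resolve_rate is a fraction in (0, 1]. Dividing by the resolve rate
    spreads the cost of unresolved instances over the resolved ones.
    """
    if not 0 < resolve_rate <= 1:
        raise ValueError("resolve_rate must be in (0, 1]")
    per_instance = strong_tokens * strong_price + weak_tokens * weak_price
    return per_instance / resolve_rate


# HYPOTHETICAL per-token prices, chosen only to illustrate the calculation.
STRONG_PRICE = 10e-6
WEAK_PRICE = 1e-6

strong_direct = cost_per_resolved(30_000, 0, 0.60, STRONG_PRICE, WEAK_PRICE)
manager_worker = cost_per_resolved(6_600, 60_000, 0.62, STRONG_PRICE, WEAK_PRICE)

if __name__ == "__main__":
    print(f"Strong Direct: {strong_direct:.3f} per resolved instance")
    print(f"ManagerWorker: {manager_worker:.3f} per resolved instance")
```

The comparison flips if the price gap between the two models narrows enough or if the pipeline's resolve rate drops, which is the sensitivity the caveats above describe.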
Strategic recommendation:
- For product teams building coding-assist services, experiment with hybrid stacks (text-only reasoning calls, tool-enabled workers) and measure end-to-end token/time cost. Simultaneously invest in training signals and datasets that teach delegation, scoped instruction following, and iterative supervisory feedback to unlock further multi-agent efficiency gains.

Assessment

Paper Type: descriptive
Evidence Strength: medium. The paper reports controlled, head-to-head experiments across five pipeline configurations on 200 benchmark instances and compares success rates and token usage, which provides reasonably direct empirical evidence about manager-worker pipelines. However, the sample is modest and narrowly focused on SWE-bench Lite software-engineering tasks and specific model pairings; there is limited information about statistical significance, robustness across many model families, task types, or longer-run deployment, which constrains causal claims and external validity.
Methods Rigor: medium. The study uses systematic configuration comparisons (manager/worker role variations, exploration vs review loops) and reports clear quantitative metrics (success rates, token usage), which indicates solid experimental design. It appears to lack broader robustness checks (e.g., many model families, languages, or task distributions), pre-registered hypotheses, detailed statistical testing, and real-world deployment tests; training-data and model-architecture sensitivities are not fully probed.
Sample: 200 problem instances drawn from SWE-bench Lite (software-engineering tasks); five pipeline configurations varying manager-worker relationship and pipeline complexity; model pairings include 'strong' (expensive, text-only) and 'weak' (cheaper, tool-enabled with repo access) agents; evaluation metrics include task success rate and token usage (cost proxy) across configurations.
Themes: productivity, org_design
Generalizability:
- Limited to software-engineering tasks from SWE-bench Lite; may not extend to other task domains (e.g., open-ended writing, scientific reasoning).
- Results may be model-family specific (depends on exact strong/weak models used) and on the text-only vs tool-enabled split implemented.
- Benchmarks are modest (200 instances); may not reflect long-run developer workflows, team interactions, or rare/complex bugs.
- Token-usage cost savings reported may not map directly onto real-world pricing, latency, or infrastructure constraints.
- Experiments do not measure human-in-the-loop outcomes or organizational adoption barriers, limiting applicability to workplace productivity estimates.
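Since the assessment notes limited information on statistical significance, a rough check can be run on the reported counts alone. The sketch below applies a standard pooled two-proportion z-test. This is not an analysis from the paper, and it treats the two samples as independent, even though the configurations were evaluated on the same instances. A paired test such as McNemar's would be more appropriate but needs per-instance outcomes, which this summary does not provide.

```python
import math
from typing import Tuple


def two_proportion_z(x1: int, n1: int, x2: int, n2: int) -> Tuple[float, float]:
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # ManagerWorker (124/200) vs Strong Direct (120/200)
    print(two_proportion_z(124, 200, 120, 200))
    # ManagerWorker (124/200) vs Weak Direct (101/200)
    print(two_proportion_z(124, 200, 101, 200))
```

Under these assumptions the 62% vs 60% comparison gives z of roughly 0.4, so the two are not distinguishable at this sample size, which fits the reading that the pipeline "matches" the strong agent rather than beating it. The comparison against Weak Direct gives z of roughly 2.3.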

Claims (6)

Claim	Topic	Direction	Confidence	Outcome	Details
A strong manager directing a weak worker achieves a 62% success rate on software-engineering tasks, matching a strong single agent which achieves 60%, while using a fraction of the strong-model token usage.	Developer Productivity	positive	high	task success rate (percentage of tasks solved)	n=200 62% success (manager->worker) vs 60% success (strong single agent); roughly 6.6k vs 30k strong-model tokens per instance in the reported examples 0.18
A weak manager directing a weak worker achieves a 42% success rate, performing worse than the weak agent alone which achieves 44%.	Developer Productivity	negative	high	task success rate (percentage of tasks solved)	n=50 (ablation subset) 42% success (weak manager->weak worker) vs 44% success (weak agent alone) 0.18
A minimal review-only manager loop adds only 2 percentage points over the baseline, whereas structured exploration and planning by the manager add 11 percentage points, demonstrating that active direction (not mere reviewing) produces most of the benefit.	Developer Productivity	positive	high	improvement in task success rate (percentage-point increase)	n=200 review-only: +2 percentage points; structured exploration/planning: +11 percentage points 0.18
The observed behaviors stem from a root cause: current models are trained as monolithic agents, so splitting them into director/worker roles conflicts with their training distribution; retaining each model close to its trained mode (text generation for the manager, tool use for the worker) and externalizing organizational structure to code enables the pipeline to succeed.	Training Effectiveness	mixed	medium	compatibility between model training distribution and assigned role (qualitative performance/behavioral explanation)	0.02
The ManagerWorker two-agent pipeline (expensive text-only manager + cheaper worker with repo access) can substitute expensive execution by using expensive reasoning in the manager and cheaper execution in the worker.	Organizational Efficiency	positive	high	ability to substitute expensive execution with expensive reasoning (operationalized as task success rate parity with lower strong-model token usage)	n=200 success parity (62% vs 60%) with reduced strong-model token usage (roughly 6.6k vs 30k tokens per instance in the reported examples) 0.18
Identified concrete training gaps in current models: delegation, scoped execution, and mode switching are skills absent from current training data and limit splitting models into manager/worker roles.	Training Effectiveness	neutral	medium	presence/absence of specific training capabilities in model training data (delegation, scoped execution, mode switching)	0.02