Hybrid multi-agent AI can reduce cost and energy without uniformly improving accuracy: combining on-device small models with cloud LLM assistance shifts systems along a cost–performance–energy frontier, and the best architecture depends heavily on the task rather than simply adding more compute.

When Cloud Agents Meet Device Agents: Lessons from Hybrid Multi-Agent Systems

Corrado Rainone, Davide Belli, Bence Major, Arash Behboodi · May 28, 2026

arXiv · Paper type: descriptive · Evidence strength: n/a · Relevance: 7/10

Systematic experiments show hybrid multi-agent systems combining on-device SLMs and cloud LLMs can occupy different points on the joint Pareto frontier of accuracy, cost, and edge energy, but the optimal hybrid architecture is highly task-dependent and more frontier compute does not always yield better performance.

The design space of agentic AI inference spans two extremes: frontier large language models (LLMs), typically hosted in the cloud and offering strong performance across a wide range of tasks at substantially high cost, and more cost-efficient small language models (SLMs), which are amenable to on-device inference. Hybrid multi-agent systems (MASs) combining on-device and cloud models offer a promising middle ground, but they also introduce a complex and poorly understood design space in which task accuracy, monetary cost, and edge energy consumption are tightly coupled; in the absence of general design principles, hybrid components, although not the most prevalent choice, are typically introduced through ad hoc decisions tailored to specific domains. In this work, we examine this design space more systematically. We adapt two representative MAS architectures to support hybrid inference and study how individual design choices shift the operating point along the Pareto frontier of power, cost, and performance. Our findings paint a nuanced picture of hybrid MAS design: while SLMs can effectively benefit from LLM assistance, the optimal architecture is highly task-dependent, and greater frontier-level compute does not consistently translate to better performance.

Summary

Main Finding

Hybrid multi-agent systems (MASs) that combine on-device small language models (SLMs) as Executors with intermittent cloud-based Supervisors can reach better accuracy–cost–energy trade-offs than monolithic edge-only or cloud-only deployments. However, there is no single best hybrid architecture: the optimal design (e.g., PEVR vs. EVA, supervision frequency, summarization/restart policy) is highly task-dependent, and more cloud compute does not always improve performance. Context resets and summarization are key to making hybrid MASs feasible on memory-constrained edge devices.

Key Points

Architectures studied
- PEVR (Plan–Execute–Verify–Replan): cloud Supervisor produces an initial plan and intervenes by issuing replans when execution deviates. Strong orchestration/planner emphasis.
- EVA (Execute–Verify–Advise): on-device Executor runs opportunistically; cloud Supervisor periodically verifies and issues summaries/advice rather than explicit replans.
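The two control loops can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's pseudocode: `run_hybrid`, `Verdict`, `Trace`, the `planner` argument, and the stub functions are hypothetical names, and in a real system the Executor would call an on-device SLM and the Supervisor a cloud LLM. Passing a `planner` gives a PEVR-like loop (initial cloud plan, later advice treated as a replan); omitting it gives an EVA-like loop (advice treated as a summary).

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Verdict:
    done: bool = False            # Supervisor judges the task complete
    advice: Optional[str] = None  # EVA: summary/advice; PEVR: a new plan


@dataclass
class Trace:
    context: List[str] = field(default_factory=list)
    cloud_calls: int = 0


def run_hybrid(
    task: str,
    executor_step: Callable[[List[str]], str],
    supervisor_verify: Callable[[List[str]], Verdict],
    interval: int,
    max_turns: int,
    planner: Optional[Callable[[str], str]] = None,
) -> Trace:
    """Run an Executor loop with a Supervisor check every `interval` turns."""
    trace = Trace(context=[task])
    if planner is not None:
        # PEVR-like: one cloud call up front to produce the initial plan.
        trace.context.append(planner(task))
        trace.cloud_calls += 1
    for turn in range(1, max_turns + 1):
        # Token-heavy step runs on-device.
        trace.context.append(executor_step(trace.context))
        if turn % interval == 0:
            verdict = supervisor_verify(trace.context)
            trace.cloud_calls += 1
            if verdict.done:
                break
            if verdict.advice is not None:
                # Assumed restart policy: reset context to task + advice.
                trace.context = [task, verdict.advice]
    return trace


# Hypothetical stubs so the sketch runs without any model.
def fake_executor(context: List[str]) -> str:
    return f"step-{len(context)}"


def fake_supervisor(context: List[str]) -> Verdict:
    return Verdict(done=False, advice="summary")


eva_trace = run_hybrid("task", fake_executor, fake_supervisor, interval=2, max_turns=6)
print(eva_trace.cloud_calls, eva_trace.context)
```

The `interval` argument plays the role of the verification interval swept in the experiments: a larger value means fewer cloud calls and a longer stretch of unsupervised on-device execution.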
Benchmarks / domains
- HotpotQA: short-horizon multi-hop QA (ROUGE-1 F1 reported).
- FanOutQA: longer multi-hop / information aggregation (ROUGE-1 F1).
- AppWorld: stateful GUI/API interaction tasks (Test Pass Ratio, Task Success).
Models and setup
- Cloud supervisor: GPT-4o (Azure).
- Edge Executors: Qwen3 family (4B, 8B, 14B, 32B) run via vLLM; fp8 quantization for 32B to fit on one A100.
- Max ReAct turns: 10 (HotpotQA), 20 (FanOutQA), 40 (AppWorld). Verification intervals swept over small sets of values (e.g., [1,2,3,5]).
Metrics of interest
- Task performance (ROUGE-1, Test Pass Ratio, Task Success).
- Monetary cloud API cost (USD) computed from model pricing.
- Edge energy consumption (Joules) estimated with a simple energy model (no physical device runs).
- Context efficiency (max KV-cache footprint).
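For readers unfamiliar with the QA metric, ROUGE-1 F1 is the harmonic mean of unigram precision and recall between a predicted and a reference answer. The sketch below is a simplified version, assuming lowercase whitespace tokenization with no stemming; standard ROUGE packages apply more careful normalization, so scores may differ slightly from those reported in the paper. The function name `rouge1_f1` is illustrative.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 with naive whitespace tokenization (an assumption)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each unigram counts at most as often as in the reference.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(rouge1_f1("the cat sat", "the cat"))
```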
Core empirical findings
- Hybrid MAS configurations can outperform monolithic edge agents while costing less than a cloud-only solution.
- PEVR outperforms EVA on UI / AppWorld tasks (plan-based orchestration fits stateful, sequenced interactions).
- EVA outperforms PEVR on deep-search QA tasks (advisory/summarization paradigms suit retrieval/aggregation).
- Increasing cloud supervision frequency (more cloud calls) has diminishing returns and can degrade performance past an optimal point.
- Executor model size positively correlates with performance.
- Context resets and summarization reduce KV-cache growth and cut edge energy consumption (reported up to ~3× energy savings vs monolithic edge).
Mechanistic insights
- Supervision frequency, restart policies (context reset on intervention), and summarization quality explain when hybridization helps or harms long-horizon reasoning.
- Summarization-based resets (used in EVA) lead to larger energy savings than simple replanning approaches because they bound KV-cache growth more aggressively.
- The distribution and timing of supervisor interventions matter: fewer well-timed interventions can beat many frequent but noisy interventions.
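Why resets bound KV-cache growth can be seen in a toy model. The sketch below is an illustration only, not the paper's energy model: it assumes each Executor step appends a fixed number of tokens, that a reset shrinks the context to the base prompt plus a summary, and that the sum of context lengths over all steps is a rough proxy for attention work. All names and numbers are hypothetical.

```python
from typing import Optional, Tuple


def context_profile(
    turns: int,
    tokens_per_step: int,
    base_tokens: int,
    reset_interval: Optional[int] = None,
    summary_tokens: int = 0,
) -> Tuple[int, int]:
    """Return (peak context length, sum of context lengths read per step)."""
    ctx = base_tokens
    peak = ctx
    total = 0
    for turn in range(1, turns + 1):
        total += ctx            # tokens the model attends over at this step
        ctx += tokens_per_step  # the step's output is appended to the context
        peak = max(peak, ctx)
        if reset_interval and turn % reset_interval == 0:
            # Summarization-based reset: keep only the prompt and a summary.
            ctx = base_tokens + summary_tokens
    return peak, total


no_reset = context_profile(turns=4, tokens_per_step=100, base_tokens=50)
with_reset = context_profile(
    turns=4, tokens_per_step=100, base_tokens=50, reset_interval=2, summary_tokens=20
)
print(no_reset, with_reset)
```

Without resets the peak grows linearly with the number of turns; with resets it is capped at roughly the base prompt, the summary, and one interval's worth of steps, which is the mechanism the findings above attribute the energy savings to.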

Data & Methods

Architectural adaptation: Two representative MAS architectures (PEVR, EVA) were extended to cloud–edge hybrid settings. Pseudocode and a strict role separation (Supervisor vs. Executor) keep token-heavy execution on-device.
Benchmarks and datasets:
- HotpotQA (first 500 validation questions, fullwiki) for multi-hop QA.
- FanOutQA for fan-out/many-document aggregation reasoning.
- AppWorld for simulated interactive multi-step application tasks.
Models and infrastructure:
- Cloud: GPT-4o accessed via Azure.
- Edge: Qwen3 {4B, 8B, 14B, 32B} via vLLM; fp8 quantization for the 32B model to reduce its GPU footprint.
Evaluation sweep:
- Varied verification interval (number of executor steps between cloud verifications) to trace the Pareto frontier across performance, cloud dollars, and edge Joules.
- Compared monolithic edge-only, monolithic cloud-only, and MAS variants.
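Tracing the frontier from such a sweep amounts to filtering out dominated configurations. The following is a minimal sketch, assuming each configuration is summarized by one accuracy value (higher is better) and one cost and one energy value (lower is better). The `Config` type and the numbers in the example are made up for illustration and are not results from the paper.

```python
from typing import List, NamedTuple


class Config(NamedTuple):
    name: str
    accuracy: float   # higher is better
    cost_usd: float   # lower is better
    energy_j: float   # lower is better


def dominates(a: Config, b: Config) -> bool:
    """True if `a` is at least as good as `b` everywhere and better somewhere."""
    no_worse = (
        a.accuracy >= b.accuracy
        and a.cost_usd <= b.cost_usd
        and a.energy_j <= b.energy_j
    )
    strictly_better = (
        a.accuracy > b.accuracy
        or a.cost_usd < b.cost_usd
        or a.energy_j < b.energy_j
    )
    return no_worse and strictly_better


def pareto_front(configs: List[Config]) -> List[Config]:
    """Keep only configurations that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


# Hypothetical sweep results, for illustration only.
sweep = [
    Config("edge_only", 0.40, 0.00, 900.0),
    Config("cloud_only", 0.70, 0.05, 0.0),
    Config("hybrid_interval_2", 0.65, 0.02, 300.0),
    Config("hybrid_interval_1", 0.60, 0.04, 350.0),
]
print([c.name for c in pareto_front(sweep)])
```

In this made-up example the interval-1 hybrid is dominated by the interval-2 hybrid, mirroring the qualitative point that more frequent supervision can be worse on every axis.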
Cost / energy accounting:
- Cloud API costs computed from model pricing tables (Appendix A.2 in paper).
- Edge energy estimated via a simple energy model (Appendix A.1); experiments were not actually run on consumer devices.
- KV-cache / context footprint measured analytically per trajectory (Appendix A.3).
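The cost and footprint accounting can be approximated with two short formulas. This sketch is not taken from the paper's appendices: it assumes per-million-token API pricing and the usual estimate of KV-cache size for a transformer (keys and values stored per layer, per KV head, per token). The prices and architecture numbers in the example are placeholders, not GPT-4o or Qwen3 values.

```python
def api_cost_usd(
    input_tokens: int,
    output_tokens: int,
    usd_per_m_input: float,
    usd_per_m_output: float,
) -> float:
    """Cloud cost from token counts and per-million-token prices."""
    return (
        input_tokens / 1e6 * usd_per_m_input
        + output_tokens / 1e6 * usd_per_m_output
    )


def kv_cache_bytes(
    seq_len: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_elem: int = 2,
) -> int:
    """KV-cache size for one sequence: 2 tensors (K and V) per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


# Placeholder prices and model dimensions, for illustration only.
print(api_cost_usd(1_000_000, 500_000, usd_per_m_input=2.0, usd_per_m_output=8.0))
print(kv_cache_bytes(seq_len=1000, n_layers=10, n_kv_heads=4, head_dim=64))
```

Because the footprint is linear in `seq_len`, the maximum context length reached in a trajectory determines the peak KV-cache memory the device must provide.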
Limitations of methods:
- Energy numbers are estimates (no on-device power measurements).
- The model set and tasks are representative but not exhaustive; the Supervisor is fixed to GPT-4o.
- Experiments focus on verification-interval sweeps and do not exhaustively explore all hybrid routing policies.

Implications for AI Economics

Operational cost vs performance: Hybrid MASs allow organizations to hit intermediate points on the performance–cost frontier. Firms can avoid the full subscription/API costs of always-on frontier LLMs while obtaining most of their benefits through intermittent supervision.
Pricing products & monetization opportunities:
- Vendors could offer intervention-based pricing (e.g., pay-per-supervision / pay-per-advice) that aligns with hybrid MAS usage patterns.
- Tiered offerings (device SDKs + optional supervisor credits) could capture customers who want local execution with occasional cloud assistance.
Diminishing returns to cloud compute: More frequent or heavier cloud supervision is not always beneficial; marginal value of extra cloud calls can be negative. This argues for careful metering and for customers to tune supervision frequency to balance marginal benefit and cost.
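One way to make this tuning concrete is to pick the supervision interval that maximizes net value over a measured sweep. The sketch below is an assumption-laden illustration: it supposes a deployer can assign a dollar value to task accuracy (`value_per_point`) and has per-interval accuracy and cloud cost measurements. The function name and all numbers are hypothetical.

```python
from typing import Dict, Tuple


def best_interval(
    sweep: Dict[int, Tuple[float, float]],
    value_per_point: float,
) -> int:
    """Pick the interval maximizing value_per_point * accuracy - cloud cost.

    `sweep` maps a verification interval to (accuracy, cloud_cost_usd).
    """
    def net_value(item: Tuple[int, Tuple[float, float]]) -> float:
        _, (accuracy, cost_usd) = item
        return value_per_point * accuracy - cost_usd

    return max(sweep.items(), key=net_value)[0]


# Hypothetical measurements: interval 1 costs the most but is not the most accurate.
measurements = {1: (0.62, 0.040), 2: (0.65, 0.020), 5: (0.58, 0.008)}
print(best_interval(measurements, value_per_point=0.5))
print(best_interval(measurements, value_per_point=0.05))
```

When accuracy is worth a lot, the mid-range interval wins; when it is worth little, the sparsest supervision wins. In neither case is the most frequent supervision the best choice in this made-up example.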
Investment signals for SLMs and edge compute:
- Improved SLMs and summarization techniques are highly valuable economically: stronger edge executors reduce supervisor intervention needs and cloud spend.
- Firms building device-optimized LMs (or compression/summarization modules) can unlock substantial cost savings for long-horizon agentic workloads.
Energy and externalities:
- Hybrid MASs can reduce on-device energy usage (savings of up to ~3× reported in the experiments), which has implications for battery-limited devices and sustainability accounting for edge-heavy deployments.
- Economic decision-making should consider energy costs alongside API pricing (device battery life or server power costs).
Product/design recommendations for deployers:
- Choose architecture by domain: plan-driven PEVR for interactive, stateful UI/automation; advisor-focused EVA for information aggregation and deep search.
- Expose and tune a supervision interval parameter to users or system administrators to balance cost and accuracy.
- Invest in summarization and restart policies to control KV-cache growth and energy costs.
Regulatory & privacy considerations:
- Hybrid designs offer trade-offs for data sovereignty: more on-device execution minimizes data exfiltration and recurring API exposure, potentially reducing compliance costs.
- Conversely, intermittent cloud supervision still creates points where sensitive data may be sent to third-party services; pricing and contracting should reflect the differing compliance risk.
For LLM vendors and cloud providers:
- There is business value in offering “advisor APIs” optimized for occasional, high-quality interventions (higher per-call value but lower overall usage).
- Offering compact summarization and “state-compression” primitives could become a competitive advantage for hybrid deployments.

Caveats and open questions
- Energy estimates (not measured on-device) and limited model/task coverage mean quantitative results should be treated as indicative rather than definitive.
- Future work is needed on finer-grained routing policies, learned intervention triggers, cost-aware training/fine-tuning of SLMs for Executor roles, and real-device energy/latency measurements to refine economic models.

Assessment

Paper Type: descriptive
Evidence Strength: n/a. The paper is an empirical systems/design exploration of trade-offs (accuracy, monetary cost, energy) rather than a causal analysis of economic outcomes; it reports measured performance across architectures rather than identifying causal effects in an econometric sense.
Methods Rigor: medium. The authors systematically adapt two representative multi-agent architectures and measure key metrics (accuracy, cost, edge energy) across hybrid configurations, which demonstrates careful empirical work; however, rigor is limited by reliance on a limited set of architectures/models/tasks, potential sensitivity to specific hardware and pricing assumptions, and absence of robustness checks across broader real-world deployment scenarios.
Sample: Empirical experiments adapting two representative multi-agent system architectures to support hybrid inference combining cloud-hosted LLM(s) and on-device small language models (SLMs); evaluated across multiple tasks/domains with measured outcomes of task accuracy, monetary inference cost, and edge energy consumption (exact models, datasets, hardware, and task list not specified in the summary).
Themes: adoption, productivity
Generalizability:
- Results are specific to the two MAS architectures studied; other architectures may show different trade-offs.
- Dependent on the particular LLM and SLM models used; performance/cost/energy vary substantially across model choices.
- Hardware, on-device energy profiles, and cloud pricing assumptions may not generalize across regions or deployment settings.
- Task-dependent findings limit transferability to tasks not evaluated (e.g., highly specialized or real-time applications).
- Lab/experimental conditions may not capture production issues like concurrency, network variability, security, and maintenance costs.

Claims (9)

Claim	Category	Direction	Confidence	Outcome	Details
Frontier large language models (LLMs), typically hosted in the cloud, offer strong performance across a wide range of tasks at substantially high cost.	Output Quality	positive	high	task performance (accuracy/quality) of LLMs and associated monetary cost	0.18
Small language models (SLMs) are more cost-efficient and amenable to on-device inference.	Organizational Efficiency	positive	high	monetary cost and feasibility of on-device inference	0.18
Hybrid multi-agent systems (MASs) combining on-device and cloud models offer a promising middle ground between LLMs and SLMs.	Organizational Efficiency	positive	high	trade-offs among task performance, monetary cost, and edge energy consumption in hybrid MASs	0.18
Task accuracy, monetary cost, and edge energy consumption are tightly coupled in hybrid MAS design.	Organizational Efficiency	mixed	high	task accuracy, monetary cost, edge energy consumption (multi-dimensional trade-off)	0.18
In the absence of general design principles, hybrid components are typically introduced through ad hoc decisions tailored to specific domains.	Adoption Rate	negative	high	design practices for hybrid MAS component selection	0.09
We adapt two representative MAS architectures to support hybrid inference and study how individual design choices shift the operating point along the Pareto frontier of power, cost, and performance.	Organizational Efficiency	positive	high	power consumption, monetary cost, and task performance as points on a Pareto frontier	0.3
SLMs can effectively benefit from LLM assistance.	Output Quality	positive	high	task performance of SLMs when augmented/assisted by LLMs	0.18
The optimal architecture is highly task-dependent.	Task Allocation	mixed	high	relative performance of MAS architectures across different tasks	0.18
Greater frontier-level compute does not consistently translate to better performance.	Output Quality	null_result	high	task performance as a function of available compute at the frontier	0.18