ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox

Current LLM agents are proficient at calling isolated APIs but struggle with the "last mile" of commercial software automation. In real-world scenarios, tools are not independent; they are atomic, interdependent, and prone to environmental noise. We introduce ComplexMCP, a benchmark designed to evaluate agents in these rigorous conditions. Built on the Model Context Protocol (MCP), ComplexMCP provides over 300 meticulously tested tools derived from 7 stateful sandboxes, ranging from office suites to financial systems. Unlike existing datasets, our benchmark utilizes a seed-driven architecture to simulate dynamic environment states and unpredictable API failures, ensuring a deterministic yet diverse evaluation. We evaluate various LLMs across full-context and RAG paradigms, revealing a stark performance gap: even top-tier models fail to exceed a 60% success rate, trailing far behind human performance of 90%. Granular trajectory analysis identifies three fundamental bottlenecks: (1) tool retrieval saturation as action spaces scale; (2) over-confidence, where agents skip essential environment verifications; and (3) strategic defeatism, a tendency to rationalize failure rather than pursuing recovery. These findings underscore the insufficiency of current agents for interdependent workflows, positioning ComplexMCP as a critical testbed for the next generation of resilient autonomous systems.

Summary

Main Finding

ComplexMCP exposes a large gap between current LLM agents and human operators when orchestrating many interdependent, stateful tools under noisy, dynamic conditions. Even top models (best: Gemini-3-Flash) achieve only ≈55% task success vs. ≈94% for humans. The benchmark reveals three core agent failure modes—tool-retrieval saturation, over-confidence (skipping verifications), and strategic defeatism (prematurely giving up)—and highlights practical cost/latency bottlenecks from full-context iterative prompting.

Key Points

Benchmark design
- ComplexMCP is built on the Model Context Protocol (MCP) and integrates:
  - 150 interdependent tools across 7 stateful sandboxes (LightOS, LightTalk, LightShop, LightWeather, LightFlight, LightStock, LightNews).
  - 150 additional stateless APIs (≈300+ MCP tools total).
- Uses a seed-driven architecture: a single seed deterministically controls environment initialization and runtime perturbations (API failures, latency), giving reproducible yet diverse scenarios.
- Evaluation is deterministic and rule-based: compares nested-dictionary environment state transitions to ground truth (no LLM-as-judge).
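
The seed-driven perturbation idea above can be illustrated with a minimal sketch. It is not the benchmark's implementation: the class name `SeededFaultInjector`, the `failure_rate` parameter, and the shape of the error payload are assumptions made for this example. The only point it shows is that a private PRNG seeded once per episode makes "random" API failures repeatable.

```python
import random


class SeededFaultInjector:
    """Decides, reproducibly, whether each tool call fails (hypothetical helper)."""

    def __init__(self, seed: int, failure_rate: float = 0.2):
        # A private PRNG per episode: no dependence on global random state.
        self._rng = random.Random(seed)
        self.failure_rate = failure_rate

    def call(self, tool, *args, **kwargs):
        # One draw per call, so the failure pattern depends only on the
        # seed and the order of calls.
        if self._rng.random() < self.failure_rate:
            return {"ok": False, "error": "simulated transient API failure"}
        return {"ok": True, "result": tool(*args, **kwargs)}


def failure_pattern(seed: int, n_calls: int = 10) -> list[bool]:
    """Return the ok/failed pattern for n_calls no-op tool calls."""
    injector = SeededFaultInjector(seed)
    return [injector.call(lambda: None)["ok"] for _ in range(n_calls)]
```

Calling `failure_pattern(7)` twice yields the same list, which is what makes a failed run replayable; changing the seed generally changes which calls fail.
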
Task set and complexity
- 47 manually curated instructions; each requires multi-tool coordination (some needing 30+ distinct tools and 60+ calls in the gold trajectory).
- Success defined strictly: Completion Rate Rc = 1 and Misbehaving Rate Rb = 0 (Rc = correctly modified elements / required changes; Rb = unintended changed elements / required changes).
Empirical results (representative)
- Human baseline success ≈ 93.61%.
- Top model: Gemini-3-Flash success 55.31% (Completion Rate ≈85.8%).
- Other examples: GPT-4o 14.9% success; GPT-5.1 19.1%; Gemini-3-pro 44.7%; Claude-sonnet-4 ≈38.3% (table truncated in paper excerpt).
Failure modes identified
- Tool retrieval saturation: semantic retrieval or upfront tool selection fails to cover prerequisite or latent dependencies as action space scales.
- Over-confidence / verification skipping: agents often omit environment checks and thus apply invalid actions or fail to detect side effects.
- Strategic defeatism: when errors occur, agents sometimes rationalize and stop searching for recovery paths instead of attempting alternative strategies.
Prompting & cost bottlenecks
- ReAct full-context style causes heavy token repetition: the system prompt documenting 300 tools (≈30k tokens) is re-submitted on every iteration (e.g., ~12 invocations ≈ 360k prompt tokens from the system prompt alone, plus the associated prefill overhead).
- Token breakdown example: Prompt ≈29,964 tokens; LLM generation ≈901 tokens; tool feedback ≈1,750 tokens. Repetition multiplies cost and runtime.
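
The repetition effect can be made concrete with a little arithmetic. The sketch below assumes the ≈29,964-token prompt is re-sent unchanged on each of ~12 iterations; the `history_growth_per_step` parameter is a hypothetical knob for the extra history that accumulates between iterations and is not a figure from the paper.

```python
def cumulative_input_tokens(system_tokens: int, steps: int,
                            history_growth_per_step: int = 0) -> int:
    """Total input tokens submitted across `steps` full-context iterations.

    Iteration i (0-indexed) re-sends the static system prompt plus
    i * history_growth_per_step tokens of accumulated history.
    """
    return sum(system_tokens + i * history_growth_per_step for i in range(steps))


# Static prompt only, using the figures quoted above.
static_only = cumulative_input_tokens(29_964, 12)  # 12 * 29,964 = 359,568
```

Under these assumptions the static tool documentation alone accounts for 359,568 input tokens in a 12-step trajectory, before any history growth is counted.
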
Mitigations evaluated
- RAG and iterative-RAG (retrieve-tools on demand) were tested to reduce action-space and prompt overhead; but semantic retrieval alone may omit logically required tools and dependencies.
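
One way to picture the gap noted above is to expand the semantically retrieved tools with their transitive prerequisites. This is a sketch of that idea, not a method from the paper: the tool names and the `PREREQUISITES` map are hypothetical, and a real system would need to obtain the dependency graph from somewhere.

```python
from collections import deque


def expand_with_prerequisites(retrieved, prerequisites):
    """Return the retrieved tools plus all transitive prerequisites.

    Breadth-first traversal; the result keeps discovery order and
    contains each tool once, so cyclic dependencies terminate.
    """
    selected = list(dict.fromkeys(retrieved))  # de-duplicate, keep order
    seen = set(selected)
    queue = deque(selected)
    while queue:
        tool = queue.popleft()
        for dep in prerequisites.get(tool, ()):
            if dep not in seen:
                seen.add(dep)
                selected.append(dep)
                queue.append(dep)
    return selected


# Hypothetical dependency graph: tool -> tools that must run before it.
PREREQUISITES = {
    "book_flight": ["search_flights", "get_payment_method"],
    "search_flights": ["login"],
    "get_payment_method": ["login"],
}
```

Here a retriever that returns only `book_flight` would be expanded to `["book_flight", "search_flights", "get_payment_method", "login"]`, covering tools a purely semantic match might miss.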

Data & Methods

Formalization
- Task M = ⟨S, T, I, σ, G, Φ⟩ where S = state space, T = toolset (each tool t: S × A → S × O), I = instruction, σ = seed, G = goal, Φ = evaluation function.
- Seed-driven initialization: s0 = Sample(C; PRNG(σ)) where C is a synthetic KB (entities generated to mimic real-world distributions).
- Tools are interdependent: tool validity can require state attributes produced by prior tools.
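
A toy rendering of this formalization may help. Everything named here (`CATALOG`, `login`, `place_order`, the state keys) is invented for illustration and does not come from the benchmark; the sketch only shows seeded sampling of s0 and a tool whose validity depends on state produced by an earlier tool.

```python
import copy
import random
from typing import Any

State = dict[str, Any]

# Hypothetical synthetic knowledge base C.
CATALOG = {"users": ["alice", "bob", "carol"], "balances": [50, 100, 250]}


def sample_initial_state(catalog: dict, seed: int) -> State:
    """s0 = Sample(C; PRNG(seed)): same seed, same initial state."""
    rng = random.Random(seed)
    return {
        "user": rng.choice(catalog["users"]),
        "balance": rng.choice(catalog["balances"]),
        "session": None,
        "orders": [],
    }


def login(state: State, action: dict) -> tuple[State, str]:
    """Tool t: S x A -> S x O. Produces the session other tools require."""
    new_state = copy.deepcopy(state)
    new_state["session"] = {"user": state["user"]}
    return new_state, "logged in"


def place_order(state: State, action: dict) -> tuple[State, str]:
    """Valid only if a prior tool (login) has created a session."""
    if state["session"] is None:
        return state, "error: login required"
    if action["price"] > state["balance"]:
        return state, "error: insufficient balance"
    new_state = copy.deepcopy(state)
    new_state["balance"] -= action["price"]
    new_state["orders"].append(action["item"])
    return new_state, "order placed"
```

Calling `place_order` on a fresh state returns an error observation and leaves the state unchanged; calling it after `login` succeeds, which is the kind of latent dependency the bullet above describes.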
Servers
- Stateless MCP servers: single-call semantics (math, conversions).
- Stateful MCP servers: persist nested-dictionary session state; deterministic state transitions on every action.
Evaluation metrics
- For a chosen set of key-paths K, compute:
  - T = number of required state changes (elements that must change).
  - M = number of correctly modified elements.
  - Mb = number of misbehaving (unintended) element changes.
  - Completion Rate Rc = M / T.
  - Misbehaving Rate Rb = Mb / T.
- Trajectory correctness = Rc == 1 and Rb == 0.
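
The metrics above can be sketched as a small scorer over nested dictionaries. The details are assumptions rather than the paper's evaluator: key paths are tuples of keys, every path is assumed to exist in all three states, a required element counts as correct only if it equals the gold value, and an unintended change is any non-required key path whose final value differs from its initial value.

```python
def get_path(state: dict, path: tuple):
    """Follow a tuple of keys into a nested dictionary."""
    node = state
    for key in path:
        node = node[key]
    return node


def score(initial: dict, final: dict, gold: dict, key_paths: list[tuple]) -> dict:
    """Compute Rc = M / T and Rb = Mb / T over the chosen key paths K."""
    required = [p for p in key_paths if get_path(gold, p) != get_path(initial, p)]
    others = [p for p in key_paths if p not in required]
    T = len(required)
    if T == 0:
        raise ValueError("task requires no state change; Rc and Rb are undefined")
    M = sum(get_path(final, p) == get_path(gold, p) for p in required)
    Mb = sum(get_path(final, p) != get_path(initial, p) for p in others)
    rc, rb = M / T, Mb / T
    return {"Rc": rc, "Rb": rb, "success": rc == 1 and rb == 0}


initial = {"cart": {"items": 0}, "account": {"balance": 100, "email": "a@x.com"}}
gold = {"cart": {"items": 1}, "account": {"balance": 90, "email": "a@x.com"}}
final = {"cart": {"items": 1}, "account": {"balance": 90, "email": "b@x.com"}}
K = [("cart", "items"), ("account", "balance"), ("account", "email")]
result = score(initial, final, gold, K)
```

In this example both required changes are made (Rc = 1.0), but the email was altered unintentionally (Rb = 0.5), so the trajectory is not counted as a success.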
Instruction set
- 47 tasks with unique deterministic gold trajectories (no tool hints in instructions).
Models & protocols
- Evaluated many commercial/state-of-the-art LLMs (GPT-4o, GPT-5.1, Gemini series, Claude series, Llama-3 variants, Qwen3-Max, etc.).
- Baseline prompting: ReAct (full-context). Also tested RAG and iterative-RAG retrieval strategies.
Reproducibility & ethics
- Deterministic seed mechanism ensures reproducible stochastic perturbations.
- Conflict of interest: several authors affiliated with Alibaba (work done there); Qwen-3-Max (an Alibaba model) was among evaluated models.

Implications for AI Economics

Automation value-at-risk and ROI
- With top models ≈55% success on complex interdependent workflows vs. ≈94% human success, the near-term ROI for fully autonomous deployment in enterprise workflows is limited. Enterprises should expect substantial human-in-the-loop costs (oversight, exception handling) until agent reliability improves.
- Cost-per-success matters more than raw model price: repeated token costs and failure-handling effort inflate marginal cost of each completed task. Benchmarks like ComplexMCP make it possible to compute "cost per successful automation" accurately.
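
A back-of-the-envelope version of "cost per successful automation" is sketched below. It assumes independent attempts, ignores human oversight cost, and uses purely hypothetical per-million-token prices; only the token counts and the 55.31% success rate are taken from figures quoted elsewhere in this summary.

```python
def cost_per_success(input_tokens: int, output_tokens: int,
                     price_in_per_mtok: float, price_out_per_mtok: float,
                     success_rate: float) -> float:
    """Expected token spend per successful task: attempt cost / success rate."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    attempt_cost = (input_tokens / 1e6 * price_in_per_mtok
                    + output_tokens / 1e6 * price_out_per_mtok)
    return attempt_cost / success_rate


# Hypothetical prices (USD per million tokens); not from any vendor price list.
estimate = cost_per_success(
    input_tokens=359_568,   # 12 re-submissions of a ~29,964-token prompt
    output_tokens=901,
    price_in_per_mtok=1.0,
    price_out_per_mtok=4.0,
    success_rate=0.5531,
)
```

Dividing by the success rate is what makes a cheaper but less reliable model potentially more expensive per completed task than a pricier, more reliable one.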
Pricing and API economics
- Repeated full-context submission massively increases billed tokens (prefill and cached input tiers). LLM API pricing structures (uncached vs cached vs output) interact nonlinearly with agent design—operators will prefer mechanisms that reduce context repetition (tool-indexing, compact prompts, streaming).
- New pricing models (e.g., per-action bundles, session-based caps, or special rates for repeated cached context for verified tool docs) could better reflect agent usage patterns.
Market opportunities & incentives
- High demand for middleware and tooling:
  - Efficient tool retrievers that capture prerequisite dependencies (not just semantic relevance).
  - Compact, incremental context representations (state diffs, resumable session tokens) to reduce token repetition.
  - Recovery-planning modules that detect errors and automatically search alternate tool chains.
- Opportunity for specialized smaller models fine-tuned on tool orchestration and verification behaviors (lower token overhead and latency) or hybrid systems combining small controller models with large LLMs for heavy reasoning.
Risk management & SLAs
- Enterprises will need insurance, monitoring, and explicit SLAs given nontrivial misbehaving rates (Rb). Deterministic benchmarks that include stochastic failures (ComplexMCP) help quantify operational risk.
Labor economics & task allocation
- Given current agent shortcomings, the right deployment is likely hybrid: agents handle routine parts, humans handle verification, exception resolution, and strategy recovery. This shifts labor demand toward higher-skilled supervision and tooling engineering.
Investment priorities
- From an economic standpoint, the funding priorities likely to offer the highest near-term ROI are:
  - Improving retrieval components (coverage of latent dependencies).
  - Better state-tracking, verification, and calibration (reducing misbehaviors).
  - Efficient prompting/session mechanisms to reduce per-transaction token costs.
- Purely scaling base LLM capability (bigger models) appears insufficient without architectural improvements for interdependent tool orchestration and error recovery.
Policy and procurement
- Procurement teams should evaluate agent frameworks not only on success rate but on cost-per-completion, expected human oversight hours, and the distribution of failure modes (verification misses vs. syntactic errors vs. environment-induced failures).
- Benchmarks like ComplexMCP are valuable for vendor comparisons and for negotiating SLAs around failure/recovery behavior.


Assessment

Paper Type: descriptive

Evidence Strength: medium. The paper provides systematic, quantitative evaluation on a large, carefully constructed benchmark (300+ tools, 7 stateful sandboxes, deterministic seed-driven scenarios) and a human baseline, which supports its claims about agent performance within that benchmark; however, it does not establish causal impacts on economic outcomes or show that results generalize to the full diversity of real-world production systems and deployments.

Methods Rigor: high. Benchmark design is rigorous: stateful sandboxes, seed-driven deterministic variability, a large and diverse set of tools, explicit modeling of API failures and interdependencies, evaluation across full-context and RAG paradigms, and comparison to human performance; the paper also performs granular trajectory analysis to identify failure modes. Rigor is tempered by limited disclosure (if any) of model versions, hyperparameters, and whether human evaluators represent real users.

Sample: Over 300 tools implemented across 7 stateful sandboxes (office suites, financial systems, etc.) forming ComplexMCP; seed-driven scenarios that produce deterministic but diverse environment states and API failure modes; evaluations of multiple LLMs under full-context and RAG setups; reported human baseline success rate (~90%) and model top performance ≤60%.

Themes: human_ai_collab, productivity

Generalizability
- Synthetic sandboxes may not capture full complexity, scale, or security constraints of production enterprise systems.
- Limited set of 7 sandbox domains may omit important verticals or bespoke integrations.
- Deterministic seed-driven scenarios may not reflect opportunistic, adversarial, or correlated failures in live environments.
- Results depend on the specific LLM versions, prompting, and RAG retrieval setups tested (model heterogeneity/updates could change outcomes).
- Human baseline may not represent typical operators with domain-specific expertise or varied tooling environments.

Claims (9)

Claim	Category	Direction	Confidence	Outcome	Details
Current LLM agents are proficient at calling isolated APIs but struggle with the "last mile" of commercial software automation.	Other	mixed	high	ability to successfully perform end-to-end software automation tasks (vs. isolated API calls)	0.18
We introduce ComplexMCP, a benchmark designed to evaluate agents in rigorous conditions built on the Model Context Protocol (MCP).	Other	positive	high	availability of a benchmark implementing MCP for complex, stateful tool evaluation	0.18
ComplexMCP provides over 300 meticulously tested tools derived from 7 stateful sandboxes, ranging from office suites to financial systems.	Other	positive	high	number of tools / sandboxes included in the benchmark	n=300 0.18
Unlike existing datasets, our benchmark utilizes a seed-driven architecture to simulate dynamic environment states and unpredictable API failures, ensuring a deterministic yet diverse evaluation.	Other	positive	high	determinism and diversity of environment states / simulated API failure scenarios	0.18
We evaluate various LLMs across full-context and RAG paradigms, revealing a stark performance gap: even top-tier models fail to exceed a 60% success rate, far trailing human performance 90%.	Error Rate	negative	high	task success rate (agent vs human)	<=60% success rate for top-tier models; 90% success rate for humans 0.18
Granular trajectory analysis identifies three fundamental bottlenecks: (1) tool retrieval saturation as action spaces scale;	Task Allocation	negative	high	tool retrieval performance / selection accuracy as action space scales	0.18
(2) over-confidence, where agents skip essential environment verifications;	Error Rate	negative	high	frequency of environment verification checks performed by agents	0.18
(3) strategic defeatism, a tendency to rationalize failure rather than pursuing recovery.	Organizational Efficiency	negative	high	rate of recovery/persistence actions vs rationalization actions after failure	0.18
These findings underscore the insufficiency of current agents for interdependent workflows, positioning ComplexMCP as a critical testbed for the next generation of resilient autonomous systems.	Organizational Efficiency	negative	high	agent suitability/readiness for interdependent workflows	0.18

Stateful workflow automation remains out of reach: even top LLM agents succeed on no more than 60% of complex, interdependent API tasks versus 90% for humans, exposing critical gaps in tool retrieval, verification, and recovery behavior.