Programmatic state abstraction delivers the largest returns per token in a cyber-defense POMDP, improving mean return by up to 76% over raw observations, whereas scattering deliberation tools across hierarchical sub-agents backfires, making mean return up to 3.4× worse while using 1.8–2.7× more tokens; clean modular design, not deeper per-agent reasoning, tends to win.
Deploying compound LLM agents in adversarial, partially observable sequential environments requires navigating several design dimensions: (1) what the agent sees, (2) how it reasons, and (3) how tasks are decomposed across components. Yet practitioners lack guidance on which design choices improve performance versus merely increase inference costs. We present a controlled study of compound LLM agent design in CybORG CAGE-2, a cyber defense environment modeled as a Partially Observable Markov Decision Process (POMDP). Reward is non-positive, so all configurations operate in a failure-mitigation mode. Our evaluation spans five model families, six models, and twelve configurations (3,475 episodes) with token-level cost accounting. We vary context representation (raw observations vs. a deterministic state-tracking layer with compressed history), deliberation (self-questioning, self-critique, and self-improvement tools, with optional chain-of-thought prompting), and hierarchical decomposition (monolithic ReAct vs. delegation to specialized sub-agents). We find that: (1) Programmatic state abstraction delivers the largest returns per token spent (RPTS), improving mean return by up to 76% over raw observations. (2) Distributing deliberation tools across a hierarchy degrades performance relative to hierarchy alone for all five model families, reaching up to 3.4× worse mean return while using 1.8–2.7× more tokens. We call this destructive pattern a deliberation cascade. (3) Hierarchical decomposition without deliberation achieves the best absolute performance for most models, and context engineering is generally more cost-effective than deliberation. These findings suggest a design principle for structured adversarial POMDPs: invest in programmatic infrastructure and clean task decomposition rather than deeper per-agent reasoning, as these strategies can interfere when combined.
Summary
Main Finding
Programmatic context engineering (a deterministic state-tracking layer plus compressed history) yields the largest returns per token spent (RPTS) in adversarial, partially observable sequential tasks. By contrast, adding deliberation (self-questioning, self-critique, self-improvement, CoT) across a multi-agent hierarchy often reduces performance while substantially increasing token costs, a failure mode the authors term a "deliberation cascade." Hierarchical decomposition without distributed deliberation achieves the best absolute performance for most models, but context engineering is generally the most cost-effective first investment.
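To make the idea of a deterministic state-tracking layer concrete, the following is a minimal sketch of a host-indexed tracker kept outside the LLM. The class names, method names, host names, and update rules are assumptions made for illustration and are not taken from the study; only the status labels (baseline, changed, unknown, analysed at step n) and the notion of a compressed per-host action history come from the summary.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class HostState:
    """Per-host record; field names are hypothetical."""
    status: str = "unknown"              # baseline | changed | unknown | analysed
    analysed_at: Optional[int] = None    # step of the most recent Analyse, if any
    actions: List[Tuple[int, str]] = field(default_factory=list)  # (step, action)


class NetworkTracker:
    """Deterministic, host-indexed state maintained in code, not by the model."""

    def __init__(self, hosts):
        self.hosts: Dict[str, HostState] = {name: HostState() for name in hosts}

    def observe(self, host, changed):
        # Assumed rule: an observation marks a host as changed or baseline.
        self.hosts[host].status = "changed" if changed else "baseline"

    def record_action(self, step, host, action):
        state = self.hosts[host]
        state.actions.append((step, action))
        if action == "Analyse":
            state.status = "analysed"
            state.analysed_at = step

    def render(self, max_actions=3):
        """Compact text for a {network_status}-style prompt slot.

        Keeping only the last `max_actions` entries per host is one simple
        (assumed) way to compress history.
        """
        lines = []
        for name, s in sorted(self.hosts.items()):
            if s.status == "analysed":
                status = f"analysed at step {s.analysed_at}"
            else:
                status = s.status
            recent = ", ".join(f"{a}@{t}" for t, a in s.actions[-max_actions:]) or "none"
            lines.append(f"{name}: {status}; recent actions: {recent}")
        return "\n".join(lines)
```

Because the tracker is plain code, it adds only a few short lines of text to each prompt, which is consistent with the near-zero marginal token cost reported for this kind of context.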
Key Points
- Programmatic state abstraction is highly cost-effective:
  - Context built from the deterministic environment model improves mean episodic return by up to 76% over raw observations.
  - Cumulative penalty falls by roughly 52–76% for four of the six evaluated models relative to observation-only context.
  - These gains come at near-zero marginal token cost (small prompt additions).
- Deliberation cascades:
  - Distributing deliberation tools across a hierarchy (Planner + sub-agents) degrades performance in all evaluated model families.
  - Magnitude: up to 3.4× worse mean return while consuming 1.8–2.7× more tokens.
  - The degradation is attributed to cascading uncertainty and extra token-heavy reasoning rounds that worsen downstream decision quality.
- Hierarchy and decomposition:
  - A bounded three-agent hierarchy (Planner, Analyst, ActionChooser) without per-agent deliberation often yields the best absolute returns.
  - Clean task decomposition plus a shared deterministic state backbone simplifies decision-making for the Planner.
- Cost-performance framing:
  - Tokens (prompt + completion) are the primary cost primitive; they map directly to billed usage and correlate with latency.
  - Returns per token spent (RPTS) is the central evaluation metric; context engineering dominates the Pareto frontiers.
- Robustness & generality:
  - Results hold across five model families (six models total) with deterministic decoding and no per-model tuning.
  - Quantitative magnitudes vary by model, but the qualitative conclusions (context helps, distributed deliberation hurts) are consistent.
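Because returns are non-positive, the percentages and multipliers above are easiest to read as statements about penalty size. The sketch below shows one plausible way to compute them; the function names are hypothetical, the formulas are an assumed reading of "penalty reduction", "× worse", and "× more tokens" rather than the authors' definitions, and the inputs used in the usage note are made-up numbers, not results from the study.

```python
def penalty_reduction(baseline_return, candidate_return):
    """Fractional reduction in cumulative penalty relative to a baseline.

    Assumes both returns are non-positive and the baseline is strictly negative.
    """
    if baseline_return >= 0:
        raise ValueError("baseline return must be strictly negative")
    return 1.0 - candidate_return / baseline_return


def worse_factor(reference_return, candidate_return):
    """How many times larger the candidate's penalty is than the reference's."""
    if reference_return >= 0:
        raise ValueError("reference return must be strictly negative")
    return candidate_return / reference_return


def token_ratio(reference_tokens, candidate_tokens):
    """Token use of the candidate as a multiple of the reference."""
    if reference_tokens <= 0:
        raise ValueError("reference token count must be positive")
    return candidate_tokens / reference_tokens
```

Under this reading, a hypothetical baseline return of -100 and a candidate return of -24 give `penalty_reduction(-100.0, -24.0)` of about 0.76, and a hypothetical reference return of -50 against a candidate return of -170 gives `worse_factor(-50.0, -170.0)` of about 3.4.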
Data & Methods
- Environment:
  - CybORG CAGE-2: an adversarial POMDP modeling defense of a 13-host network over T = 30 steps.
  - The defender chooses among five actions (Monitor, Analyse, Remove, Restore, Decoy). Reward r_t ≤ 0 penalizes compromise and interventions; episodic return G = Σ_t r_t (closer to zero is better).
  - Attacker: scripted, non-adaptive multi-stage kill chain, with stochastic elements across runs.
- System architecture:
  - Four layers: (1) Hierarchy (Planner ± Analyst & ActionChooser with strict JSON contracts), (2) Deterministic Infrastructure (environment model, history, action validator), (3) Context engineering ({observation}, {history}, {network_status} injections), (4) Reasoning (ReAct loop, optional deliberation tools, CoT).
  - Deterministic environment model: host-indexed statuses (baseline, changed, unknown, analysed at step n) plus a compressed per-host action history and a compressed global {history} injection.
  - Reliability guards: output validation, retry up to 3 times, safe fallback to Monitor.
- Deliberation tools (applied inside the ReAct loop of a single environment step):
  - question (self-questioning); critique (self-critique; includes question); improve (revision; includes question and critique); CoT injection (chain-of-thought scaffolding).
  - All deliberation is intra-step (no memory carried across steps).
- Experimental design:
  - Models: six models from five families (Grok, Llama, Devstral, Qwen, Gemini variants).
  - Configurations span three axes: Context (6 variants of {obs}/{hist}/{net}), Deliberation (4 cumulative levels), Hierarchy (2 modes: monolithic Planner vs. three-agent delegation). Shared default anchor: hist+net, no deliberation, no delegation.
  - Evaluation scale: 72 model–configuration pairs, 3,475 episodes total, ~283.9M tokens instrumented. Standard allocation: 10 instances × 5 runs = 50 episodes per pair (some variants used different budgets).
  - Deterministic decoding (greedy, temperature 0); knowledge-free initialization (no domain heuristics, no explicit environment descriptions).
- Metrics reported:
  - Mean episode return (G), total tokens per episode (prompt + completion), returns per token spent (RPTS), and variability (standard deviation).
  - Three-axis Pareto frontiers (return, tokens, RPTS).
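The Pareto-frontier analysis listed among the metrics can be illustrated with a short sketch. It uses only two of the three axes (mean return and tokens per episode), because the summary does not give the RPTS formula. The `Config` type, the function names, and every number in the example are placeholders invented for illustration, not data from the study.

```python
from typing import List, NamedTuple


class Config(NamedTuple):
    name: str
    mean_return: float   # non-positive; closer to zero is better
    mean_tokens: float   # prompt + completion tokens per episode; lower is better


def dominates(a: Config, b: Config) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.mean_return >= b.mean_return and a.mean_tokens <= b.mean_tokens
    better = a.mean_return > b.mean_return or a.mean_tokens < b.mean_tokens
    return no_worse and better


def pareto_frontier(configs: List[Config]) -> List[Config]:
    """Keep every configuration that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


# Placeholder values only; they do not come from the study.
example = [
    Config("cfg_a", -120.0, 40_000),
    Config("cfg_b", -40.0, 42_000),
    Config("cfg_c", -30.0, 90_000),
    Config("cfg_d", -100.0, 200_000),  # worse return and more tokens than cfg_b
]
frontier_names = [c.name for c in pareto_frontier(example)]
# frontier_names == ["cfg_a", "cfg_b", "cfg_c"]
```

A configuration that both costs more tokens and earns a worse return than some alternative, as `cfg_d` does here, drops off the frontier. That is the position the summary reports for hierarchies with distributed deliberation.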
Implications for AI Economics
- Measure cost-effectiveness in tokens, not just raw performance:
  - Tokens are a practical, directly billable proxy for inference cost and latency; use RPTS (or a monetary variant) to compare architecture choices.
- High ROI from programmatic infrastructure:
  - Investing engineering effort in deterministic state-tracking, context compression, and structured prompts is likely to yield large marginal returns at minimal ongoing inference cost. For product teams and procurement, this often outweighs the benefit of paying for more model compute to support per-call deliberation.
- Beware costly reasoning that reduces net return:
  - Deliberation (extra model calls, CoT-style prompting) can inflate per-episode token use and, in multi-agent hierarchies, actively reduce task performance (deliberation cascades). Economically, additional reasoning rounds can have negative marginal value; treat them as optional features to be validated by RPTS analysis before deployment.
- Architectural choices interact nonlinearly:
  - Additive assumptions (more reasoning + more agents = better) can fail. Economic models for compound AI should account for interaction effects (e.g., combinatorial increases in tokens and potential degradations in outcome quality).
- Practical recommendations for deployments and budgeting:
  - Prioritize spending on deterministic context engineering, validation/fallback infrastructure, and clean task decomposition over enabling broad per-agent deliberation tools.
  - When deliberation is used, localize it (e.g., top-level Planner only) and limit per-step depth to control token budgets and avoid cascading uncertainty.
  - Include system-level reliability mechanisms (validation, safe fallback) in ROI and risk assessments because invalid outputs are costly in sequential adversarial settings.
- For cost modeling and procurement decisions:
  - Include token-based pricing, latency constraints, and expected RPTS distributions across model families. Use multi-model A/B testing to estimate variability and tail risk in returns.
  - Account for operational risk: deliberation cascades represent a systemic risk where adding capabilities increases both costs and failure probability, analogous to negative externalities in economic systems.
- Research & policy:
  - Economic evaluations of compound AI systems should report token-costed performance and Pareto frontiers across architecture choices, not just peak accuracy.
  - Regulatory or contractual SLAs for safety-critical sequential systems should require transparency on token consumption, fallbacks, and deterministic state infrastructure to limit hidden operational costs and cascading failures.
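The reliability mechanisms mentioned above (output validation, bounded retries, safe fallback to Monitor) might look roughly like the following sketch. The JSON schema with `action` and `host` keys, the `call_model` callable, and the reading of "retry up to 3 times" as one initial attempt plus three retries are all assumptions made for illustration; the study's actual contracts are not reproduced in this summary.

```python
import json

VALID_ACTIONS = {"Monitor", "Analyse", "Remove", "Restore", "Decoy"}
MAX_RETRIES = 3  # assumed meaning: one initial attempt plus up to three retries
FALLBACK = {"action": "Monitor", "host": None}


def choose_action(call_model, known_hosts):
    """Return a validated action dict, falling back to Monitor on repeated failure.

    `call_model` is a hypothetical zero-argument callable that returns the raw
    text of one model reply; `known_hosts` is a set of valid host names.
    """
    for _ in range(1 + MAX_RETRIES):
        raw = call_model()
        try:
            decision = json.loads(raw)
        except (json.JSONDecodeError, TypeError):
            continue  # not parseable as JSON: retry
        if not isinstance(decision, dict):
            continue  # valid JSON but wrong shape: retry
        action = decision.get("action")
        host = decision.get("host")
        if action == "Monitor":
            return dict(FALLBACK)  # assumed: Monitor needs no target host
        if action in VALID_ACTIONS and isinstance(host, str) and host in known_hosts:
            return {"action": action, "host": host}
    return dict(FALLBACK)  # every attempt failed validation
```

Wrapping the model call this way keeps an invalid reply from ever reaching the environment, and the number of calls consumed per step gives a direct handle on the token cost of retries.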
Limitations to bear in mind: findings are scoped to structured adversarial POMDPs (CAGE-2); attacker is non-adaptive; the study analyzes static (non-adaptive) architectures at deployment time; quantitative magnitudes vary by model family, so practitioners should reproduce RPTS analysis for their target models and domains.
Assessment
Claims (7)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| Programmatic state abstraction delivers the largest returns per token spent (RPTS), improving mean return by up to 76% over raw observations. [Organizational Efficiency] | positive | high | mean return (and returns per token spent, RPTS) | n=3475; up to 76% improvement in mean return; 0.48 |
| Distributing deliberation tools across a hierarchy degrades performance relative to hierarchy alone for all five model families, reaching up to 3.4× worse mean return while using 1.8–2.7× more tokens. [Decision Quality] | negative | high | mean return (primary) and token usage (secondary) | n=3475; up to 3.4× worse mean return (and 1.8–2.7× more tokens); 0.48 |
| Hierarchical decomposition without deliberation achieves the best absolute performance for most models. [Decision Quality] | positive | medium | absolute mean return | n=3475; 0.29 |
| Context engineering (programmatic state abstraction and clean task decomposition) is generally more cost-effective than deeper per-agent deliberation. [Organizational Efficiency] | positive | high | returns per token spent (RPTS) | n=3475; 0.48 |
| The evaluation spanned five model families, six models, and twelve configurations, totaling 3,475 episodes with token-level cost accounting. [Other] | null_result | high | study scope (models, configurations, episodes) | n=3475; 0.8 |
| Reward is non-positive in the CybORG CAGE-2 environment, so all configurations operate in a failure-mitigation mode. [Other] | null_result | high | sign and interpretation of reward | 0.8 |
| When deliberation tools are distributed across a hierarchy they can interact destructively (a 'deliberation cascade'), producing substantially worse returns and higher token costs than hierarchy alone. [Decision Quality] | negative | medium | mean return and token consumption | n=3475; 0.14 |