Programmatic state abstraction delivers the biggest performance-per-cost gains in a cyber-defense POMDP (up to 76% higher mean return per token), whereas scattering deliberation tools across hierarchical sub-agents backfires—cutting mean return by up to 3.4× and raising token use 1.8–2.7×; clean modular design, not deeper per-agent reasoning, tends to win.

Context, Reasoning, and Hierarchy: A Cost-Performance Study of Compound LLM Agent Design in an Adversarial POMDP

Igor Bogdanov, Chung-Horng Lung, Thomas Kunz, Jie Gao, Adrian Taylor, Marzia Zaman · May 15, 2026

arXiv · quasi_experimental · medium evidence · 7/10 relevance

In a controlled POMDP cyber-defense benchmark, programmatic state abstraction yields the highest returns per token (up to +76%), while distributing deliberation tools across hierarchical sub-agents produces a costly 'deliberation cascade' that reduces performance and increases token use; simple hierarchical decomposition without extra per-agent deliberation gives the best absolute outcomes for most models.

Deploying compound LLM agents in adversarial, partially observable sequential environments requires navigating several design dimensions: (1) what the agent sees, (2) how it reasons, and (3) how tasks are decomposed across components. Yet practitioners lack guidance on which design choices improve performance versus merely increase inference costs. We present a controlled study of compound LLM agent design in CybORG CAGE-2, a cyber defense environment modeled as a Partially Observable Markov Decision Process (POMDP). Reward is non-positive, so all configurations operate in a failure-mitigation mode. Our evaluation spans five model families, six models, and twelve configurations (3,475 episodes) with token-level cost accounting. We vary context representation (raw observations vs. a deterministic state-tracking layer with compressed history), deliberation (self-questioning, self-critique, and self-improvement tools, with optional chain-of-thought prompting), and hierarchical decomposition (monolithic ReAct vs. delegation to specialized sub-agents). We find that: (1) Programmatic state abstraction delivers the largest returns per token spent (RPTS), improving mean return by up to 76% over raw observations. (2) Distributing deliberation tools across a hierarchy degrades performance relative to hierarchy alone for all five model families, reaching up to 3.4× worse mean return while using 1.8–2.7× more tokens. We call this destructive pattern a deliberation cascade. (3) Hierarchical decomposition without deliberation achieves the best absolute performance for most models, and context engineering is generally more cost-effective than deliberation. These findings suggest a design principle for structured adversarial POMDPs: invest in programmatic infrastructure and clean task decomposition rather than deeper per-agent reasoning, as these strategies can interfere when combined.

Summary

Main Finding

Programmatic context engineering (a deterministic state-tracking layer + compressed history) yields the largest returns per token (RPTS) in adversarial, partially observable sequential tasks. By contrast, adding deliberation (self-questioning, critique, improve, CoT) across a multi-agent hierarchy often reduces performance while substantially increasing token costs — a failure mode the authors term a "deliberation cascade." Hierarchical decomposition without distributed deliberation achieves the best absolute performance for most models, but context engineering is generally the most cost-effective first investment.

Key Points

Programmatic state abstraction is highly cost-effective:
- Deterministic environment/model-based context improves mean episodic return by up to 76% over raw observations.
- Reported cumulative penalty reductions of ~52–76% for four of six evaluated models versus observation-only context.
- These gains come at near-zero marginal token cost (small prompt additions).
Deliberation cascades:
- Distributing deliberation tools across a hierarchy (Planner + sub-agents) degrades performance in all evaluated model families.
- Negative impact magnitude: up to 3.4× worse mean return while consuming 1.8–2.7× more tokens.
- The cascade arises from cascading uncertainty and extra token-heavy reasoning rounds that worsen downstream decision quality.
Hierarchy and decomposition:
- A bounded three-agent hierarchy (Planner, Analyst, ActionChooser) without per-agent deliberation often yields the best absolute returns.
- Clean task decomposition plus a shared deterministic state backbone simplifies decision-making for the Planner.
Cost-performance framing:
- Tokens are used as the primary cost primitive (prompt + completion), mapping directly to billed usage and correlating with latency.
- Returns per token spent (RPTS) is the central evaluation metric; context engineering dominates Pareto frontiers.
Robustness & generality:
- Results reproduced across five model families (six models total) with deterministic decoding and no per-model tuning.
- Quantitative magnitudes vary by model, but qualitative conclusions (context helps, distributed deliberation hurts) are consistent.

Data & Methods

Environment:
- CybORG CAGE-2: an adversarial POMDP modeling network defense of a 13-host network over T = 30 steps.
- Defender chooses among five actions (Monitor, Analyse, Remove, Restore, Decoy). Reward r_t ≤ 0 (penalizes compromise and interventions); episodic return G = Σ_t r_t over the T = 30 steps (closer to zero is better).
- Attacker: scripted, non-adaptive multi-stage kill chain; stochastic elements across runs.
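Because every reward is non-positive, a "higher" return means a smaller cumulative penalty, which can make percentage improvements hard to read. The sketch below shows one plausible reading, in which a 76% improvement means 76% of the baseline penalty is removed. The function names and the reward traces are hypothetical and chosen only to keep the arithmetic simple; the paper may compute its percentages differently.

```python
from typing import Sequence


def episodic_return(rewards: Sequence[float]) -> float:
    """Sum per-step rewards r_t into the episodic return G."""
    if any(r > 0 for r in rewards):
        raise ValueError("rewards are expected to be non-positive")
    return float(sum(rewards))


def penalty_reduction(baseline_return: float, new_return: float) -> float:
    """Fraction of the baseline penalty removed (assumed reading of '% improvement')."""
    if baseline_return >= 0:
        raise ValueError("baseline must be a strictly negative return")
    return (baseline_return - new_return) / baseline_return


# Hypothetical 30-step reward traces, not values from the paper.
raw_obs_return = episodic_return([-4.0] * 25 + [0.0] * 5)   # G = -100.0
tracked_return = episodic_return([-1.0] * 24 + [0.0] * 6)   # G = -24.0

print(f"penalty reduced by {penalty_reduction(raw_obs_return, tracked_return):.0%}")
# prints "penalty reduced by 76%"
```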
System architecture:
- Four layers: (1) Hierarchy (Planner ± Analyst & ActionChooser with strict JSON contracts), (2) Deterministic Infrastructure (environment model, history, action validator), (3) Context engineering ({observation}, {history}, {network_status} injections), (4) Reasoning (ReAct loop, optional deliberation tools, CoT).
- Deterministic environment model: host-indexed statuses (baseline, changed, unknown, analysed at step n) plus compressed per-host action history and a compressed global {history} injection.
- Reliability guards: output validation, retry up to 3 times, safe fallback to Monitor.
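A minimal sketch of what a deterministic environment model of this kind could look like is given below. It tracks a status per host and renders a compact text block suitable for a {network_status}-style injection. The class names, the status update rules, the host names, and the run-length encoding used to compress action history are all assumptions made for illustration; the summary does not specify the authors' actual data structures or compression scheme.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass
class HostState:
    status: str = "unknown"                  # baseline | changed | unknown | analysed
    analysed_at: Optional[int] = None        # step of the most recent Analyse
    actions: List[Tuple[int, str]] = field(default_factory=list)


def compress(actions: List[Tuple[int, str]]) -> str:
    """Run-length encode consecutive identical actions (assumed scheme)."""
    runs: List[list] = []
    for step, action in actions:
        if runs and runs[-1][0] == action:
            runs[-1][2] = step               # extend the current run
            runs[-1][3] += 1
        else:
            runs.append([action, step, step, 1])
    return ", ".join(
        f"{a}@{s}" if n == 1 else f"{a} x{n}@{s}-{e}" for a, s, e, n in runs
    )


class NetworkModel:
    """Host-indexed state that is updated by code, never by the LLM."""

    def __init__(self, hosts: Iterable[str]) -> None:
        self.hosts: Dict[str, HostState] = {h: HostState() for h in hosts}

    def observe(self, host: str, changed: bool) -> None:
        state = self.hosts[host]
        state.status = "changed" if changed else "baseline"
        state.analysed_at = None

    def record_action(self, step: int, host: str, action: str) -> None:
        state = self.hosts[host]
        state.actions.append((step, action))
        if action == "Analyse":
            state.status, state.analysed_at = "analysed", step

    def render(self) -> str:
        """Produce the text injected into the prompt, one line per host."""
        lines = []
        for name in sorted(self.hosts):
            state = self.hosts[name]
            status = state.status
            if status == "analysed":
                status = f"analysed at step {state.analysed_at}"
            history = compress(state.actions) or "none"
            lines.append(f"{name}: {status}; actions: {history}")
        return "\n".join(lines)


# Hypothetical host names and events.
model = NetworkModel(["User1", "Op_Server0"])
model.observe("User1", changed=True)
model.record_action(3, "User1", "Analyse")
model.record_action(4, "User1", "Analyse")
print(model.render())
```

The point of the design is that the prompt carries a short, always-current summary instead of the full observation log, so the added context costs few tokens per step.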
Deliberation tools (applied inside a single environment step ReAct loop):
- question (self-questioning); critique (self-critique, includes question); improve (revision, includes previous); CoT injection (chain-of-thought scaffolding).
- All deliberation is intra-step (no memory carried across steps).
Experimental design:
- Models: six models from five families (Grok, Llama, Devstral, Qwen, Gemini variants).
- Configurations: 3 axes — Context (6 variants of {obs}/{hist}/{net}), Deliberation (4 cumulative levels), Hierarchy (2 modes: monolithic Planner vs three-agent delegation). Shared default anchor: hist+net, no deliberation, no delegation.
- Evaluation scale: 72 model–configuration pairs, 3,475 episodes total, ~283.9M tokens instrumented. Standard allocation: 10 instances × 5 runs = 50 episodes/pair (some variants had different budgets as noted).
- Deterministic decoding (greedy, temp=0); knowledge-free initialization (no domain heuristics, no explicit environment descriptions).
Metrics reported:
- Mean episode return (G), total tokens per episode (prompt + completion), returns per token spent (RPTS), and variability (std dev).
- Constructed three-axis Pareto frontiers (return, tokens, RPTS).
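The Pareto comparison can be illustrated with a small sketch over two of the three axes, mean return and tokens per episode. RPTS is left out here because the summary does not give its exact formula. The configuration names and numbers are hypothetical and only mimic the qualitative pattern reported (distributed deliberation costing more tokens for a worse return).

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    name: str
    mean_return: float   # non-positive; closer to zero is better
    tokens: float        # prompt + completion tokens per episode


def dominates(a: Result, b: Result) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    no_worse = a.mean_return >= b.mean_return and a.tokens <= b.tokens
    better = a.mean_return > b.mean_return or a.tokens < b.tokens
    return no_worse and better


def pareto_frontier(results: List[Result]) -> List[Result]:
    """Keep every configuration that no other configuration dominates."""
    return [r for r in results if not any(dominates(o, r) for o in results)]


# Hypothetical results, not figures from the paper.
results = [
    Result("raw_obs", -100.0, 40_000),
    Result("hist_net", -30.0, 42_000),
    Result("hierarchy", -20.0, 90_000),
    Result("hierarchy_delib", -68.0, 200_000),
]

for r in pareto_frontier(results):
    print(r.name, r.mean_return, r.tokens)
```

In this toy data the deliberation-in-hierarchy configuration drops off the frontier because plain hierarchy is both cheaper and better, which is the shape of result the study describes as a deliberation cascade.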

Implications for AI Economics

Measure cost-effectiveness in tokens, not just raw performance:
- Tokens are a practical, directly billable proxy for inference cost and latency; use RPTS (or a monetary variant) to compare architecture choices.
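As a rough illustration of the monetary variant, token counts can be converted to a per-episode cost once prompt and completion prices are known. The prices and token counts below are hypothetical placeholders, and the sketch assumes the reported 1.8–2.7× token increase applies uniformly to prompt and completion tokens, which the summary does not state.

```python
def episode_cost_usd(
    prompt_tokens: float,
    completion_tokens: float,
    usd_per_m_prompt: float,
    usd_per_m_completion: float,
) -> float:
    """Cost of one episode given per-million-token prices."""
    total = prompt_tokens * usd_per_m_prompt + completion_tokens * usd_per_m_completion
    return total / 1_000_000


# Hypothetical baseline: 80k prompt + 10k completion tokens per episode.
base = episode_cost_usd(80_000, 10_000, usd_per_m_prompt=0.50, usd_per_m_completion=1.50)

# Scale by the token multipliers reported for distributed deliberation.
for multiplier in (1.8, 2.7):
    print(f"{multiplier}x tokens -> ${base * multiplier:.4f} per episode")
```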
High ROI from programmatic infrastructure:
- Investing engineering effort into deterministic state-tracking, context compression, and structured prompts is likely to yield large marginal returns at minimal ongoing inference cost. For product teams and procurement, this often dominates the benefit of buying more expensive model-time compute for per-call deliberation.
Beware costly reasoning that reduces net return:
- Deliberation (extra model calls, CoT-style prompting) can inflate per-episode token use and — in multi-agent hierarchies — actively reduce task performance (deliberation cascades). Economically, additional reasoning rounds can have negative marginal value; treat them as optional features to be validated by RPTS analysis before deployment.
Architectural choices interact nonlinearly:
- Additive assumptions (more reasoning + more agents = better) can fail. Economic models for compound AI should account for interaction effects (e.g., combinatorial increases in tokens and potential degradations in outcome quality).
Practical recommendations for deployments and budgeting:
- Prioritize spending on deterministic context engineering, validation/fallback infrastructure, and clean task decomposition over enabling broad per-agent deliberation tools.
- When deliberation is used, localize it (e.g., top-level Planner only) and limit per-step depth to control token budgets and avoid cascading uncertainty.
- Include system-level reliability mechanisms (validation, safe fallback) in ROI and risk assessments because invalid outputs are costly in sequential adversarial settings.
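A validation and safe-fallback guard of the kind described in the study (output validation, bounded retries, default to Monitor) might be sketched as follows. The JSON schema with "action" and "host" keys, the `call_model` callable, and the reading of "retry up to 3 times" as one initial attempt plus three retries are all assumptions for illustration, not the authors' implementation.

```python
import json
from typing import Callable, Set

ACTIONS = {"Monitor", "Analyse", "Remove", "Restore", "Decoy"}
MAX_RETRIES = 3                                   # retries after the first attempt (assumed)
FALLBACK = {"action": "Monitor", "host": None}    # safe default action


def parse_action(raw: str, known_hosts: Set[str]) -> dict:
    """Validate a model reply against a hypothetical JSON contract."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc.msg}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    action, host = data.get("action"), data.get("host")
    if not isinstance(action, str) or action not in ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    if action == "Monitor":
        return {"action": "Monitor", "host": None}   # assumed to need no target
    if not isinstance(host, str) or host not in known_hosts:
        raise ValueError(f"unknown host: {host!r}")
    return {"action": action, "host": host}


def choose_action(
    call_model: Callable[[str], str], prompt: str, known_hosts: Set[str]
) -> dict:
    """Ask the model, re-prompt on invalid output, then fall back to Monitor."""
    feedback = ""
    for _ in range(1 + MAX_RETRIES):
        raw = call_model(prompt + feedback)
        try:
            return parse_action(raw, known_hosts)
        except ValueError as err:
            feedback = f"\nPrevious output was rejected ({err}). Reply with valid JSON only."
    return dict(FALLBACK)


# Usage with a stub in place of a real LLM client.
replies = iter(["I think we should restore", '{"action": "Restore", "host": "User1"}'])
print(choose_action(lambda _prompt: next(replies), "Choose an action.", {"User1"}))
```

Every retry adds tokens, so a guard like this belongs in the token accounting alongside the main reasoning loop.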
For cost modeling and procurement decisions:
- Include token-based pricing, latency constraints, and expected RPTS distributions across model families. Use multi-model A/B testing to estimate variability and tail risk in returns.
- Account for operational risk: deliberation cascades represent a systemic risk where adding capabilities increases both costs and failure probability — analogous to negative externalities in economic systems.
Research & policy:
- Economic evaluations of compound AI systems should report token-costed performance and Pareto frontiers across architecture choices, not just peak accuracy.
- Regulatory or contractual SLAs for safety-critical sequential systems should require transparency on token consumption, fallbacks, and deterministic state infrastructure to limit hidden operational costs and cascading failures.

Limitations to bear in mind: findings are scoped to structured adversarial POMDPs (CAGE-2); attacker is non-adaptive; the study analyzes static (non-adaptive) architectures at deployment time; quantitative magnitudes vary by model family, so practitioners should reproduce RPTS analysis for their target models and domains.

Assessment

Paper Type: quasi_experimental

Evidence Strength: medium. Strong internal comparisons and large episode counts support credible causal inference about design choices within the simulated environment (good control of confounders and direct measurement of cost and reward), but external validity is limited by a single synthetic cyber-defense POMDP, a specific set of LLM families/models, and simulation simplifications that may not generalize to real-world deployments or other task domains.

Methods Rigor: high. Careful, pre-specified manipulation of key design dimensions, large sample of episodes (3,475), inclusion of multiple model families, explicit token-level cost accounting, and reporting of both absolute returns and returns-per-token indicate rigorous experimental methodology; remaining concerns are primarily about scope and external validation rather than internal execution.

Sample: Simulated cyber-defense environment (CybORG CAGE-2) modeled as a POMDP; reward is non-positive (failure-mitigation regime). Experiments span five model families (six models total), twelve agent configurations (varying raw vs. programmatic state representation, deliberation tools and chain-of-thought, and monolithic vs. hierarchical decomposition), and 3,475 episodes with token-level cost tracking.

Themes: productivity, org_design

Identification: Controlled, factorial-style experiments in a simulated POMDP (CybORG CAGE-2) that systematically vary agent design dimensions (context representation, deliberation tools, hierarchical decomposition) across 12 configurations, five model families and six models, with repeated episodes (3,475) and token-level cost accounting; causal claims rest on within-environment counterfactual comparisons holding environment and task distribution constant.

Generalizability:
- Single simulated environment (CybORG CAGE-2); results may not hold in other domains or real-world networks.
- Specific LLM families/models tested; different or future models may interact differently with design choices.
- Reward structure (non-positive, failure-mitigation) and adversary modeling are particular to the benchmark and may not reflect broader operational objectives.
- Simulation abstractions (POMDP simplifications, attacker models, network realism) limit external validity to actual cyber operations.
- Token-cost accounting depends on deployment and pricing regimes; economic conclusions may differ under other cost structures.

Claims (7)

Claim	Category	Direction	Confidence	Outcome	Details
Programmatic state abstraction delivers the largest returns per token spent (RPTS), improving mean return by up to 76% over raw observations.	Organizational Efficiency	positive	high	mean return (and returns per token spent, RPTS)	n=3475; up to 76% improvement in mean return; 0.48
Distributing deliberation tools across a hierarchy degrades performance relative to hierarchy alone for all five model families, reaching up to 3.4× worse mean return while using 1.8–2.7× more tokens.	Decision Quality	negative	high	mean return (primary) and token usage (secondary)	n=3475; up to 3.4× worse mean return (and 1.8–2.7× more tokens); 0.48
Hierarchical decomposition without deliberation achieves the best absolute performance for most models.	Decision Quality	positive	medium	absolute mean return	n=3475; 0.29
Context engineering (programmatic state abstraction and clean task decomposition) is generally more cost-effective than deeper per-agent deliberation.	Organizational Efficiency	positive	high	returns per token spent (RPTS)	n=3475; 0.48
The evaluation spanned five model families, six models, and twelve configurations, totaling 3,475 episodes with token-level cost accounting.	Other	null_result	high	study scope (models, configurations, episodes)	n=3475; 0.8
Reward is non-positive in the CybORG CAGE-2 environment, so all configurations operate in a failure-mitigation mode.	Other	null_result	high	sign and interpretation of reward	0.8
When deliberation tools are distributed across a hierarchy they can interact destructively (a 'deliberation cascade'), producing substantially worse returns and higher token costs than hierarchy alone.	Decision Quality	negative	medium	mean return and token consumption	n=3475; 0.14