Structured 'Skills' provide little marginal gain for an autonomous cybersecurity agent: across 180 CTF runs, the full-Skills condition outperformed no-Skills by only 8.9 percentage points (not statistically significant), and in environments that return precise, low-latency feedback, curated Skills often cease to help and can even degrade performance.
Agent Skills, structured packages of procedural knowledge loaded into an LLM agent at inference time, are widely reported to improve task pass rates by an average of 16.2 percentage points across diverse domains. Yet the same benchmarks show wide variance, with 16 of 84 tasks suffering negative deltas when Skills are introduced. The community has not yet articulated a clean mechanism for when Skills help and when they are merely redundant overhead. We re-analyze a recently published 180-run controlled study of an MCP-grounded autonomous Capture-the-Flag (CTF) agent under four documentation conditions of increasing richness (55, 1,478, 1,976, and 4,147 lines), and show that these conditions correspond almost exactly to a No-Skills, Experiential-Skills, Curated-Skills, and Comprehensive-Skills ablation. In offensive cybersecurity, a domain not deeply covered by existing Skills benchmarks, the marginal benefit of Skills collapses. The spread between the no-Skills and full-Skills conditions is only 8.9 pp (p = 0.71, χ²; p = 0.25, Cochran–Armitage trend test; five of six pairwise Cohen's h values fall below the 0.2 small-effect threshold). We argue that the missing variable is environment-feedback bandwidth. When an agent's tool layer returns strict, schema-validated, low-latency observations, the environment itself supplies the procedural correction signal that Skills are normally needed to provide. As a result, the marginal benefit of curated Skills diminishes substantially, and, in some cases (e.g., our timing side-channel setting), actively degrades performance. We articulate a falsifiable hypothesis, sketch its design implications for compound AI systems, and will release the reanalysis pipeline to support replication.
Summary
Main Finding
A controlled re-analysis of a 180-run study of an MCP-grounded autonomous Capture-the-Flag (CTF) agent shows that adding curated Agent Skills yields at most an 8.9 percentage-point (pp) increase in pass rate versus a No-Skills baseline, and that increase is not statistically significant (χ² p = 0.71; Cochran–Armitage p = 0.25). In this high-feedback-bandwidth domain (schema-validated, low-latency tool outputs), the marginal benefit of Skills collapses relative to previously reported cross-domain averages (SkillsBench average +16.2 pp). The authors propose the "feedback-bandwidth" hypothesis: Skills are less valuable when the environment provides deterministic, structured, and timely corrective feedback.
Key Points
- Experimental mapping: four documentation conditions correspond to No‑Skills, Experiential‑Skills, Curated‑Skills, and Comprehensive‑Skills (55, 1,478, 1,976, 4,147 lines respectively).
- Pass rates (45 trials per condition):
  - No-Skills: 77.8% (35/45), mean time 20.1 min
  - Experiential: 82.2% (37/45), mean time 19.1 min (+4.4 pp)
  - Curated: 84.4% (38/45), mean time 18.5 min (+6.6 pp)
  - Comprehensive: 86.7% (39/45), mean time 17.1 min (+8.9 pp)
- Statistical effect sizes: five of the six pairwise Cohen's h values fall below the 0.2 small-effect threshold; only No-Skills vs Comprehensive exceeds it, at h = 0.23 (still a small effect).
- Token economy: Comprehensive consumed ~75× more procedural-context tokens than No‑Skills for a non‑significant +8.9 pp—making No‑Skills the cost‑efficient engineering choice in this domain.
- Non‑monotonicity / negative delta: in a timing side‑channel task, adding experiential lessons hurt performance (false lesson propagation), showing Skills can degrade performance when they encourage inappropriate methods.
- Feedback-bandwidth hypothesis (H1): the marginal benefit of Skills is inversely related to how deterministic, schema-rich, and low-latency the environment feedback is.
- Predictions: lowering feedback bandwidth should increase Skills benefit; tasks with dense/immediate verifiers should show smaller Skills deltas; contradictory procedural knowledge can induce negative deltas.
- Relation to prior work: complements SkillsBench by explaining domain heterogeneity—high gains in healthcare/manufacturing likely reflect low feedback bandwidth; cybersecurity here is high‑bandwidth and shows small/no gains.
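The effect sizes quoted in the Key Points can be recomputed from the reported pass counts alone. The following is a minimal sketch, not the authors' reanalysis pipeline: it computes Cohen's h for every pair of conditions and, using the standard normal-approximation formula for comparing two proportions, estimates how many trials per condition would be needed to detect the largest observed effect. The function names and the defaults `alpha=0.05` and `power=0.80` are illustrative assumptions, not values taken from the study.

```python
import math
from itertools import combinations
from statistics import NormalDist

# Solved trials per documentation condition, as reported in the Key Points.
PASSES = {"No-Skills": 35, "Experiential": 37, "Curated": 38, "Comprehensive": 39}
TRIALS = 45  # trials per condition


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference between arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


def pairwise_h(passes: dict, trials: int) -> dict:
    """Absolute Cohen's h for every unordered pair of conditions."""
    return {
        (a, b): abs(cohens_h(passes[a] / trials, passes[b] / trials))
        for a, b in combinations(passes, 2)
    }


def trials_per_condition(h: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate trials per group for a two-sided two-proportion test.

    alpha and power are conventional defaults chosen for illustration only.
    """
    z = NormalDist().inv_cdf
    return math.ceil(((z(1 - alpha / 2) + z(power)) / h) ** 2)


if __name__ == "__main__":
    effects = pairwise_h(PASSES, TRIALS)
    for (a, b), h in effects.items():
        label = "below 0.2" if h < 0.2 else "at or above 0.2"
        print(f"{a} vs {b}: h = {h:.3f} ({label})")
    largest = max(effects.values())
    print("trials per condition to detect the largest h:", trials_per_condition(largest))
```

Under these assumptions the required sample size comes out well above the 45 trials per condition actually available, which is one way to read the limited-power caveat in the study design.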
Data & Methods
- Source: Reanalysis of an MCP‑grounded autonomous CTF agent study (15 multi‑phase challenges across memory, reverse engineering, web exploitation, cryptography).
- Design: 15 challenges × 4 documentation conditions × 3 independent trials = 180 trajectories. Model and tool layer (Claude Sonnet 4.5, MCP servers exposing Nmap, Ghidra, Angr, GDB with strict JSON schemas) held constant.
- Documentation conditions (lines of procedural context): 55 (Minimal/No‑Skills), 1,478 (Experiential), 1,976 (Curated), 4,147 (Comprehensive).
- Outcomes: pass/fail per trajectory; solve times. Statistical tests: χ² for independence, Cochran–Armitage trend test, Kruskal–Wallis for durations, Cohen's h for effect sizes.
- Limitations: a single backbone model; 15 challenges with 3 trials each (limited statistical power); a single domain (offensive cybersecurity); and an observational reanalysis rather than a prospective cross-domain manipulation of feedback bandwidth.
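The two pass-rate tests named above can be reproduced from the 4 × 2 pass/fail table using only the Python standard library. This is a hedged sketch rather than the authors' code: it assumes equally spaced trend scores (0, 1, 2, 3) for the four conditions, which the summary does not state, and it evaluates the χ² tail probability with a closed form that holds only for 3 degrees of freedom.

```python
import math
from statistics import NormalDist

# Solved trials and total trials per condition, ordered by documentation richness.
PASSES = [35, 37, 38, 39]
TRIALS = [45, 45, 45, 45]
SCORES = [0, 1, 2, 3]  # assumed equally spaced trend scores (not given in the source)


def chi_square_independence(passes, trials):
    """Pearson chi-square statistic and degrees of freedom for a k x 2 table."""
    total_pass, total = sum(passes), sum(trials)
    stat = 0.0
    for x, n in zip(passes, trials):
        exp_pass = n * total_pass / total
        exp_fail = n - exp_pass
        stat += (x - exp_pass) ** 2 / exp_pass
        stat += ((n - x) - exp_fail) ** 2 / exp_fail
    return stat, len(passes) - 1


def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variable with exactly 3 df."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


def cochran_armitage(passes, trials, scores):
    """Cochran-Armitage trend test: z statistic and two-sided p-value."""
    total_pass, total = sum(passes), sum(trials)
    p_bar = total_pass / total
    t = sum(s * (x - n * p_bar) for s, x, n in zip(scores, passes, trials))
    weighted_sq = sum(n * s * s for s, n in zip(scores, trials))
    weighted = sum(n * s for s, n in zip(scores, trials))
    var = p_bar * (1 - p_bar) * (weighted_sq - weighted ** 2 / total)
    z = t / math.sqrt(var)
    return z, 2 * (1 - NormalDist().cdf(abs(z)))


if __name__ == "__main__":
    stat, df = chi_square_independence(PASSES, TRIALS)
    print(f"chi-square = {stat:.3f}, df = {df}, p = {chi2_sf_df3(stat):.2f}")
    z, p = cochran_armitage(PASSES, TRIALS, SCORES)
    print(f"Cochran-Armitage z = {z:.3f}, p = {p:.2f}")
```

With the equal-spacing assumption, both p-values agree with the reported figures to two decimals (0.71 and 0.25); a different scoring, such as lines of documentation, would give a different trend-test result.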
Implications for AI Economics
- Marginal value and substitute goods: Procedural Skills act as a substitute for high-quality environment feedback rather than as an additive input, so their marginal ROI depends strongly on feedback bandwidth. Where tooling and verifiers already provide high-quality feedback, investments in Skills yield low marginal returns.
- Cost‑benefit engineering rule: compute cost per percentage‑point improvement (or per expected utility unit). In the studied domain, the token/context cost of Comprehensive Skills (≈75×) is not justified by the small, statistically weak performance gain—favor investments that improve tool grounding (schema fidelity, verifier speed) first.
- Product and marketplace design:
  - Pricing and curation of Skills marketplaces should be domain-sensitive: Skills for low-feedback-bandwidth domains (healthcare, manufacturing, complex enterprise workflows) are higher value and can command higher prices; in high-bandwidth domains the market value should be lower.
  - Platform incentives: allocate engineering resources and platform primitives toward improving deterministic, structured tooling and verifiers where feasible, since this can substitute for costly skill production.
- Portfolio and resource allocation: Organizations should treat Skills development as a contingent investment. Prioritize (in order):
  - Improve tool interfaces (schema enforcement, low-latency verifiers).
  - Only then assess residual failure modes to justify curated Skills. This reduces redundant spending on Skill authoring when tooling would deliver larger marginal gains.
- Risk management and externalities: Bad or misaligned Skills can actively harm performance (negative deltas). Skills marketplaces should support provenance, validation against verifiers, and domain‑specific testing to avoid deploying harmful procedural packages.
- Evaluation and procurement metrics: Move beyond aggregate pass‑rate deltas to cost‑aware metrics (e.g., cost-per-pp, expected value of information from added procedural context) and report Skill efficacy conditional on environment feedback profile.
- Research & policy priorities: Fund comparative experiments that vary feedback bandwidth (schema vs raw, fast vs slow verifiers) to quantify substitution elasticities between Skills and tooling—these elasticities determine optimal investment mixes for firms and public purchasers of AI systems.
Summary takeaway: Agent Skills are not universally high‑ROI; their economic value depends on the environment's ability to provide structured, fast feedback. Firms and platforms should treat Skills as a domain‑conditional, compensatory lever and prioritize improvements to environment/tooling when feedback bandwidth can be raised cost‑effectively.
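The cost-per-pp metric proposed above can be illustrated with the study's own numbers. The sketch below is an assumption-laden example, not a metric defined by the authors: it uses lines of procedural context as a stand-in for token cost (the summary reports only the roughly 75× ratio, which matches the line-count ratio), and the names `cost_per_pp` and `context_ratio` are hypothetical. Gains are computed from raw pass counts, so they can differ in the last digit from differences of the rounded percentages.

```python
import math

# (condition, lines of procedural context, solved trials); 45 trials per condition.
CONDITIONS = [
    ("No-Skills", 55, 35),
    ("Experiential", 1478, 37),
    ("Curated", 1976, 38),
    ("Comprehensive", 4147, 39),
]
TRIALS = 45


def pass_rate_pct(passes: int, trials: int = TRIALS) -> float:
    """Pass rate in percent."""
    return 100 * passes / trials


def cost_per_pp(baseline, candidate) -> float:
    """Extra context lines spent per percentage point gained over the baseline.

    Returns infinity when the candidate does not improve on the baseline.
    """
    _, base_lines, base_pass = baseline
    _, cand_lines, cand_pass = candidate
    gain = pass_rate_pct(cand_pass) - pass_rate_pct(base_pass)
    if gain <= 0:
        return math.inf
    return (cand_lines - base_lines) / gain


def context_ratio(baseline, candidate) -> float:
    """How many times more context the candidate loads than the baseline."""
    return candidate[1] / baseline[1]


if __name__ == "__main__":
    base = CONDITIONS[0]
    for cond in CONDITIONS[1:]:
        print(
            f"{cond[0]}: {context_ratio(base, cond):.1f}x context, "
            f"{cost_per_pp(base, cond):.0f} lines per pp vs {base[0]}"
        )
    # Marginal cost of each step up in documentation richness.
    for prev, nxt in zip(CONDITIONS, CONDITIONS[1:]):
        print(f"{prev[0]} -> {nxt[0]}: {cost_per_pp(prev, nxt):.0f} lines per pp")
```

Reporting both the cumulative and the step-wise figures makes it visible where added documentation becomes most expensive per unit of gain. A production version would substitute measured token counts and a utility-weighted outcome for the line counts and raw pass rates used here.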
Assessment
Claims (9)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| Agent Skills, structured packages of procedural knowledge loaded into an LLM agent at inference time, are widely reported to improve task pass rates by an average of 16.2 percentage points across diverse domains. [Output Quality] | positive | high | task pass rate (task success rate) | n=84; 16.2 percentage points; 0.48 |
| In those same benchmarks, 16 of 84 tasks suffered negative deltas when Skills are introduced. [Output Quality] | negative | high | task-level performance delta when Skills are introduced (negative change in pass rate) | n=84; 16 of 84 tasks; 0.48 |
| We re-analyze a recently published 180-run controlled study of an MCP-grounded autonomous Capture-the-Flag (CTF) agent under four documentation conditions (55, 1,478, 1,976, and 4,147 lines). [Other] | null_result | high | reanalysis dataset size and documentation-condition line counts | n=180; 0.8 |
| Those four documentation conditions correspond almost exactly to a No-Skills, Experiential-Skills, Curated-Skills, and Comprehensive-Skills ablation. [Other] | null_result | high | mapping of documentation richness to Skill-ablation categories | n=180; 0.48 |
| In offensive cybersecurity, the marginal benefit of Skills collapses: the spread between the no-Skills and full-Skills conditions is only 8.9 percentage points (p = 0.71, χ²; p = 0.25, Cochran–Armitage trend test; five of six pairwise Cohen's h values fall below the 0.2 small-effect threshold). [Output Quality] | null_result | high | task pass rate (success rate) in Capture-the-Flag offensive cybersecurity tasks | n=180; 8.9 percentage points (p = 0.71, χ²; p = 0.25, Cochran–Armitage; five of six pairwise Cohen's h < 0.2); 0.8 |
| When an agent's tool layer returns strict, schema-validated, low-latency observations, the environment itself supplies the procedural correction signal that Skills are normally needed to provide. [Task Allocation] | null_result | medium | degree to which environment feedback substitutes for Skills (procedural correction signal) | 0.05 |
| As a result (of high environment-feedback bandwidth), the marginal benefit of curated Skills diminishes substantially and, in some cases (e.g., our timing side-channel setting), actively degrades performance. [Output Quality] | negative | medium | task performance (including degradation in timing side-channel setting) when Skills are added | 0.29 |
| The community has not yet articulated a clean mechanism for when Skills help and when they are merely redundant overhead. [Governance And Regulation] | null_result | medium | state of community understanding / theoretical mechanism for Skill utility | 0.14 |
| We will release the reanalysis pipeline to support replication. [Other] | null_result | high | availability of reanalysis pipeline (planned release) | 0.24 |