The Commonplace

Structured 'Skills' provide little marginal gain for an autonomous cybersecurity agent: across 180 CTF runs the full-Skills condition outperformed no-Skills by only 8.9 percentage points (statistically insignificant), and in environments that return precise, low-latency feedback curated Skills often cease to help or can degrade performance.

When Skills Don't Help: A Negative Result on Procedural Knowledge for Tool-Grounded Agents in Offensive Cybersecurity
Samuel Jacob Chacko, James Hugglestone, Chashi Mahiul Islam, Xiuwen Liu · May 19, 2026
arxiv · quasi_experimental · medium evidence · 7/10 relevance
In a re-analysis of 180 CTF agent runs, adding structured procedural 'Skills' produced only an 8.9 percentage-point gap versus no-Skills (statistically indistinguishable) and, where the environment returned high-bandwidth, schema-validated observations, the marginal benefit of Skills largely vanished and sometimes harmed performance.

Agent Skills, structured packages of procedural knowledge loaded into an LLM agent at inference time, are widely reported to improve task pass rates by an average of 16.2 percentage points across diverse domains. Yet the same benchmarks show wide variance, with 16 of 84 tasks suffering negative deltas when Skills are introduced. The community has not yet articulated a clean mechanism for when Skills help and when they are merely redundant overhead. We re-analyze a recently published 180-run controlled study of an MCP-grounded autonomous Capture-the-Flag (CTF) agent under four documentation conditions of increasing richness (55, 1,478, 1,976, and 4,147 lines), and show that these conditions correspond almost exactly to a No-Skills, Experiential-Skills, Curated-Skills, and Comprehensive-Skills ablation. In offensive cybersecurity, a domain not deeply covered by existing Skills benchmarks, the marginal benefit of Skills collapses. The spread between the no-Skills and full-Skills conditions is only 8.9 pp (p = 0.71, χ²; p = 0.25, Cochran–Armitage trend test; five of six pairwise Cohen's h values fall below the 0.2 small-effect threshold). We argue that the missing variable is environment-feedback bandwidth. When an agent's tool layer returns strict, schema-validated, low-latency observations, the environment itself supplies the procedural correction signal that Skills are normally needed to provide. As a result, the marginal benefit of curated Skills diminishes substantially, and, in some cases (e.g., our timing side-channel setting), actively degrades performance. We articulate a falsifiable hypothesis, sketch its design implications for compound AI systems, and will release the reanalysis pipeline to support replication.

Summary

Main Finding

A controlled re-analysis of a 180-run study of an MCP-grounded autonomous Capture-the-Flag (CTF) agent shows that adding Agent Skills yields at most an 8.9 percentage-point (pp) increase in pass rate versus a No-Skills baseline, and that increase is not statistically significant (χ² p = 0.71; Cochran–Armitage p = 0.25). In this high-feedback-bandwidth domain (schema-validated, low-latency tool outputs), the marginal benefit of Skills collapses relative to previously reported cross-domain averages (SkillsBench average +16.2 pp). The authors propose the "feedback-bandwidth" hypothesis: Skills are less valuable when the environment provides deterministic, structured, and timely corrective feedback.

Key Points

  • Experimental mapping: four documentation conditions correspond to No‑Skills, Experiential‑Skills, Curated‑Skills, and Comprehensive‑Skills (55, 1,478, 1,976, 4,147 lines respectively).
  • Pass rates (45 trials per condition):
    • No‑Skills: 77.8% (35/45), mean time 20.1 min
    • Experiential: 82.2% (37/45), mean time 19.1 min (+4.4 pp)
    • Curated: 84.4% (38/45), mean time 18.5 min (+6.7 pp)
    • Comprehensive: 86.7% (39/45), mean time 17.1 min (+8.9 pp)
  • Statistical effect sizes: five of six pairwise Cohen's h values fall below 0.2, the conventional small-effect threshold; only No-Skills vs Comprehensive exceeds it, at h = 0.23 (just above the threshold).
  • Token economy: Comprehensive consumed ~75× more procedural-context tokens than No‑Skills for a non‑significant +8.9 pp—making No‑Skills the cost‑efficient engineering choice in this domain.
  • Non‑monotonicity / negative delta: in a timing side‑channel task, adding experiential lessons hurt performance (false lesson propagation), showing Skills can degrade performance when they encourage inappropriate methods.
  • Feedback‑bandwidth hypothesis (H1): marginal benefit of Skills inversely related to how deterministic, schema‑rich, and low‑latency the environment feedback is.
    • Predictions: lowering feedback bandwidth should increase Skills benefit; tasks with dense/immediate verifiers should show smaller Skills deltas; contradictory procedural knowledge can induce negative deltas.
  • Relation to prior work: complements SkillsBench by explaining domain heterogeneity—high gains in healthcare/manufacturing likely reflect low feedback bandwidth; cybersecurity here is high‑bandwidth and shows small/no gains.
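The deltas and effect sizes listed above can be recomputed from the reported pass counts. The following is a minimal sketch, not the authors' pipeline: the variable names are illustrative, Cohen's h is taken as the absolute difference of arcsine-transformed proportions, and the ratio of documentation line counts is used here only as a rough proxy for the reported ~75× token ratio.

```python
import math
from itertools import combinations

# Pass counts out of 45 trials per condition, as reported in the summary.
PASSES = {"No-Skills": 35, "Experiential": 37, "Curated": 38, "Comprehensive": 39}
# Lines of procedural context per condition, as reported in the summary.
LINES = {"No-Skills": 55, "Experiential": 1478, "Curated": 1976, "Comprehensive": 4147}
TRIALS = 45


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: absolute difference of arcsine-transformed proportions."""
    return abs(2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2)))


# Pass rate per condition and its gap to the No-Skills baseline, in pp.
rates = {name: k / TRIALS for name, k in PASSES.items()}
deltas_pp = {name: 100 * (r - rates["No-Skills"]) for name, r in rates.items()}

# Effect size for each of the six unordered pairs of conditions.
effect_sizes = {
    (a, b): cohens_h(rates[a], rates[b]) for a, b in combinations(rates, 2)
}

# Assumption: line counts stand in for token counts.
line_ratio = LINES["Comprehensive"] / LINES["No-Skills"]

if __name__ == "__main__":
    for name in rates:
        print(f"{name:14s} rate={rates[name]:.3f} delta={deltas_pp[name]:+.1f} pp")
    for (a, b), h in effect_sizes.items():
        flag = "below 0.2" if h < 0.2 else "at or above 0.2"
        print(f"h({a}, {b}) = {h:.2f} ({flag})")
    print(f"line-count ratio Comprehensive / No-Skills = {line_ratio:.1f}")
```

Working from raw counts rather than rounded percentages avoids small rounding discrepancies in the pp deltas.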

Data & Methods

  • Source: Reanalysis of an MCP‑grounded autonomous CTF agent study (15 multi‑phase challenges across memory, reverse engineering, web exploitation, cryptography).
  • Design: 15 challenges × 4 documentation conditions × 3 independent trials = 180 trajectories. Model and tool layer (Claude Sonnet 4.5, MCP servers exposing Nmap, Ghidra, Angr, GDB with strict JSON schemas) held constant.
  • Documentation conditions (lines of procedural context): 55 (Minimal/No‑Skills), 1,478 (Experiential), 1,976 (Curated), 4,147 (Comprehensive).
  • Outcomes: pass/fail per trajectory; solve times. Statistical tests: χ² for independence, Cochran–Armitage trend test, Kruskal–Wallis for durations, Cohen's h for effect sizes.
  • Limitations: single backbone model, 15 challenges with 3 trials each (statistical power limited), domain specific (offensive cybersecurity), observational reanalysis rather than prospective cross‑domain manipulation of feedback bandwidth.
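The two pass-rate tests named above can be sketched from the aggregate counts alone, using only the standard library. This is an illustrative reconstruction rather than the authors' code: it assumes a 4×2 pass/fail table (3 degrees of freedom), equally spaced ordinal scores 0 to 3 for the trend test, and a two-sided normal approximation. The Kruskal–Wallis test on durations is omitted because per-run solve times are not given here.

```python
import math

# Pass counts and trial counts in order of documentation richness:
# No-Skills, Experiential, Curated, Comprehensive.
passes = [35, 37, 38, 39]
trials = [45, 45, 45, 45]
scores = [0, 1, 2, 3]  # assumption: equally spaced ordinal scores


def chi_square_independence(passes, trials):
    """Pearson chi-square statistic for a k x 2 pass/fail table."""
    p_bar = sum(passes) / sum(trials)
    stat = 0.0
    for x, n in zip(passes, trials):
        exp_pass, exp_fail = n * p_bar, n * (1 - p_bar)
        stat += (x - exp_pass) ** 2 / exp_pass
        stat += ((n - x) - exp_fail) ** 2 / exp_fail
    return stat


def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variable with exactly 3 df."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


def cochran_armitage(passes, trials, scores):
    """Cochran-Armitage trend z statistic and two-sided normal p-value."""
    total_n = sum(trials)
    p_bar = sum(passes) / total_n
    t = sum(s * (x - n * p_bar) for s, x, n in zip(scores, passes, trials))
    var = p_bar * (1 - p_bar) * (
        sum(n * s * s for s, n in zip(scores, trials))
        - sum(n * s for s, n in zip(scores, trials)) ** 2 / total_n
    )
    z = t / math.sqrt(var)
    return z, math.erfc(abs(z) / math.sqrt(2))


chi2_stat = chi_square_independence(passes, trials)
chi2_p = chi2_sf_df3(chi2_stat)
trend_z, trend_p = cochran_armitage(passes, trials, scores)

if __name__ == "__main__":
    print(f"chi-square = {chi2_stat:.2f}, p = {chi2_p:.2f}")
    print(f"Cochran-Armitage z = {trend_z:.2f}, p = {trend_p:.2f}")
```

Under these assumptions the p-values round to the reported 0.71 and 0.25. The closed-form tail function is specific to 3 degrees of freedom; a general implementation would need an incomplete gamma function.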

Implications for AI Economics

  • Marginal value and substitute goods: Procedural Skills act as a substitute for environment feedback rather than an additive input, so their marginal ROI depends strongly on feedback bandwidth. Where tooling and verifiers already provide high-quality feedback, investments in Skills yield low marginal returns.
  • Cost‑benefit engineering rule: compute cost per percentage‑point improvement (or per expected utility unit). In the studied domain, the token/context cost of Comprehensive Skills (≈75×) is not justified by the small, statistically weak performance gain—favor investments that improve tool grounding (schema fidelity, verifier speed) first.
  • Product and marketplace design:
    • Pricing and curation of Skills marketplaces should be domain‑sensitive: Skills for low‑feedback‑bandwidth domains (healthcare, manufacturing, complex enterprise workflows) are higher value and can command higher prices; in high‑bandwidth domains the market value should be lower.
    • Platform incentives: allocate engineering resources and platform primitives toward improving deterministic, structured tooling and verifiers where feasible, since this can substitute for costly skill production.
  • Portfolio and resource allocation: Organizations should treat Skills development as a contingent investment. Prioritize (in order):
    • First, improve tool interfaces (schema enforcement, low-latency verifiers).
    • Only then assess residual failure modes to justify curated Skills. This reduces redundant spending on Skill authoring when tooling would deliver larger marginal gains.
  • Risk management and externalities: Bad or misaligned Skills can actively harm performance (negative deltas). Skills marketplaces should support provenance, validation against verifiers, and domain‑specific testing to avoid deploying harmful procedural packages.
  • Evaluation and procurement metrics: Move beyond aggregate pass‑rate deltas to cost‑aware metrics (e.g., cost-per-pp, expected value of information from added procedural context) and report Skill efficacy conditional on environment feedback profile.
  • Research & policy priorities: Fund comparative experiments that vary feedback bandwidth (schema vs raw, fast vs slow verifiers) to quantify substitution elasticities between Skills and tooling—these elasticities determine optimal investment mixes for firms and public purchasers of AI systems.
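The cost-per-pp metric suggested above can be illustrated with the figures from this study. This is a hypothetical sketch: it uses added documentation lines as a stand-in for token cost (per-condition token counts are not given here), and because none of the pass-rate gains are statistically significant, the resulting ratios are illustrative rather than reliable estimates.

```python
# Pass counts (out of 45) and documentation line counts, as reported.
PASSES = {"No-Skills": 35, "Experiential": 37, "Curated": 38, "Comprehensive": 39}
LINES = {"No-Skills": 55, "Experiential": 1478, "Curated": 1976, "Comprehensive": 4147}
TRIALS = 45
BASELINE = "No-Skills"


def cost_per_pp(condition: str) -> float:
    """Added context lines per percentage point gained over the baseline.

    Assumption: line count is a proxy for token cost.
    """
    extra_lines = LINES[condition] - LINES[BASELINE]
    gain_pp = 100 * (PASSES[condition] - PASSES[BASELINE]) / TRIALS
    return extra_lines / gain_pp


# Cost per pp for every Skills condition relative to No-Skills.
cost = {name: cost_per_pp(name) for name in PASSES if name != BASELINE}

if __name__ == "__main__":
    for name, value in cost.items():
        print(f"{name:14s} {value:7.1f} added lines per pp")
```

A production version of this metric would use measured token counts and would propagate the uncertainty in the pass-rate gain, since a gain indistinguishable from zero makes the ratio unstable.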

Summary takeaway: Agent Skills are not universally high‑ROI; their economic value depends on the environment's ability to provide structured, fast feedback. Firms and platforms should treat Skills as a domain‑conditional, compensatory lever and prioritize improvements to environment/tooling when feedback bandwidth can be raised cost‑effectively.

Assessment

Paper Type: quasi_experimental
Evidence Strength: medium. The analysis uses a reasonably sized controlled runset (180 runs) and appropriate statistical tests to compare explicit ablation conditions, and finds small and statistically indistinguishable differences; however, effects are domain- and architecture-specific, many tests are non-significant, the moderator (environment-feedback bandwidth) is argued rather than cleanly randomized, and external validity beyond the MCP-grounded CTF setting is limited.
Methods Rigor: medium. Strengths include a clear ablation design, use of multiple statistical tests and effect-size metrics, and re-analysis transparency (pipeline release); weaknesses include limited scope (single domain and agent design), mostly post-hoc moderation inference rather than a fully factorial manipulation of feedback bandwidth, non-significant p-values for main contrasts, and potential confounds from task heterogeneity and measurement choices.
Sample: 180 experimental runs of an MCP-grounded autonomous Capture-the-Flag (offensive cybersecurity) agent executed under four documentation-length conditions corresponding to No-Skills (55 lines), Experiential-Skills (1,478 lines), Curated-Skills (1,976 lines), and Comprehensive-Skills (4,147 lines); outcome reported as task pass/success rates across tasks including a timing side-channel setting; statistical comparisons use χ², Cochran–Armitage trend test, and Cohen's h for pairwise effects.
Themes: productivity, skills_training
Identification: Comparative ablation across four documentation/Skills conditions (No-Skills, Experiential-, Curated-, and Comprehensive-Skills) in a 180-run controlled Capture-the-Flag agent study, using between-condition contrasts and trend tests (χ² tests, Cochran–Armitage) and effect-size measures (Cohen's h) to attribute differences in task pass rates to the presence/extent of Skills; supplemented by a post-hoc moderator argument that environment-feedback bandwidth explains heterogeneity in effects.
Generalizability:
  • Single domain (offensive cybersecurity CTF): may not generalize to other task domains (e.g., coding, writing, customer service).
  • Single agent architecture (MCP-grounded agent) and specific tool/interface behavior: different agent designs might interact with Skills differently.
  • Environmental feedback characteristics (schema-validated, low-latency observations) are particular to this setup; results hinge on this moderator and may reverse with lower-bandwidth feedback.
  • Sample size is moderate but condition-level power may be limited; heterogeneous task difficulty could mask effects.
  • Findings are from controlled runs rather than field/economy-level outcomes, so implications for labor, firm productivity, or wages are indirect.

Claims (9)

  • Claim: Agent Skills, structured packages of procedural knowledge loaded into an LLM agent at inference time, are widely reported to improve task pass rates by an average of 16.2 percentage points across diverse domains.
    • Output Quality · Direction: positive · Confidence: high
    • Outcome: task pass rate (task success rate)
    • Details: n=84; 16.2 percentage points; 0.48
  • Claim: In those same benchmarks, 16 of 84 tasks suffered negative deltas when Skills are introduced.
    • Output Quality · Direction: negative · Confidence: high
    • Outcome: task-level performance delta when Skills are introduced (negative change in pass rate)
    • Details: n=84; 16 of 84 tasks; 0.48
  • Claim: We re-analyze a recently published 180-run controlled study of an MCP-grounded autonomous Capture-the-Flag (CTF) agent under four documentation conditions (55, 1,478, 1,976, and 4,147 lines).
    • Other · Direction: null_result · Confidence: high
    • Outcome: reanalysis dataset size and documentation-condition line counts
    • Details: n=180; 0.8
  • Claim: Those four documentation conditions correspond almost exactly to a No-Skills, Experiential-Skills, Curated-Skills, and Comprehensive-Skills ablation.
    • Other · Direction: null_result · Confidence: high
    • Outcome: mapping of documentation richness to Skill-ablation categories
    • Details: n=180; 0.48
  • Claim: In offensive cybersecurity, the marginal benefit of Skills collapses: the spread between the no-Skills and full-Skills conditions is only 8.9 percentage points (p = 0.71, χ²; p = 0.25, Cochran–Armitage trend test; five of six pairwise Cohen's h values fall below the 0.2 small-effect threshold).
    • Output Quality · Direction: null_result · Confidence: high
    • Outcome: task pass rate (success rate) in Capture-the-Flag offensive cybersecurity tasks
    • Details: n=180; 8.9 percentage points (p = 0.71, χ²; p = 0.25, Cochran–Armitage; five of six pairwise Cohen's h < 0.2); 0.8
  • Claim: When an agent's tool layer returns strict, schema-validated, low-latency observations, the environment itself supplies the procedural correction signal that Skills are normally needed to provide.
    • Task Allocation · Direction: null_result · Confidence: medium
    • Outcome: degree to which environment feedback substitutes for Skills (procedural correction signal)
    • Details: 0.05
  • Claim: As a result (of high environment-feedback bandwidth), the marginal benefit of curated Skills diminishes substantially and, in some cases (e.g., our timing side-channel setting), actively degrades performance.
    • Output Quality · Direction: negative · Confidence: medium
    • Outcome: task performance (including degradation in timing side-channel setting) when Skills are added
    • Details: 0.29
  • Claim: The community has not yet articulated a clean mechanism for when Skills help and when they are merely redundant overhead.
    • Governance And Regulation · Direction: null_result · Confidence: medium
    • Outcome: state of community understanding / theoretical mechanism for Skill utility
    • Details: 0.14
  • Claim: We will release the reanalysis pipeline to support replication.
    • Other · Direction: null_result · Confidence: high
    • Outcome: availability of reanalysis pipeline (planned release)
    • Details: 0.24

Notes