The Commonplace

Prepackaged 'skills' for coding agents rarely move the needle: in a 565-task benchmark across 49 skills, most skills produced no test-pass improvements and only a handful delivered substantial gains, with some even harming outcomes when guidance conflicted with project context.

SWE-Skills-Bench: Do Agent Skills Actually Help in Real-World Software Engineering?
Tingxu Han, Yi Zhang, Wei Song, Chunrong Fang, Zhenyu Chen, Youcheng Sun, Lijie Hu · March 16, 2026
arxiv · quasi_experimental · medium evidence · 7/10 relevance
In a controlled benchmark of ~565 real-world development tasks, injecting prebuilt agent skills yielded a small average pass-rate gain (+1.2%), with 39 of 49 skills giving no improvement and only seven producing meaningful gains (up to +30%), while some skills decreased performance due to context/version mismatches.

Agent skills, structured procedural knowledge packages injected at inference time, are increasingly used to augment LLM agents on software engineering tasks. However, their real utility in end-to-end development settings remains unclear. We present SWE-Skills-Bench, the first requirement-driven benchmark that isolates the marginal utility of agent skills in real-world software engineering (SWE). It pairs 49 public SWE skills with authentic GitHub repositories pinned at fixed commits and requirement documents with explicit acceptance criteria, yielding approximately 565 task instances across six SWE subdomains. We introduce a deterministic verification framework that maps each task's acceptance criteria to execution-based tests, enabling controlled paired evaluation with and without the skill. Our results show that skill injection benefits are far more limited than rapid adoption suggests: 39 of 49 skills yield zero pass-rate improvement, and the average gain is only +1.2%. Token overhead varies from modest savings to a 451% increase while pass rates remain unchanged. Only seven specialized skills produce meaningful gains (up to +30%), while three degrade performance (up to -10%) due to version-mismatched guidance conflicting with project context. These findings suggest that agent skills are a narrow intervention whose utility depends strongly on domain fit, abstraction level, and contextual compatibility. SWE-Skills-Bench provides a testbed for evaluating the design, selection, and deployment of skills in software engineering agents. SWE-Skills-Bench is available at https://github.com/GeniusHTX/SWE-Skills-Bench.

Summary

Main Finding

Agent "skills"—structured procedural knowledge packages injected at inference time—provide only limited marginal utility in end-to-end, real-world software engineering workflows. In a controlled, requirement-driven benchmark (SWE-Skills-Bench), most skills produced no measurable improvement in automated acceptance testing; a small subset produced meaningful gains, and a few even degraded performance. Token cost overheads vary widely and often do not correlate with benefit.

Key Points

  • Benchmark scale and scope
    • 49 public SWE skills evaluated.
    • ~565 task instances drawn from authentic GitHub repositories pinned at fixed commits, paired with requirement documents that include explicit acceptance criteria.
    • Tasks span six real-world software engineering subdomains.
  • Evaluation approach
    • Deterministic verification framework maps acceptance criteria to execution-based tests, enabling objective pass/fail measurement.
    • Controlled paired evaluation: each task is run with and without the skill to isolate marginal effect.
  • Main quantitative outcomes
    • 39 of 49 skills produced zero pass-rate improvement.
    • Average pass-rate gain across skills: +1.2%.
    • Seven specialized skills produced meaningful gains (up to +30% pass-rate improvement).
    • Three skills degraded performance (up to −10%), typically due to version-mismatched guidance that conflicted with project context.
    • Token overheads varied from modest savings to a 451% increase, often with no corresponding pass-rate improvement.
  • Qualitative insight
    • Skill utility is highly conditional on domain fit, the abstraction level of the skill (high-level strategy vs. low-level concrete steps), and contextual compatibility (project dependencies, versions, coding conventions).

Data & Methods

  • Dataset construction
    • Paired public SWE skills with authentic GitHub repositories pinned to fixed commits to prevent drift and ensure reproducibility.
    • Requirement documents with explicit acceptance criteria created for each task instance, enabling deterministic testing.
    • Total: ~565 task instances across six SWE subdomains (authors provide full listing in the repository).
  • Verification framework
    • Deterministic mapping from acceptance criteria to executable tests (unit/integration/behavioral as appropriate).
    • Execution-based verification used to produce binary pass/fail outcomes per task.
  • Experimental protocol
    • Paired trials: identical prompts and environment, once with the skill injected at inference time and once without it, isolating marginal effect.
    • Measured outcomes: pass-rate delta (with vs without), token usage (overhead or savings), and observed failure modes (e.g., conflicting guidance, version mismatch).
  • Reproducibility
    • SWE-Skills-Bench, test harness, and artifacts are publicly available: https://github.com/GeniusHTX/SWE-Skills-Bench
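
The paired protocol above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' harness: the `PairedRun` record, both function names, and the sample data are hypothetical. It also assumes that the pass-rate delta is expressed in percentage points and that token overhead is aggregated over all tasks of a skill, neither of which this summary confirms.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PairedRun:
    """One task executed twice: without the skill (baseline) and with it."""
    task_id: str
    passed_baseline: bool
    passed_with_skill: bool
    tokens_baseline: int
    tokens_with_skill: int


def pass_rate_delta(runs: Sequence[PairedRun]) -> float:
    """Pass rate with the skill minus pass rate without it, in percentage points."""
    if not runs:
        raise ValueError("need at least one paired run")
    with_skill = sum(r.passed_with_skill for r in runs)
    baseline = sum(r.passed_baseline for r in runs)
    return 100.0 * (with_skill - baseline) / len(runs)


def token_overhead_pct(runs: Sequence[PairedRun]) -> float:
    """Percent change in total tokens when the skill is injected (negative = savings)."""
    base = sum(r.tokens_baseline for r in runs)
    if base == 0:
        raise ValueError("baseline token count must be positive")
    skill = sum(r.tokens_with_skill for r in runs)
    return 100.0 * (skill - base) / base


# Made-up data for illustration only; not taken from the benchmark.
runs = [
    PairedRun("task-1", False, True, 1000, 1500),
    PairedRun("task-2", True, True, 2000, 2600),
    PairedRun("task-3", True, True, 1000, 1900),
    PairedRun("task-4", False, False, 1000, 1000),
]

print(pass_rate_delta(runs))     # 25.0
print(token_overhead_pct(runs))  # 40.0
```

Because both runs of a pair share the same task, prompt, and environment, the difference between them can be attributed to the skill alone, which is the point of the paired design.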

Implications for AI Economics

  • Marginal returns are small and highly uneven
    • Most skills offer near-zero marginal utility; a few specialized skills produce significant gains. Investments in skill creation should be targeted and validated against real tasks, not broadly assumed beneficial.
  • Cost-benefit considerations must include token and runtime costs
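    • Token overheads can be large (up to +451%) with no pass-rate improvement. Economic evaluations should compute the cost per additional passed task (extra tokens × price per token / Δpasses) rather than treating skills as free add-ons; the ratio is undefined when a skill adds no passes, which is the case for most skills in this benchmark.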
  • Productization and skill marketplaces
    • Market value for skills will depend on domain specificity and compatibility assurances (e.g., supported language versions, dependency constraints). General-purpose skills are less likely to command high prices.
  • Procurement and deployment decisions
    • Organizations should require benchmarked, requirement-driven evidence of marginal utility before deploying skills in production workflows. Blindly adding skills can increase operational costs or reduce reliability (version-mismatch harms).
  • Labor-substitution claims should be tempered
    • The narrow, context-sensitive gains suggest that expectations of broad SWE automation through off-the-shelf skills are overly optimistic. Human expertise in adapting guidance to project context remains valuable.
  • Incentives for better skill design
    • Greater returns are likely for skills that encode context-aware, version-compatible, low-level operations tied to concrete acceptance tests. Economic incentives (pricing, contracts) should favor verifiable outcomes.
  • Research and policy
    • Benchmarks like SWE-Skills-Bench are crucial infrastructure for measuring real-world returns and for informing standardization (skill metadata, versioning, compatibility metadata). Regulators and purchasers should prefer verifiable performance claims.
  • Suggested economic metrics to adopt
    • Δpass-rate per unit token cost (or $), expected improvement conditional on domain, probability of negative impact due to mismatch, and "compatibility score" reflecting alignment with project context and dependencies.
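
The cost-per-additional-pass idea from the list above can be written as a short function. This is a sketch under stated assumptions: the function name, the per-million-token pricing convention, and all numbers are hypothetical and do not come from the paper.

```python
from typing import Optional


def cost_per_additional_pass(
    extra_tokens: int,
    price_per_million_tokens: float,
    delta_passes: int,
) -> Optional[float]:
    """Extra spend divided by extra passed tasks.

    Returns None when the skill adds no passes (delta_passes <= 0),
    because the ratio is not meaningful in that case.
    """
    if delta_passes <= 0:
        return None
    extra_cost = extra_tokens / 1_000_000 * price_per_million_tokens
    return extra_cost / delta_passes


# Hypothetical inputs: 2M extra tokens, $3 per million tokens, 4 additional passes.
example = cost_per_additional_pass(2_000_000, 3.0, 4)
# A skill that costs tokens but adds no passes has no finite cost per pass.
no_gain = cost_per_additional_pass(2_000_000, 3.0, 0)

print(example)  # 1.5
print(no_gain)  # None
```

Returning `None` rather than a large number keeps zero-gain skills from being ranked alongside skills that do help. A negative result means the skill both saved tokens and added passes.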


Assessment

Paper Type: quasi_experimental

Evidence Strength: medium. The benchmark uses controlled, paired evaluations and execution-based verification, which give direct, task-level measures of marginal skill utility, but strength is limited by scope (49 public skills, ~565 tasks), potential sensitivity to the particular LLM(s), repository selection, how acceptance criteria are translated to tests, and the fact that pass rate is a coarse proxy for real developer productivity or firm-level economic outcomes.

Methods Rigor: high. The authors construct a reproducible benchmark with pinned commits, explicit requirement documents, and a deterministic verification framework to enable paired comparisons; they measure both correctness (pass rates) and token overhead and report negative as well as positive effects. This is methodologically robust for a systems/benchmark study, though the testing translation, skill selection, and single-environment dependence remain limitations.

Sample: 49 publicly available software-engineering 'skills' paired with authentic GitHub repositories pinned at fixed commits and requirement documents with explicit acceptance criteria, producing ~565 task instances across six software-engineering subdomains; tasks are evaluated via programmatic execution-based tests derived from acceptance criteria in paired runs with and without each skill.

Themes: productivity, human_ai_collab

Identification: Paired, within-task controlled comparisons: for each requirement-driven task (fixed GitHub repo and commit) the authors run the same agent with and without a single injected skill and evaluate both runs against deterministic, execution-based acceptance tests that map requirement criteria to pass/fail outcomes, isolating the marginal effect of the skill.

Generalizability

  • Results may depend on the specific LLM(s) and agent runtime versions used; different model generations could change outcomes.
  • Benchmark uses publicly available repositories and selected skills; commercial, proprietary, or internally developed skills may behave differently.
  • Pinned-commit, single-environment setup may not capture real-world continuous development workflows, CI variability, or team processes.
  • Evaluation focuses on test-pass rates (functional correctness) and token costs, not on developer time saved, code maintainability, or downstream economic metrics.
  • Skill utility is likely sensitive to language, framework, and project-specific context; coverage across SWE domains and languages is limited.
  • Translation of requirement acceptance criteria into executable tests could introduce measurement bias or miss qualitative improvements.

Claims (12)

Claim | Outcome category | Direction | Confidence | Outcome | Details
--- | --- | --- | --- | --- | ---
SWE-Skills-Bench is the first requirement-driven benchmark that isolates the marginal utility of agent skills in real-world software engineering (SWE). | Other | null_result | medium | existence/novelty of a requirement-driven benchmark for evaluating marginal utility of agent skills | 0.29
SWE-Skills-Bench pairs 49 public SWE skills with authentic GitHub repositories pinned at fixed commits and requirement documents with explicit acceptance criteria, yielding approximately 565 task instances across six SWE subdomains. | Other | null_result | high | number of skill-repo-task instances (~565) and coverage across six subdomains | n=565; 0.48
The authors introduce a deterministic verification framework that maps each task's acceptance criteria to execution-based tests, enabling controlled paired evaluation with and without the skill. | Other | null_result | high | ability to deterministically verify task acceptance criteria via execution-based tests and support paired evaluation | 0.48
Skill injection benefits are far more limited than rapid adoption suggests. | Output Quality | negative | medium | marginal utility of skill injection measured as change in acceptance-test pass rate | n=565; 0.29
39 of 49 skills yield zero pass-rate improvement. | Output Quality | null_result | high | change in task acceptance-test pass rate (zero improvement) | n=49; 0.48
The average gain from injecting skills is only +1.2% in pass rate. | Output Quality | positive | high | average change in acceptance-test pass rate (+1.2%) | n=565; 0.48
Token overhead varies from modest savings to a 451% increase while pass rates remain unchanged. | Organizational Efficiency | mixed | high | token usage/overhead (percent change) and its relation to pass rates | 0.48
Only seven specialized skills produce meaningful gains (up to +30%). | Output Quality | positive | high | number of skills with meaningful positive pass-rate gains and magnitude (up to +30%) | n=49; 0.48
Three skills degrade performance (up to -10%) due to version-mismatched guidance conflicting with project context. | Output Quality | negative | medium | pass-rate decrease (up to -10%) and qualitative cause attribution (version/context mismatch) | n=49; 0.29
These findings suggest that agent skills are a narrow intervention whose utility depends strongly on domain fit, abstraction level, and contextual compatibility. | Other | mixed | medium | qualitative assessment of conditions affecting utility of agent skills (domain fit, abstraction level, contextual compatibility) | 0.29
SWE-Skills-Bench provides a testbed for evaluating the design, selection, and deployment of skills in software engineering agents. | Other | null_result | high | availability of a benchmarking testbed for evaluating agent skills | 0.48
SWE-Skills-Bench is available at https://github.com/GeniusHTX/SWE-Skills-Bench. | Other | null_result | high | public availability (URL) of the benchmark | 0.48

Notes