Prepackaged 'skills' for coding agents rarely move the needle: in a 565-task benchmark across 49 skills, most skills produced no test-pass improvements and only a handful delivered substantial gains, with some even harming outcomes when guidance conflicted with project context.
Agent skills, structured procedural knowledge packages injected at inference time, are increasingly used to augment LLM agents on software engineering tasks. However, their real utility in end-to-end development settings remains unclear. We present SWE-Skills-Bench, the first requirement-driven benchmark that isolates the marginal utility of agent skills in real-world software engineering (SWE). It pairs 49 public SWE skills with authentic GitHub repositories pinned at fixed commits and requirement documents with explicit acceptance criteria, yielding approximately 565 task instances across six SWE subdomains. We introduce a deterministic verification framework that maps each task's acceptance criteria to execution-based tests, enabling controlled paired evaluation with and without the skill. Our results show that skill injection benefits are far more limited than rapid adoption suggests: 39 of 49 skills yield zero pass-rate improvement, and the average gain is only +1.2%. Token overhead varies from modest savings to a 451% increase while pass rates remain unchanged. Only seven specialized skills produce meaningful gains (up to +30%), while three degrade performance (up to -10%) due to version-mismatched guidance conflicting with project context. These findings suggest that agent skills are a narrow intervention whose utility depends strongly on domain fit, abstraction level, and contextual compatibility. SWE-Skills-Bench provides a testbed for evaluating the design, selection, and deployment of skills in software engineering agents. SWE-Skills-Bench is available at https://github.com/GeniusHTX/SWE-Skills-Bench.
Summary
Main Finding
Agent "skills"—structured procedural knowledge packages injected at inference time—provide only limited marginal utility in end-to-end, real-world software engineering workflows. In a controlled, requirement-driven benchmark (SWE-Skills-Bench), most skills produced no measurable improvement in automated acceptance testing; a small subset produced meaningful gains, and a few even degraded performance. Token cost overheads vary widely and often do not correlate with benefit.
Key Points
- Benchmark scale and scope
  - 49 public SWE skills evaluated.
  - ~565 task instances drawn from authentic GitHub repositories pinned at fixed commits, paired with requirement documents that include explicit acceptance criteria.
  - Tasks span six real-world software engineering subdomains.
- Evaluation approach
  - Deterministic verification framework maps acceptance criteria to execution-based tests, enabling objective pass/fail measurement.
  - Controlled paired evaluation: each task is run with and without the skill to isolate marginal effect.
- Main quantitative outcomes
  - 39 of 49 skills produced zero pass-rate improvement.
  - Average pass-rate gain across skills: +1.2%.
  - Seven specialized skills produced meaningful gains (up to +30% pass-rate improvement).
  - Three skills degraded performance (up to −10%), typically due to version-mismatched guidance that conflicted with project context.
  - Token overheads varied from modest savings to a 451% increase, often with no corresponding pass-rate improvement.
- Qualitative insight
  - Skill utility is highly conditional on domain fit, the abstraction level of the skill (high-level strategy vs. low-level concrete steps), and contextual compatibility (project dependencies, versions, coding conventions).
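As a quick arithmetic check on the counts above, the sketch below converts the three reported groups into shares of the 49 skills. It assumes the groups are disjoint, which the counts are consistent with (39 + 7 + 3 = 49) but which the summary does not state outright; the bucket names and the `shares` helper are illustrative, not taken from the benchmark.

```python
# Counts as reported in the summary; the bucket names are illustrative labels.
TOTAL_SKILLS = 49
reported = {"no_improvement": 39, "meaningful_gain": 7, "degraded": 3}


def shares(counts, total):
    """Return each bucket's share of `total` as a percentage (1 decimal place)."""
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


# Assumption: the three groups partition the full set of skills.
assert sum(reported.values()) == TOTAL_SKILLS

skill_shares = shares(reported, TOTAL_SKILLS)
print(skill_shares)
# {'no_improvement': 79.6, 'meaningful_gain': 14.3, 'degraded': 6.1}
```

Read this way, roughly four in five skills left the pass rate unchanged, and about one in seven produced a meaningful gain.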
Data & Methods
- Dataset construction
  - Paired public SWE skills with authentic GitHub repositories pinned to fixed commits to prevent drift and ensure reproducibility.
  - Requirement documents with explicit acceptance criteria created for each task instance, enabling deterministic testing.
  - Total: ~565 task instances across six SWE subdomains (authors provide full listing in the repository).
- Verification framework
  - Deterministic mapping from acceptance criteria to executable tests (unit/integration/behavioral as appropriate).
  - Execution-based verification used to produce binary pass/fail outcomes per task.
- Experimental protocol
  - Paired trials: identical prompts and environment, once with the skill injected at inference time and once without it, isolating marginal effect.
  - Measured outcomes: pass-rate delta (with vs. without), token usage (overhead or savings), and observed failure modes (e.g., conflicting guidance, version mismatch).
- Reproducibility
  - SWE-Skills-Bench, test harness, and artifacts are publicly available: https://github.com/GeniusHTX/SWE-Skills-Bench
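The paired protocol above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual harness: `TrialResult`, `task_passes`, and `paired_summary` are hypothetical names, acceptance checks are modeled as plain callables, and the pass-rate delta is expressed in percentage points, which is an assumption about how the reported figures are defined.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence


@dataclass(frozen=True)
class TrialResult:
    passed: bool  # binary outcome of the execution-based tests
    tokens: int   # tokens consumed by the agent on this task


def task_passes(checks: Sequence[Callable[[], bool]]) -> bool:
    """A task passes only if every acceptance check passes."""
    return all(check() for check in checks)


def pass_rate(results: Sequence[TrialResult]) -> float:
    """Fraction of tasks whose tests passed."""
    return sum(r.passed for r in results) / len(results)


def paired_summary(
    baseline: Sequence[TrialResult], with_skill: Sequence[TrialResult]
) -> Dict[str, float]:
    """Compare two runs over the same tasks, listed in the same order."""
    if not baseline or len(baseline) != len(with_skill):
        raise ValueError("paired evaluation needs equal, non-empty result lists")
    delta_pp = 100 * (pass_rate(with_skill) - pass_rate(baseline))
    base_tokens = sum(r.tokens for r in baseline)
    skill_tokens = sum(r.tokens for r in with_skill)
    overhead_pct = 100 * (skill_tokens - base_tokens) / base_tokens
    return {"pass_rate_delta_pp": delta_pp, "token_overhead_pct": overhead_pct}


# Toy numbers for illustration only, not benchmark data.
baseline = [TrialResult(True, 1000), TrialResult(False, 1000)]
with_skill = [TrialResult(True, 1500), TrialResult(False, 1500)]
summary = paired_summary(baseline, with_skill)
print(summary)  # {'pass_rate_delta_pp': 0.0, 'token_overhead_pct': 50.0}
```

The toy run mirrors the pattern the study reports for most skills: token usage rises while the pass rate does not move.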
Implications for AI Economics
- Marginal returns are small and highly uneven
  - Most skills offer near-zero marginal utility; a few specialized skills produce significant gains. Investments in skill creation should be targeted and validated against real tasks, not broadly assumed beneficial.
- Cost-benefit considerations must include token and runtime costs
  - Token overheads can be large (up to +451%) with no accuracy improvement. Economic evaluations should compute cost per additional passed task (additional tokens × price per token / Δpasses) rather than treating skills as free add-ons.
- Productization and skill marketplaces
  - Market value for skills will depend on domain specificity and compatibility assurances (e.g., supported language versions, dependency constraints). General-purpose skills are less likely to command high prices.
- Procurement and deployment decisions
  - Organizations should require benchmarked, requirement-driven evidence of marginal utility before deploying skills in production workflows. Blindly adding skills can increase operational costs or reduce reliability (version-mismatch harms).
- Labor-substitution claims should be tempered
  - Narrow, context-sensitive gains suggest that expectations for broad automation of SWE through off-the-shelf skills are overly optimistic. Human expertise in context adaptation remains valuable.
- Incentives for better skill design
  - Greater returns are likely for skills that encode context-aware, version-compatible, low-level operations tied to concrete acceptance tests. Economic incentives (pricing, contracts) should favor verifiable outcomes.
- Research and policy
  - Benchmarks like SWE-Skills-Bench are crucial infrastructure for measuring real-world returns and for informing standardization (skill metadata, versioning, compatibility metadata). Regulators and purchasers should prefer verifiable performance claims.
- Suggested economic metrics to adopt
  - Δpass-rate per unit token cost (or $), expected improvement conditional on domain, probability of negative impact due to mismatch, and "compatibility score" reflecting alignment with project context and dependencies.
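The cost-per-additional-pass metric suggested above can be written out as a small helper. This is a sketch under stated assumptions: the function name, the per-million-token pricing convention, and every number in the example are hypothetical and do not come from the benchmark.

```python
from typing import Optional


def cost_per_additional_pass(
    extra_tokens: int, price_per_million_tokens: float, delta_passes: int
) -> Optional[float]:
    """Extra spend divided by extra passed tasks.

    Returns None when the skill adds no passes, because the ratio is
    undefined at zero and misleading when the skill loses passes.
    A negative result means the skill gained passes while saving tokens.
    """
    if delta_passes <= 0:
        return None
    extra_cost = extra_tokens / 1_000_000 * price_per_million_tokens
    return extra_cost / delta_passes


# Hypothetical inputs: 2M extra tokens at $3 per million, 4 more tasks passed.
example_cost = cost_per_additional_pass(2_000_000, 3.0, 4)
print(example_cost)  # 1.5 (dollars per additional passed task)

# A skill that adds tokens without adding passes has no defined cost per pass.
print(cost_per_additional_pass(2_000_000, 3.0, 0))  # None
```

Returning `None` for the zero-gain case keeps the most common outcome in the study (more tokens, same pass rate) from being silently reported as an infinite or zero cost.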
Assessment
Claims (12)
| Claim | Category | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|---|
| SWE-Skills-Bench is the first requirement-driven benchmark that isolates the marginal utility of agent skills in real-world software engineering (SWE). | Other | null_result | medium | existence/novelty of a requirement-driven benchmark for evaluating marginal utility of agent skills | 0.29 |
| SWE-Skills-Bench pairs 49 public SWE skills with authentic GitHub repositories pinned at fixed commits and requirement documents with explicit acceptance criteria, yielding approximately 565 task instances across six SWE subdomains. | Other | null_result | high | number of skill-repo-task instances (~565) and coverage across six subdomains | n=565; 0.48 |
| The authors introduce a deterministic verification framework that maps each task's acceptance criteria to execution-based tests, enabling controlled paired evaluation with and without the skill. | Other | null_result | high | ability to deterministically verify task acceptance criteria via execution-based tests and support paired evaluation | 0.48 |
| Skill injection benefits are far more limited than rapid adoption suggests. | Output Quality | negative | medium | marginal utility of skill injection measured as change in acceptance-test pass rate | n=565; 0.29 |
| 39 of 49 skills yield zero pass-rate improvement. | Output Quality | null_result | high | change in task acceptance-test pass rate (zero improvement) | n=49; 0.48 |
| The average gain from injecting skills is only +1.2% in pass rate. | Output Quality | positive | high | average change in acceptance-test pass rate (+1.2%) | n=565; 0.48 |
| Token overhead varies from modest savings to a 451% increase while pass rates remain unchanged. | Organizational Efficiency | mixed | high | token usage/overhead (percent change) and its relation to pass rates | 0.48 |
| Only seven specialized skills produce meaningful gains (up to +30%). | Output Quality | positive | high | number of skills with meaningful positive pass-rate gains and magnitude (up to +30%) | n=49; 0.48 |
| Three skills degrade performance (up to -10%) due to version-mismatched guidance conflicting with project context. | Output Quality | negative | medium | pass-rate decrease (up to -10%) and qualitative cause attribution (version/context mismatch) | n=49; 0.29 |
| These findings suggest that agent skills are a narrow intervention whose utility depends strongly on domain fit, abstraction level, and contextual compatibility. | Other | mixed | medium | qualitative assessment of conditions affecting utility of agent skills (domain fit, abstraction level, contextual compatibility) | 0.29 |
| SWE-Skills-Bench provides a testbed for evaluating the design, selection, and deployment of skills in software engineering agents. | Other | null_result | high | availability of a benchmarking testbed for evaluating agent skills | 0.48 |
| SWE-Skills-Bench is available at https://github.com/GeniusHTX/SWE-Skills-Bench. | Other | null_result | high | public availability (URL) of the benchmark | 0.48 |