The Commonplace

Human-authored procedural 'Skills' lift LLM agent success by 16.2 percentage points on average, though gains vary sharply by domain and Skills sometimes hurt performance; model-generated Skills add no net value, while narrowly targeted Skills let smaller models match larger ones, suggesting firms can substitute curated knowledge for compute.

SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks
Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, Xiaokun Chen, Yifeng He, Yubo Li, Bingran You, Haotian Shen, Jiankai Sun, Shuyi Wang, Binxu Li, Qunhong Zeng, Di Wang, Xuandong Zhao, Yuanli Wang, Roey Ben Chaim, Zonglin Di, Yipeng Gao, Junwei He, Yizhuo He, Liqiang Jing, Luyang Kong, Xin Lan, Jiachen Li, Songlin Li, Yijiang Li, Yueqian Lin, Xinyi Liu, Xuanqing Liu, Haoran Lyu, Ze Ma, Bowei Wang, Runhui Wang, Tianyu Wang, Wengao Ye, Yue Zhang, Hanwen Xing, Yiqi Xue, Steven Dillmann, Han-chung Lee · February 13, 2026
manual · quasi_experimental · medium evidence · 8/10 relevance · Source PDF
Human-curated procedural 'Skills' raise LLM agent task pass rates by an average of 16.2 percentage points (with large domain heterogeneity), while model-authored Skills show no average benefit, and focused, small Skills outperform comprehensive documentation.

Agent Skills are structured packages of procedural knowledge that augment LLM agents at inference time. Despite rapid adoption, there is no standard way to measure whether they actually help. We present SkillsBench, a benchmark of 86 tasks across 11 domains paired with curated Skills and deterministic verifiers. Each task is evaluated under three conditions: no Skills, curated Skills, and self-generated Skills. We test 7 agent-model configurations over 7,308 trajectories. Curated Skills raise average pass rate by 16.2 percentage points (pp), but effects vary widely by domain (+4.5 pp for Software Engineering to +51.9 pp for Healthcare) and 16 of 84 tasks show negative deltas. Self-generated Skills provide no benefit on average, showing that models cannot reliably author the procedural knowledge they benefit from consuming. Focused Skills with 2--3 modules outperform comprehensive documentation, and smaller models with Skills can match larger models without them.

Summary

Main Finding

Curated, human-authored Agent Skills—compact, procedural packages of instructions, templates, and executable resources—substantially improve LLM-agent task success on average (+16.2 percentage points in pass rate across configurations), but effects are highly heterogeneous across domains, tasks, and agent–model combinations. Self-generated Skills (prompts asking the model to author its own procedural knowledge) provide no reliable benefit. Focused Skills (2–3 modules) outperform long, comprehensive documentation, and smaller models augmented with curated Skills can match or approach larger-model performance without Skills.

Key Points

  • Benchmark: SKILLSBENCH — 84 tasks spanning 11 domains (healthcare, software engineering, finance, robotics, etc.), using deterministic verifiers and human-authored instructions; tasks stratified by difficulty (core/extended/extreme).
  • Evaluation matrix: 7 agent–model configurations (Claude Opus/Haiku/Sonnet variants, Gemini 3 Pro/Flash, GPT‑5.2/Codex) across 3 Skills conditions (No Skills, Curated Skills, Self-Generated Skills), totaling 7,308 valid trajectories.
  • Primary metric: binary pass rate (averaged across trials and tasks). Normalized gain is also reported to measure proportional improvement toward perfect performance.
  • Main quantitative results:
    • Curated Skills increase average pass rate by ~16.2 percentage points.
    • Domain heterogeneity: improvements range widely (the paper reports examples from +4.5 pp in Software Engineering up to +51.9 pp in Healthcare); 16 of 84 tasks show negative deltas when Skills are added.
    • Self-generated Skills yield no average benefit (often negligible or negative; models frequently produce imprecise/incomplete procedures).
    • Focused, modular Skills (2–3 well-designed modules) outperform comprehensive, verbose documentation.
    • Smaller models with curated Skills can match larger-model baselines without Skills—implying substitution value of Skills for model capacity.
  • Design/quality controls: Skills are defined to be procedural (not factual retrieval), portable, and modular (SKILL.md + optional resources). Submissions and benchmark tasks underwent automated validation (structural checks, oracle-running) and human review (data realism, anti-cheating, leakage audits).

Data & Methods

  • Dataset construction:
    • Community-sourced: 322 candidate tasks from 105 contributors; after automated + human filtering, 84 tasks retained.
    • Skills corpus aggregated from multiple sources and deduplicated (the paper reports ~47k unique Skills initially aggregated across sources).
    • Tasks packaged as isolated Docker environments including instruction.md, environment data, oracle solution, and deterministic pytest-style verifiers (a minimal verifier sketch follows this list).
  • Skills conditions:
    • No Skills: only instruction.md present.
    • Curated Skills: environment/skills/ populated with SKILL.md and resources.
    • Self-Generated Skills: agent prompted to generate procedural guidance before attempting task (evaluated for 5 of 7 configurations).
  • Agents & models:
    • Harnesses: Claude Code (Anthropic), Gemini CLI (Google), Codex CLI (OpenAI).
    • Models: GPT‑5.2 (Codex harness), Claude Opus 4.5/4.6, Sonnet 4.5, Haiku 4.5, Gemini 3 Pro, Gemini 3 Flash.
    • Deterministic decoding (temperature = 0).
  • Evaluation protocol:
    • For each task-condition-model, agents run until pass/fail/timeout; verifiers produce deterministic binary outcomes.
    • Aggregation: per-task average over 5 trials, then average across 84 tasks to compute pass rates; normalized gain computed as (pass_skill − pass_vanilla) / (1 − pass_vanilla) (see the aggregation sketch after this list).
  • Key methodological safeguards:
    • Skills must not encode instance-specific solutions (leakage audits).
    • All instructions human-authored (both human review and AI-detection checks).
    • Oracles required to pass verifiers before task acceptance.
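To make the verification setup concrete, here is a minimal sketch of what a deterministic, pytest-style verifier can look like. The artifact path, expected fields, and values are hypothetical illustrations, not the benchmark's actual task contract.

```python
# Hypothetical sketch of a deterministic, pytest-style verifier.
# The artifact path (/workspace/output/report.json) and the checked fields
# and values are illustrative assumptions, not the benchmark's actual contract.
import json
from pathlib import Path

OUTPUT = Path("/workspace/output/report.json")


def test_output_file_exists():
    # The agent must have produced the expected artifact.
    assert OUTPUT.exists(), "agent did not write report.json"


def test_required_fields_and_values():
    # Deterministic check against oracle-known values; no LLM judging involved.
    report = json.loads(OUTPUT.read_text())
    assert {"patient_id", "dosage_mg"} <= set(report)
    assert report["dosage_mg"] == 250  # exact value expected by the oracle solution
```

Because the checks reduce to exact assertions on produced artifacts, the same trajectory always yields the same pass/fail outcome, which is what makes the binary pass-rate metric repeatable across trials.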
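The aggregation and normalized-gain computation can likewise be sketched in a few lines; the helper names and toy trial data below are illustrative assumptions, not code from the released harness.

```python
# Minimal sketch of the aggregation described above: average over trials per
# task, then across tasks; normalized gain = (p_skill - p_vanilla) / (1 - p_vanilla).
# Function names and the toy trial data are illustrative, not from the harness.
from statistics import mean


def pass_rate(results_by_task: dict[str, list[int]]) -> float:
    """Average the binary trial outcomes per task, then average across tasks."""
    return mean(mean(trials) for trials in results_by_task.values())


def normalized_gain(p_skill: float, p_vanilla: float) -> float:
    """Fraction of the remaining headroom (1 - p_vanilla) closed by Skills."""
    if p_vanilla >= 1.0:
        return 0.0
    return (p_skill - p_vanilla) / (1.0 - p_vanilla)


# Toy example: two tasks, five trials each, without and with curated Skills.
vanilla = {"task_a": [1, 0, 0, 1, 0], "task_b": [0, 0, 0, 0, 0]}
skilled = {"task_a": [1, 1, 1, 1, 0], "task_b": [0, 1, 0, 1, 0]}

p_v, p_s = pass_rate(vanilla), pass_rate(skilled)
print(f"vanilla={p_v:.2f}  skilled={p_s:.2f}  gain={normalized_gain(p_s, p_v):.2f}")
```

Normalized gain rescales the raw delta by the headroom left at baseline, so a +10 pp improvement from a 50% baseline (gain 0.2) is weighted less than the same delta from a 90% baseline (gain 1.0).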

Implications for AI Economics

  • Cost-effectiveness and substitution:
    • Skills act as a low-cost, reusable augmentation layer that can substitute for model capacity in many cases. Firms could choose cheaper model/harness subscriptions plus curated Skills over higher-cost frontier models, changing demand for high-capacity model-time.
    • Evaluation suggests a shifted Pareto frontier: better pass rates for a given compute/cost when Skills are used—informing procurement and pricing strategies (e.g., bundled Skill marketplaces vs. raw model access).
  • Market and platform design:
    • High returns to skilled human curation: curated Skills outperform self-authored ones, indicating market opportunities for expert-created Skills, certification services, and marketplaces (quality signals/certification will matter).
    • Heterogeneous ROI across domains implies differentiated demand: domains with large Skill gains (e.g., healthcare) will justify higher payments for vetted Skills and governance; low-gain domains may not.
  • Labor and productivity:
    • Skills lower the technical threshold for agents to execute domain workflows, potentially amplifying worker productivity and changing comparative advantage (e.g., junior staff + curated Skills may substitute for more experienced workers on standardized tasks).
    • Because 16 of 84 tasks showed negative effects, poorly designed or deployed Skills risk productivity regressions; organizations must measure value in situ and guard against misplaced trust in Skills.
  • Platform competition and bundling:
    • Agent harnesses with native Skills integration (e.g., Claude Code) show better utilization—platforms that support Skill APIs and standardized packaging can exert competitive advantage and capture value from Skill ecosystems.
    • Providers may experiment with pricing: charging for premium Skills, curated bundles, or Skill hosting/validation services.
  • Policy, standards, and public goods:
    • Benchmarks like SKILLSBENCH enable standardized ROI estimation for Skills investments; regulators and procurement officers can use such metrics for due diligence in safety-critical domains (healthcare, finance).
    • Public releases of benchmark datasets and verification harnesses lower entry barriers for evaluators and accelerate standard-setting; public-good Skills (open, vetted) could be socially valuable but also create externalities (misuse risks) that warrant governance.
  • Research & measurement recommendations:
    • Economic evaluations should measure cost-per-solved-task (compute + human curation + verification) and marginal returns to Skill investment per domain (a back-of-the-envelope sketch follows this list).
    • Adoption studies should segment by domain, task complexity, and organizational integration costs, because aggregate averages mask high heterogeneity.
    • Monitor complementarities: investments in tooling (harnesses, Skill verifiers) and certification may increase Skill utilization and market value more than marginal model improvements.
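To make the cost-per-solved-task recommendation concrete, a rough back-of-the-envelope sketch is shown below. Every number in it (prices, pass rates, curation costs, reuse counts) is a made-up placeholder, not a figure from the paper.

```python
# Back-of-the-envelope cost-per-solved-task comparison. All numbers below
# (API prices, pass rates, curation costs, reuse counts) are hypothetical
# placeholders, not figures reported in the paper.

def cost_per_solved_task(compute_cost: float, pass_rate: float,
                         amortized_curation: float = 0.0,
                         verification_cost: float = 0.0) -> float:
    """Expected cost per attempt divided by the probability an attempt passes."""
    return (compute_cost + amortized_curation + verification_cost) / pass_rate


# Hypothetical scenario: frontier model without Skills vs. a smaller model
# plus curated Skills (one-time authoring cost spread over many reuses).
frontier_no_skills = cost_per_solved_task(compute_cost=2.00, pass_rate=0.55)
small_with_skills = cost_per_solved_task(
    compute_cost=0.40,
    pass_rate=0.52,
    amortized_curation=500 / 5_000,  # $500 of curation amortized over 5,000 runs
    verification_cost=0.05,
)
print(f"frontier, no Skills: ${frontier_no_skills:.2f} per solved task")
print(f"small + curated Skills: ${small_with_skills:.2f} per solved task")
```

The point of the sketch is the structure of the comparison, not the numbers: curation is a fixed cost whose per-task burden shrinks with reuse, so the substitution case strengthens as a Skill is applied across more runs and domains with large gains.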

Short actionable takeaways for practitioners and researchers:

  • Invest in curated, focused Skills for domain workflows where pass-rate gains are large—these can be high-return relative to buying higher-capacity model access.
  • Do not rely on models to self-author robust procedural knowledge—plan for human curation, quality control, and verification.
  • Use deterministic, execution-based verification (as in SKILLSBENCH) to measure real-world effectiveness before deploying Skills in production, especially in high-stakes domains.

Dataset, code, and evaluation harness released by authors at skillsbench.ai (paper preprint: March 16, 2026).

Assessment

Paper Type: quasi_experimental

Evidence Strength: medium — The benchmark is large, multi-domain, and uses deterministic verifiers and many execution traces, giving strong internal evidence that curated Skills improve benchmark pass rates; however, external validity is limited (labor/economic impacts are inferred rather than observed), curated Skills may reflect researcher selection/tuning, and heterogeneous/negative task-level effects complicate general conclusions about real-world productivity or market outcomes.

Methods Rigor: high — Carefully controlled experimental conditions (baseline vs curated vs self-generated), objective deterministic verifiers, broad task coverage (86 tasks, 11 domains), multiple agent–model configurations, and thousands of trajectories indicate high methodological rigor for measuring agent performance within the benchmark; potential weaknesses include possible selection/authoring bias in curated Skills and absence of field/deployment validation.

Sample: Benchmark dataset of 86 tasks spanning 11 domains, each paired with deterministic pass/fail verifiers and both curated (human-authored) and model-generated Skills; evaluated across seven agent–model configurations with 7,308 total execution trajectories used to compute task pass rates and condition deltas.

Themes: productivity, skills_training, human_ai_collab, adoption, org_design

Identification: Within-task controlled comparisons across three conditions (no Skills, human-curated Skills, model-self-authored Skills) using deterministic automated verifiers to measure pass/fail outcomes; effects estimated from differences in pass rates across conditions for the same tasks, models, and agent configurations (7 agent–model setups, 7,308 execution traces) to isolate the impact of Skills on agent task success.

Generalizability:
  • Benchmark tasks may not reflect the complexity and noise of real-world workflows or human-in-the-loop evaluation.
  • Curated Skills were authored/selected by researchers; quality and transferability of externally produced Skills may vary.
  • Deterministic verifiers capture objective correctness but miss subjective, safety, or long-horizon outcomes important in practice.
  • Models and agent architectures tested may not represent the full set of production models or future larger/specialized models.
  • Economic implications (costs of Skill authoring, market adoption frictions, firm incentives) are not directly measured.
  • Heterogeneous and negative effects on some tasks limit simple aggregation to sector-level productivity forecasts.

Claims (14)

  • Claim: Curated (human-authored) Skills substantially improve agent task success on average (+16.2 percentage points).
    Outcome: Other | Direction: positive | Confidence: high
    Details: task pass rate (percentage of trajectories passing deterministic verifier); n=7,308; +16.2 percentage points; 0.48
  • Claim: Effects of curated Skills are highly heterogeneous across domains (e.g., +4.5 pp in Software Engineering vs. +51.9 pp in Healthcare).
    Outcome: Other | Direction: mixed | Confidence: high
    Details: task pass rate (per-domain average delta); n=7,308; +4.5 pp (Software); +51.9 pp (Healthcare); 0.48
  • Claim: Self-generated (model-authored) Skills provide no average benefit.
    Outcome: Other | Direction: null_result | Confidence: medium
    Details: task pass rate (average delta for self-authored Skills vs. baseline); n=7,308; no average benefit; 0.29
  • Claim: Focused, small Skills (2–3 modules) are more effective than comprehensive documentation-style Skills.
    Outcome: Other | Direction: positive | Confidence: medium
    Details: task pass rate (comparison by Skill granularity); n=7,308; 0.29
  • Claim: Smaller models augmented with curated Skills can match the performance of larger models without Skills (model–skill tradeoff).
    Outcome: Other | Direction: mixed | Confidence: medium
    Details: task pass rate (cross-model comparisons); n=7,308; 0.29
  • Claim: SkillsBench benchmark: evaluates 86 tasks spanning 11 domains with deterministic, automated verifiers.
    Outcome: Other | Direction: null_result | Confidence: high
    Details: benchmark composition and verification method (not an outcome variable); n=86; 0.48
  • Claim: Each task was evaluated under three conditions: (1) no Skills, (2) curated (human-authored) Skills, and (3) self-authored (model-generated) Skills.
    Outcome: Other | Direction: null_result | Confidence: high
    Details: experimental condition (not an outcome variable); n=86; 0.48
  • Claim: Scale of experiments: seven agent–model configurations and 7,308 execution trajectories were used to compute pass rates and deltas.
    Outcome: Other | Direction: null_result | Confidence: high
    Details: sample size / number of trajectories (not an outcome variable); n=7,308; 0.48
  • Claim: Deterministic automated verifiers provide objective pass/fail checks for task success.
    Outcome: Other | Direction: null_result | Confidence: high
    Details: verification result (pass/fail); 0.48
  • Claim: In some tasks, curated Skills worsened performance: 16 of 84 tasks showed negative deltas.
    Outcome: Other | Direction: negative | Confidence: medium
    Details: task pass rate (per-task delta); n=84; 16 of 84 tasks had negative deltas; 0.29
  • Claim: The inability of models to reliably self-author useful Skills implies that models typically cannot produce the procedural knowledge they would benefit from consuming.
    Outcome: Other | Direction: negative | Confidence: medium
    Details: quality/usefulness of model-authored Skills as measured by downstream task pass rate; n=7,308; 0.29
  • Claim: Because curated Skills yield large average gains, human curation of high-quality procedural knowledge has economic value and could be a high-return activity.
    Outcome: Firm Revenue | Direction: positive | Confidence: speculative
    Details: implied economic value / returns to human Skill authoring (inferred, not directly measured); n=7,308; +16.2 pp improvement implies economic value (inferred); 0.05
  • Claim: Focused, modular Skill design favors modular pricing and bundling strategies (i.e., narrow high-impact Skills premium; broad libraries lower margin).
    Outcome: Market Structure | Direction: positive | Confidence: speculative
    Details: market/pricing implications (inferred from effectiveness by Skill granularity); n=7,308; 0.05
  • Claim: Deterministic verifiers and benchmarks like SkillsBench are important for certification and procurement decisions because they enable verifiable, repeatable gains.
    Outcome: Regulatory Compliance | Direction: positive | Confidence: speculative
    Details: reliability/verifiability for procurement (inferred, not directly measured); 0.05

Entities

Skills (method) · Curated Skills (human-authored) (method) · SkillsBench (dataset) · Deterministic verifiers (automated pass/fail checks) (method) · Task pass rate (outcome) · Average pass-rate increase (16.2 percentage points) (outcome) · LLM agents (ai_tool) · Self-generated Skills (model-authored) (method) · 86 benchmark tasks (dataset) · 11 domains (benchmark coverage) (population) · Focused Skills (2–3 modules) (method) · Model–Skill tradeoff (skill augmentation substitutes for model scale) (outcome) · Healthcare domain (population) · Skill engineering (human curation of procedural knowledge) (method) · 7,308 execution trajectories (dataset) · Comprehensive documentation-style Skills (method) · Agent–model configurations (7 configurations) (method) · Software engineering domain (population) · Negative task deltas (16 of 84 tasks where curated Skills decreased performance) (outcome)

Notes