The Commonplace

Structured domain guidance (SKILL.md) reliably boosts LLM agents' telecom operations competence, raising success rates across models by up to ~19 percentage points; MiniMax M2.5 leads at 81.1% success with the skill augmentation.

SKILLS: Structured Knowledge Injection for LLM-Driven Telecommunications Operations
Ivo Brett · March 16, 2026
arXiv · descriptive · medium evidence · 7/10 relevance
Providing LLM agents with a portable SKILL.md containing structured domain guidance substantially improves their ability to complete telecom operations workflows on mock APIs, with consistent performance lifts across five open-weight models.

As telecommunications operators accelerate adoption of AI-enabled automation, a practical question remains unresolved: can general-purpose large language model (LLM) agents reliably execute telecom operations workflows through real API interfaces, or do they require structured domain guidance? We introduce SKILLS (Structured Knowledge Injection for LLM-driven Service Lifecycle operations), a benchmark framework comprising 37 telecom operations scenarios spanning 8 TM Forum Open API domains (TMF620, TMF621, TMF622, TMF628, TMF629, TMF637, TMF639, TMF724). Each scenario is grounded in live mock API servers with seeded production-representative data, MCP tool interfaces, and deterministic evaluation rubrics combining response content checks, tool-call verification, and database state assertions. We evaluate open-weight models under two conditions: baseline (generic agent with tool access but no domain guidance) and with-skill (agent augmented with a portable SKILL.md document encoding workflow logic, API patterns, and business rules). Results across 5 open-weight model conditions and 185 scenario-runs show consistent skill lift across all models. MiniMax M2.5 leads (81.1% with-skill, +13.5pp), followed by Nemotron 120B (78.4%, +18.9pp), GLM-5 Turbo (78.4%, +5.4pp), and Seed 2.0 Lite (75.7%, +18.9pp).

Summary

Main Finding

Structured, portable domain guidance (the SKILL.md artifact) materially improves the ability of general-purpose LLM agents to execute telecom operations workflows through API tool interfaces backed by live mock servers. Across 185 scenario-runs over 37 operational scenarios, every evaluated open-weight model achieved higher end-to-end task success when augmented with SKILL.md than when run as a generic tool-enabled agent.
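
The difference between the two conditions can be pictured as a difference in what the agent is told before it starts. The sketch below is a minimal illustration, not the paper's harness: `BASE_PROMPT`, `build_system_prompt`, and the section header are hypothetical names, and the paper's actual prompts are not reproduced in this summary.

```python
from typing import Optional

# Hypothetical generic instruction; the benchmark's real baseline prompt is not given here.
BASE_PROMPT = "You are an operations agent. Use the available tools to complete the task."


def build_system_prompt(skill_text: Optional[str] = None) -> str:
    """Return the system prompt for the baseline or the with-skill condition.

    skill_text is the content of a SKILL.md file (workflow logic, API patterns,
    business rules). Passing None models the baseline condition.
    """
    if skill_text is None:
        # Baseline: tool access only, no domain guidance.
        return BASE_PROMPT
    # With-skill: the same agent, with the guidance appended at runtime.
    return f"{BASE_PROMPT}\n\n# Domain guidance (SKILL.md)\n{skill_text.strip()}"
```

In practice the guidance would be read from disk, for example `build_system_prompt(Path("SKILL.md").read_text(encoding="utf-8"))`. Because the model weights are unchanged in both conditions, the same file can be reused across models.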

Key Points

  • Problem: Can LLM agents reliably perform telecom operations workflows against real APIs, or do they need explicit domain guidance?
  • Benchmark introduced: SKILLS (Structured Knowledge Injection for LLM-driven Service Lifecycle operations).
  • Coverage: 37 telecom operations scenarios spanning 8 TM Forum Open API domains — TMF620, TMF621, TMF622, TMF628, TMF629, TMF637, TMF639, TMF724.
  • Environment: Live mock API servers seeded with production-representative data, MCP tool interfaces, and deterministic evaluation rubrics.
  • Evaluation conditions:
    • Baseline: generic agent with tool access (no domain guidance).
    • With-skill: same agent augmented with a SKILL.md document encoding workflow logic, API patterns, and business rules.
  • Deterministic evaluation combines:
    • Response content checks,
    • Tool-call verification (correct API calls, params),
    • Database state assertions (post-conditions in seeded DB).
  • Models & high-level results (185 scenario-runs, 5 open-weight model conditions):
    • MiniMax M2.5: 81.1% success with-skill (lift +13.5 percentage points).
    • Nemotron 120B: 78.4% with-skill (lift +18.9 pp).
    • GLM-5 Turbo: 78.4% with-skill (lift +5.4 pp).
    • Seed 2.0 Lite: 75.7% with-skill (lift +18.9 pp).
    • (Five model conditions were evaluated in total; the summary reports the leading four.)
  • Consistent pattern: SKILL augmentation produced a positive and meaningful performance lift across all tested models.
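
Because lifts are reported in percentage points, the baseline success rate for each model follows by subtraction. The sketch below uses only the four with-skill rates and lifts listed above; the function names are illustrative, and the relative-gain figure is a derived quantity that the summary itself does not report.

```python
# (with-skill success %, lift in percentage points), copied from the list above.
REPORTED = {
    "MiniMax M2.5": (81.1, 13.5),
    "Nemotron 120B": (78.4, 18.9),
    "GLM-5 Turbo": (78.4, 5.4),
    "Seed 2.0 Lite": (75.7, 18.9),
}


def implied_baseline(with_skill_pct: float, lift_pp: float) -> float:
    """Baseline success rate implied by a with-skill rate and an absolute lift."""
    return round(with_skill_pct - lift_pp, 1)


def relative_gain_pct(with_skill_pct: float, lift_pp: float) -> float:
    """Lift expressed relative to the baseline (percent, not percentage points)."""
    baseline = with_skill_pct - lift_pp
    return round(100 * lift_pp / baseline, 1)


for model, (with_skill, lift) in REPORTED.items():
    print(model, implied_baseline(with_skill, lift), relative_gain_pct(with_skill, lift))
```

The implied baselines are 67.6%, 59.5%, 73.0%, and 56.8% respectively. The same absolute lift is a larger relative gain for a model with a lower baseline, which is why the two units should not be mixed when comparing models.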

Data & Methods

  • Benchmark design:
    • 37 scenarios reflect common service-lifecycle operational tasks across TM Forum APIs (ordering, management, inventory, notifications, etc.).
    • Each scenario maps to live mock API endpoints that mimic real-world behavior and error modes.
    • The mock servers are seeded with production-representative datasets so agent actions have realistic side effects.
  • Tooling:
    • Agents made API calls through MCP tool interfaces, simulating real operator integration.
    • SKILL.md is a portable document encoding: workflow logic, API call patterns, business rules, expected invariants and error-handling guidance.
  • Evaluation rubric:
    • Deterministic checks verify whether the agent produced the required content, invoked the correct tool calls with the correct parameters, and left the database in the correct final state.
    • Results are reported as scenario success rates; lifts reported are absolute percentage-point differences between baseline and with-skill conditions.
  • Models:
    • Five open-weight model conditions; the four leading results are listed above. All were evaluated without proprietary fine-tuning, with guidance injected via SKILL.md at runtime.
  • Scale:
    • 185 total scenario-runs, a figure consistent with 37 scenarios × 5 model conditions.
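
The three-part rubric described above can be sketched as a single deterministic function. This is a minimal illustration under stated assumptions: the class names, the `create_trouble_ticket` tool, and the rule that all three checks must pass are hypothetical, since the summary says only that the rubric combines them.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ToolCall:
    name: str
    params: dict[str, Any]


@dataclass
class Rubric:
    required_phrases: list[str]        # response content checks
    expected_calls: list[ToolCall]     # tool-call verification
    expected_db_state: dict[str, Any]  # database post-conditions


@dataclass
class RunResult:
    response: str
    calls: list[ToolCall]
    db_state: dict[str, Any]


def evaluate(run: RunResult, rubric: Rubric) -> dict[str, bool]:
    """Score one scenario-run; no model-based judging is involved."""
    text = run.response.lower()
    content_ok = all(p.lower() in text for p in rubric.required_phrases)
    # An expected call is satisfied if some actual call has the same tool name
    # and agrees on every expected parameter (extra parameters are tolerated).
    calls_ok = all(
        any(
            c.name == exp.name
            and all(c.params.get(k) == v for k, v in exp.params.items())
            for c in run.calls
        )
        for exp in rubric.expected_calls
    )
    db_ok = all(run.db_state.get(k) == v for k, v in rubric.expected_db_state.items())
    # Assumption: a run succeeds only if all three checks pass.
    return {
        "content": content_ok,
        "tool_calls": calls_ok,
        "db_state": db_ok,
        "passed": content_ok and calls_ok and db_ok,
    }
```

Checking the database state separately from the response matters because an agent can describe a successful action without having made the call that performs it.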

Implications for AI Economics

  • Value of domain engineering relative to model-only improvements:
    • Providing structured domain guidance (SKILL.md) yields consistent, non-trivial performance gains across multiple LLMs. This implies a potentially high return on investment (ROI) for telecom operators who invest in codifying workflows, API patterns, and business rules as reusable artifacts rather than relying solely on more capable LLMs or costly fine-tuning.
  • Cost trade-offs:
    • Building and maintaining SKILL artifacts and integration tests requires engineering effort and subject-matter expertise. The measured lifts (up to ~19 pp) provide a quantitative basis to compare that engineering cost against alternatives (e.g., model licensing/purchase, fine-tuning, human-in-the-loop labor).
  • Productivity and labor impacts:
    • Improved autonomous execution could reduce routine operator workload and error rates, enabling faster task throughput and lower operational costs, although the benchmark does not measure these outcomes directly. Operators should plan for role shifts (more focus on rule authorship, validation, and oversight).
  • Standardization and portability benefits:
    • A portable SKILL.md format enables reuse across models and deployments; standardizing domain guidance could lower repeated integration costs and create competitive differentiation around tooling and process IP.
  • Deployment risk and compliance:
    • Because success is scored with deterministic tool-call and database-state checks, higher success rates suggest that domain guidance also reduces incorrect API calls, not just poorly worded responses. If that holds in production, it would lower operational risk and regulatory exposure when automating critical telecom workflows.
  • Model selection vs. guidance investment:
    • Smaller or cheaper open models, when combined with strong structured guidance, may approach the performance of larger models. This changes the procurement calculus: operators might prefer investing in domain artifacts over exclusively upgrading to larger model capacity.
  • Next steps for economic modeling:
    • Incorporate benchmark-derived success lifts into cost-benefit models (cost of building/maintaining SKILL vs. labor savings, error-cost reduction, SLA improvements).
    • Evaluate long-run maintenance costs (updates as APIs/business rules change) and measure amortized ROI across deployments.
    • Extend analysis to closed-weight LLMs, higher-scale scenario sets, and real production APIs to refine estimates.
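
A first-pass cost-benefit model along the lines suggested above might look like the sketch below. Every input other than the lift is a hypothetical placeholder to be supplied by the operator, and the model assumes the benchmark lift carries over unchanged to production, which the benchmark does not establish.

```python
from dataclasses import dataclass


@dataclass
class SkillRoiInputs:
    lift_pp: float             # success lift in percentage points (e.g. 13.5)
    tasks_per_year: int        # automated task volume (hypothetical)
    value_per_success: float   # labor and error cost avoided per extra success (hypothetical)
    build_cost: float          # one-off cost to author and test SKILL.md (hypothetical)
    annual_maintenance: float  # yearly cost to keep it current (hypothetical)


def annual_net_benefit(x: SkillRoiInputs) -> float:
    """Yearly value of the extra successful tasks, net of maintenance."""
    extra_successes = x.tasks_per_year * x.lift_pp / 100
    return extra_successes * x.value_per_success - x.annual_maintenance


def payback_years(x: SkillRoiInputs) -> float:
    """Years needed to recover the build cost; infinite if net benefit is not positive."""
    net = annual_net_benefit(x)
    return float("inf") if net <= 0 else x.build_cost / net


example = SkillRoiInputs(
    lift_pp=13.5,
    tasks_per_year=10_000,
    value_per_success=20.0,
    build_cost=50_000.0,
    annual_maintenance=7_000.0,
)
print(annual_net_benefit(example), payback_years(example))
```

With these illustrative inputs the extra 1,350 successful tasks per year are worth 27,000, leaving 20,000 after maintenance and a payback period of 2.5 years. The result is most sensitive to `value_per_success` and to how well the measured lift transfers outside the mock environment.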

Limitations / caveats (economic relevance)

  • Results are from mock API servers with seeded data and open-weight models; real production complexity, streaming loads, and closed-source models may change magnitudes.
  • The benchmark quantifies task success but not end-to-end operational cost savings or long-term maintenance expense; those require deployment-level studies.


Assessment

  • Paper type: descriptive
  • Evidence strength: medium. The study uses a controlled, reproducible benchmark with live mock APIs, seeded realistic data, deterministic rubrics, and multiple open-weight models, which gives credible within-benchmark evidence that structured guidance improves agent performance; however, it does not measure real-world economic outcomes, relies on synthetic/mock servers and a limited set of scenarios/models, and therefore has limited external validity.
  • Methods rigor: high. The authors build a systematic benchmark (37 scenarios across 8 TM Forum API domains), use live mock API servers with representative seeded data, instrument tool-call verification and database state checks, evaluate multiple models under baseline and with-skill conditions, and report quantitative success rates, providing a rigorous and reproducible evaluation of agent capability within the scoped setting.
  • Sample: Benchmark of 37 telecom operations scenarios spanning 8 TM Forum Open API domains (TMF620, TMF621, TMF622, TMF628, TMF629, TMF637, TMF639, TMF724) run against live mock API servers seeded with production-representative data, evaluated across five open-weight LLM conditions with two agent setups (baseline and with SKILL.md) for a total of 185 scenario-runs.
  • Themes: human_ai_collab, productivity, adoption
  • Generalizability:
    • Evaluated on synthetic/mock API servers rather than live production systems.
    • Single industry domain (telecommunications); results may not generalize to other sectors.
    • Only open-weight/public models tested; closed-source/proprietary models not evaluated.
    • 37 scenarios may not cover the full diversity or rare edge cases of telecom operations.
    • Deterministic rubrics and automated checks may not capture human judgment, safety, or operational nuance.
    • Performance in benchmark runs may differ from performance in real-world, human-in-the-loop deployments.

Claims (10)

Columns: Claim · Direction · Confidence · Outcome · Details

  • Claim 1 [Other]: We introduce SKILLS (Structured Knowledge Injection for LLM-driven Service Lifecycle operations), a benchmark framework for telecom operations.
    • Direction: null_result
    • Confidence: high
    • Outcome: existence and definition of the SKILLS benchmark framework
    • Details: 0.18
  • Claim 2 [Other]: SKILLS comprises 37 telecom operations scenarios spanning 8 TM Forum Open API domains (TMF620, TMF621, TMF622, TMF628, TMF629, TMF637, TMF639, TMF724).
    • Direction: null_result
    • Confidence: high
    • Outcome: coverage: number of scenarios (37) and number of API domains (8) included
    • Details: n=37; 0.18
  • Claim 3 [Other]: Each scenario is grounded in live mock API servers with seeded production-representative data, MCP tool interfaces, and deterministic evaluation rubrics combining response content checks, tool-call verification, and database state assertions.
    • Direction: null_result
    • Confidence: high
    • Outcome: evaluation environment fidelity and evaluation criteria (content checks, tool-call verification, DB state assertions)
    • Details: 0.18
  • Claim 4 [Other]: We evaluated open-weight models under two conditions: baseline (generic agent with tool access but no domain guidance) and with-skill (agent augmented with a portable SKILL.md document encoding workflow logic, API patterns, and business rules).
    • Direction: null_result
    • Confidence: high
    • Outcome: experimental condition (baseline vs with-skill)
    • Details: 0.18
  • Claim 5 [Other]: The SKILL.md used in the with-skill condition encodes workflow logic, API patterns, and business rules as portable domain guidance for agents.
    • Direction: null_result
    • Confidence: high
    • Outcome: presence and content type of injected domain guidance (workflow logic, API patterns, business rules)
    • Details: 0.18
  • Claim 6 [Other]: Results across 5 open-weight model conditions and 185 scenario-runs show consistent skill lift across all models.
    • Direction: positive
    • Confidence: high
    • Outcome: skill lift measured as change in task success rate (percentage point improvement) across models
    • Details: n=185; 0.18
  • Claim 7 [Other]: MiniMax M2.5 achieved 81.1% success rate with-skill, an increase of +13.5 percentage points over baseline.
    • Direction: positive
    • Confidence: high
    • Outcome: task success rate (percentage) and absolute percentage-point lift
    • Details: n=185; 13.5 percentage points; 0.18
  • Claim 8 [Other]: Nemotron 120B achieved 78.4% success rate with-skill, an increase of +18.9 percentage points over baseline.
    • Direction: positive
    • Confidence: high
    • Outcome: task success rate (percentage) and absolute percentage-point lift
    • Details: n=185; 18.9 percentage points; 0.18
  • Claim 9 [Other]: GLM-5 Turbo achieved 78.4% success rate with-skill, an increase of +5.4 percentage points over baseline.
    • Direction: positive
    • Confidence: high
    • Outcome: task success rate (percentage) and absolute percentage-point lift
    • Details: Success rate = 78.4%; +5.4 percentage points vs baseline; 0.18
  • Claim 10 [Other]: Seed 2.0 Lite achieved 75.7% success rate with-skill, an increase of +18.9 percentage points over baseline.
    • Direction: positive
    • Confidence: high
    • Outcome: task success rate (percentage) and absolute percentage-point lift
    • Details: Success rate = 75.7%; +18.9 percentage points vs baseline; 0.18

Notes