Structured domain guidance (SKILL.md) reliably boosts LLM agents' telecom operations competence, raising success rates across models by up to ~19 percentage points; MiniMax M2.5 leads at 81.1% success with the skill augmentation.
As telecommunications operators accelerate adoption of AI-enabled automation, a practical question remains unresolved: can general-purpose large language model (LLM) agents reliably execute telecom operations workflows through real API interfaces, or do they require structured domain guidance? We introduce SKILLS (Structured Knowledge Injection for LLM-driven Service Lifecycle operations), a benchmark framework comprising 37 telecom operations scenarios spanning 8 TM Forum Open API domains (TMF620, TMF621, TMF622, TMF628, TMF629, TMF637, TMF639, TMF724). Each scenario is grounded in live mock API servers with seeded production-representative data, MCP tool interfaces, and deterministic evaluation rubrics combining response content checks, tool-call verification, and database state assertions. We evaluate open-weight models under two conditions: baseline (generic agent with tool access but no domain guidance) and with-skill (agent augmented with a portable SKILL.md document encoding workflow logic, API patterns, and business rules). Results across 5 open-weight model conditions and 185 scenario-runs show consistent skill lift across all models. MiniMax M2.5 leads (81.1% with-skill, +13.5pp), followed by Nemotron 120B (78.4%, +18.9pp), GLM-5 Turbo (78.4%, +5.4pp), and Seed 2.0 Lite (75.7%, +18.9pp).
Summary
Main Finding
Structured, portable domain guidance (the SKILL.md artifact) materially improves the ability of general-purpose LLM agents to execute telecom operations workflows through live (mock) API interfaces. Across 185 scenario-runs over 37 operational scenarios, every evaluated open-weight model achieved higher end-to-end task success when augmented with SKILL.md than when run as a generic tool-enabled agent.
Key Points
- Problem: Can LLM agents reliably perform telecom operations workflows against real APIs, or do they need explicit domain guidance?
- Benchmark introduced: SKILLS (Structured Knowledge Injection for LLM-driven Service Lifecycle operations).
- Coverage: 37 telecom operations scenarios spanning 8 TM Forum Open API domains — TMF620, TMF621, TMF622, TMF628, TMF629, TMF637, TMF639, TMF724.
- Environment: Live mock API servers seeded with production-representative data, MCP tool interfaces, and deterministic evaluation rubrics.
- Evaluation conditions:
  - Baseline: generic agent with tool access (no domain guidance).
  - With-skill: same agent augmented with a SKILL.md document encoding workflow logic, API patterns, and business rules.
- Deterministic evaluation combines:
  - Response content checks,
  - Tool-call verification (correct API calls, params),
  - Database state assertions (post-conditions in seeded DB).
- Models & high-level results (185 scenario-runs, 5 open-weight model conditions):
  - MiniMax M2.5: 81.1% success with-skill (lift +13.5 percentage points).
  - Nemotron 120B: 78.4% with-skill (lift +18.9 pp).
  - GLM-5 Turbo: 78.4% with-skill (lift +5.4 pp).
  - Seed 2.0 Lite: 75.7% with-skill (lift +18.9 pp).
  - (Five model conditions were evaluated in total; the summary reports the leading four.)
- Consistent pattern: SKILL.md augmentation produced a positive lift for every tested model, ranging from +5.4 to +18.9 percentage points among the four models with reported figures.
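The reported percentages can be cross-checked with a little arithmetic. The sketch below assumes each model condition was run once on each of the 37 scenarios (an inference from 37 × 5 = 185, not something the summary states outright) and converts each with-skill rate and lift into implied scenario counts and an implied baseline rate. The function name `implied_counts` is illustrative only.

```python
# Reported with-skill success rates (%) and lifts (percentage points).
REPORTED = {
    "MiniMax M2.5": (81.1, 13.5),
    "Nemotron 120B": (78.4, 18.9),
    "GLM-5 Turbo": (78.4, 5.4),
    "Seed 2.0 Lite": (75.7, 18.9),
}

# Assumption: one run per scenario per model condition (37 x 5 = 185).
N_SCENARIOS = 37


def implied_counts(with_skill_pct, lift_pp, n=N_SCENARIOS):
    """Return (with-skill passes, baseline passes) implied by the percentages."""
    with_skill = round(with_skill_pct / 100 * n)
    gained = round(lift_pp / 100 * n)
    return with_skill, with_skill - gained


for model, (pct, lift) in REPORTED.items():
    passed, base = implied_counts(pct, lift)
    print(
        f"{model}: {passed}/{N_SCENARIOS} with-skill, "
        f"{base}/{N_SCENARIOS} baseline ({100 * base / N_SCENARIOS:.1f}%)"
    )
```

Under that assumption every reported figure maps to a whole number of scenarios (for example, 81.1% is 30 of 37, and +13.5 pp is 5 scenarios), which suggests the lifts are best read as differences of a handful of scenarios per model rather than as high-precision estimates.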
Data & Methods
- Benchmark design:
  - 37 scenarios reflect common service-lifecycle operational tasks across TM Forum APIs (ordering, management, inventory, notifications, etc.).
  - Each scenario maps to live mock API endpoints that mimic real-world behavior and error modes.
  - The mock servers are seeded with production-representative datasets so agent actions have realistic side effects.
- Tooling:
  - Agents made API calls through MCP tool interfaces, simulating real operator integration.
  - SKILL.md is a portable document encoding workflow logic, API call patterns, business rules, expected invariants, and error-handling guidance.
- Evaluation rubric:
  - Deterministic checks test whether the agent produced the required content, invoked the correct tool calls with the correct parameters, and left the database in the correct final state.
  - Results are reported as scenario success rates; lifts are absolute percentage-point differences between the baseline and with-skill conditions.
- Models:
  - Five open-weight model conditions (results for four are reported above). All were evaluated without proprietary fine-tuning; guidance was injected via SKILL.md at runtime.
- Scale:
  - 185 total scenario-runs, consistent with 37 scenarios × 5 model conditions (one run per scenario per condition).
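The three-part deterministic rubric described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual evaluator: the `Rubric` and `ToolCall` types, the tool name `create_trouble_ticket`, and the flat key/value view of database state are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str     # hypothetical tool name exposed to the agent
    params: dict  # parameters the agent passed


@dataclass
class Rubric:
    required_phrases: list = field(default_factory=list)   # response content checks
    expected_calls: list = field(default_factory=list)     # tool-call verification
    expected_db_state: dict = field(default_factory=dict)  # database post-conditions


def evaluate(rubric, response, calls, db_state):
    """Pass a scenario only if all three deterministic checks succeed."""
    text = response.lower()
    content_ok = all(p.lower() in text for p in rubric.required_phrases)
    # Each expected call must match some observed call on name and on every
    # expected parameter (extra parameters on the observed call are allowed).
    calls_ok = all(
        any(
            c.name == e.name
            and all(c.params.get(k) == v for k, v in e.params.items())
            for c in calls
        )
        for e in rubric.expected_calls
    )
    db_ok = all(db_state.get(k) == v for k, v in rubric.expected_db_state.items())
    return {
        "content": content_ok,
        "tool_calls": calls_ok,
        "db_state": db_ok,
        "passed": content_ok and calls_ok and db_ok,
    }


# Hypothetical scenario: the agent should open a trouble ticket.
rubric = Rubric(
    required_phrases=["ticket created"],
    expected_calls=[ToolCall("create_trouble_ticket", {"severity": "major"})],
    expected_db_state={"ticket:1001:status": "acknowledged"},
)
result = evaluate(
    rubric,
    response="Ticket created with ID 1001.",
    calls=[ToolCall("create_trouble_ticket", {"severity": "major", "name": "Link down"})],
    db_state={"ticket:1001:status": "acknowledged"},
)
print(result)
```

Requiring all three checks to pass means a fluent but incorrect answer (right wording, wrong API call or wrong final state) counts as a failure, which is the point of combining content checks with tool-call and state assertions.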
Implications for AI Economics
- Value of domain engineering > model-only improvements:
  - Providing structured domain guidance (SKILL.md) yields consistent, non-trivial performance gains across multiple LLMs. This implies a potentially high return on investment (ROI) for telecom operators who invest in codifying workflows, API patterns, and business rules as reusable artifacts rather than relying solely on more capable LLMs or costly fine-tuning.
- Cost trade-offs:
  - Building and maintaining SKILL artifacts and integration tests requires engineering effort and subject-matter expertise. The measured lifts (up to ~19 pp) provide a quantitative basis to compare that engineering cost against alternatives (e.g., model licensing/purchase, fine-tuning, human-in-the-loop labor).
- Productivity and labor impacts:
  - Improved autonomous execution could reduce routine operator workload and error rates, enabling faster task throughput and lower operational costs; the benchmark itself does not measure these effects. Operators should also plan for role shifts (more focus on rule authorship, validation, and oversight).
- Standardization and portability benefits:
  - A portable SKILL.md format enables reuse across models and deployments; standardizing domain guidance could lower repeated integration costs and create competitive differentiation around tooling and process IP.
- Deployment risk and compliance:
  - Because the rubric includes tool-call and database-state checks, a higher success rate under guidance implies fewer runs with incorrect API calls or incorrect final state. This may lower operational risk and regulatory exposure when automating critical telecom workflows.
- Model selection vs. guidance investment:
  - With structured guidance, the four reported models land within about 5 percentage points of one another (75.7% to 81.1%), which suggests that smaller or cheaper open models can approach the performance of larger ones. This changes the procurement calculus: operators might prefer investing in domain artifacts over exclusively upgrading to larger model capacity.
- Next steps for economic modeling:
  - Incorporate benchmark-derived success lifts into cost-benefit models (cost of building/maintaining SKILL vs. labor savings, error-cost reduction, SLA improvements).
  - Evaluate long-run maintenance costs (updates as APIs/business rules change) and measure amortized ROI across deployments.
  - Extend analysis to closed-weight LLMs, higher-scale scenario sets, and real production APIs to refine estimates.
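As a starting point for the cost-benefit modeling suggested above, the sketch below plugs a measured lift into a deliberately simple linear model. Everything except the lift is a hypothetical placeholder (task volume, cost per failed task, build and maintenance costs, amortization period), and the function name `annual_net_benefit` is illustrative; a real model would also need to account for the caveat that mock-API success rates may not transfer to production.

```python
def annual_net_benefit(
    tasks_per_year,
    lift_pp,
    cost_per_failed_task,
    build_cost,
    annual_maintenance_cost,
    amortization_years=3,
):
    """Net yearly benefit of SKILL.md guidance under a simple linear model.

    lift_pp is the success-rate lift in percentage points (e.g. 18.9).
    All monetary inputs are in the same (arbitrary) currency unit.
    """
    if amortization_years <= 0:
        raise ValueError("amortization_years must be positive")
    # Additional tasks completed autonomously thanks to the guidance.
    extra_successes = tasks_per_year * lift_pp / 100
    # Each avoided failure saves the cost of manual handling or rework.
    gross_savings = extra_successes * cost_per_failed_task
    # Spread the one-off build cost over the amortization period.
    yearly_cost = build_cost / amortization_years + annual_maintenance_cost
    return gross_savings - yearly_cost


# Hypothetical inputs; only the 18.9 pp lift comes from the benchmark.
net = annual_net_benefit(
    tasks_per_year=10_000,
    lift_pp=18.9,
    cost_per_failed_task=20.0,
    build_cost=30_000.0,
    annual_maintenance_cost=5_000.0,
)
print(f"Estimated net annual benefit: {net:,.0f}")
```

The model treats every avoided failure as worth the same amount and ignores model inference costs; both simplifications are easy to relax once deployment-level data is available.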
Limitations / caveats (economic relevance)
- Results are from mock API servers with seeded data and open-weight models; real production complexity, streaming loads, and closed-source models may change magnitudes.
- The benchmark quantifies task success but not end-to-end operational cost savings or long-term maintenance expense; those require deployment-level studies.
Assessment
Claims (10)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| We introduce SKILLS (Structured Knowledge Injection for LLM-driven Service Lifecycle operations), a benchmark framework for telecom operations. (Other) | null_result | high | existence and definition of the SKILLS benchmark framework | 0.18 |
| SKILLS comprises 37 telecom operations scenarios spanning 8 TM Forum Open API domains (TMF620, TMF621, TMF622, TMF628, TMF629, TMF637, TMF639, TMF724). (Other) | null_result | high | coverage: number of scenarios (37) and number of API domains (8) included | n=37; 0.18 |
| Each scenario is grounded in live mock API servers with seeded production-representative data, MCP tool interfaces, and deterministic evaluation rubrics combining response content checks, tool-call verification, and database state assertions. (Other) | null_result | high | evaluation environment fidelity and evaluation criteria (content checks, tool-call verification, DB state assertions) | 0.18 |
| We evaluated open-weight models under two conditions: baseline (generic agent with tool access but no domain guidance) and with-skill (agent augmented with a portable SKILL.md document encoding workflow logic, API patterns, and business rules). (Other) | null_result | high | experimental condition (baseline vs with-skill) | 0.18 |
| The SKILL.md used in the with-skill condition encodes workflow logic, API patterns, and business rules as portable domain guidance for agents. (Other) | null_result | high | presence and content type of injected domain guidance (workflow logic, API patterns, business rules) | 0.18 |
| Results across 5 open-weight model conditions and 185 scenario-runs show consistent skill lift across all models. (Other) | positive | high | skill lift measured as change in task success rate (percentage point improvement) across models | n=185; 0.18 |
| MiniMax M2.5 achieved 81.1% success rate with-skill, an increase of +13.5 percentage points over baseline. (Other) | positive | high | task success rate (percentage) and absolute percent-point lift | n=185; 13.5 percentage points; 0.18 |
| Nemotron 120B achieved 78.4% success rate with-skill, an increase of +18.9 percentage points over baseline. (Other) | positive | high | task success rate (percentage) and absolute percent-point lift | n=185; 18.9 percentage points; 0.18 |
| GLM-5 Turbo achieved 78.4% success rate with-skill, an increase of +5.4 percentage points over baseline. (Other) | positive | high | task success rate (percentage) and absolute percent-point lift | Success rate = 78.4%; +5.4 percentage points vs baseline; 0.18 |
| Seed 2.0 Lite achieved 75.7% success rate with-skill, an increase of +18.9 percentage points over baseline. (Other) | positive | high | task success rate (percentage) and absolute percent-point lift | Success rate = 75.7%; +18.9 percentage points vs baseline; 0.18 |