Human-crafted 'skills' let AI agents reliably program real IoT devices, whereas out-of-the-box LLM skills often fail due to timing and hardware-specific constraints; hardware-in-the-loop validation across platforms shows human expertise remains crucial for embedded software success.

Skilled AI Agents for Embedded and IoT Systems Development

Yiming Li, Yuhan Cheng, Mingchen Ma, Yihang Zou, Ningyuan Yang, Wei Cheng, Hai "Helen" Li, Yiran Chen, Tingjun Chen · March 20, 2026

arXiv · quasi-experimental · medium evidence · 7/10 relevance

In hardware-in-the-loop embedded IoT development, concise human-expert 'skills' integrated into agentic systems produce near-perfect task success rates, while LLM-generated skills and no-skills configurations perform substantially worse when validated on real devices.

Large language models (LLMs) and agentic systems have shown promise for automated software development, but applying them to hardware-in-the-loop (HIL) embedded and Internet-of-Things (IoT) systems remains challenging due to the tight coupling between software logic and physical hardware behavior. Code that compiles successfully may still fail when deployed on real devices because of timing constraints, peripheral initialization requirements, or hardware-specific behaviors. To address this challenge, we introduce a skills-based agentic framework for HIL embedded development together with IoT-SkillsBench, a benchmark designed to systematically evaluate AI agents in real embedded programming environments. IoT-SkillsBench spans three representative embedded platforms, 23 peripherals, and 42 tasks across three difficulty levels, where each task is evaluated under three agent configurations (no-skills, LLM-generated skills, and human-expert skills) and validated through real hardware execution. Across 378 hardware validated experiments, we show that concise human-expert skills with structured expert knowledge enable near-perfect success rates across platforms.

Summary

Main Finding

Concise, human-curated "skills" (small, structured documents encoding peripheral initialization patterns, timing constraints, and known failure modes) enable AI agents to generate reliable firmware for real embedded/IoT hardware. Across 378 hardware-in-the-loop (HIL) experiments (3 platforms × 42 tasks × 3 skill conditions), human-expert skills produced near-perfect success (Arduino 42/42, ESP-IDF 41/42, Zephyr 41/42), while raw LLM knowledge (no skills) and LLM-synthesized skills were inconsistent, and LLM-synthesized skills sometimes degraded performance. Skill quality and grounding in real hardware behavior matter more than merely supplying more documentation or larger models.

Key Points

Benchmark and scale
- IoT-SkillsBench: 3 platform-framework pairs (ATmega2560 + Arduino, ESP32-S3 + ESP-IDF, nRF52840 + Zephyr), 23 peripherals, 42 tasks across 3 difficulty levels (basic GPIO/ADC/UART; protocol-level I2C/SPI/1-Wire; system-level interrupts and multi-device integration).
- Total experiments: 378 HIL-validated instances (42 tasks × 3 platforms × 3 skill conditions), with up to five generation attempts per instance.
Agent & skill conditions
- Agent: minimal 3-node pipeline (manager, coder, assembler) using Claude Sonnet (4.5) as backbone.
- Skill configurations: no-skills (baseline), LLM-generated skills (Claude Sonnet 4.6 synthesized), and human-expert skills (concise, focused, grounded in observed failure modes).
Outcome taxonomy and metrics
- Outcomes per attempt: Compile Failure (CF), Behavior Failure (BF), Behavior Correct (BC).
- Aggregates: pass@1 (BC on first attempt) and pass@5 (BC in up to five attempts).
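
The two aggregates can be computed directly from per-attempt outcome labels. The following is a minimal sketch, assuming each task's attempts are stored as an ordered list of `CF`/`BF`/`BC` labels; the function names and the example outcomes are hypothetical and are not data from the paper. It uses the empirical definition given above (any `BC` within the first k attempts), not a statistical pass@k estimator.

```python
from collections import Counter

# Outcome labels from the taxonomy above.
CF, BF, BC = "CF", "BF", "BC"


def pass_at_k(attempts, k):
    """Return True if any of the first k attempts is Behavior Correct."""
    return BC in attempts[:k]


def summarize(runs, k_values=(1, 5)):
    """Aggregate pass@k and outcome counts.

    `runs` maps a task name to its ordered list of attempt outcomes.
    """
    summary = {}
    for k in k_values:
        passed = sum(pass_at_k(attempts, k) for attempts in runs.values())
        summary[f"pass@{k}"] = passed / len(runs)
    summary["outcomes"] = Counter(
        outcome for attempts in runs.values() for outcome in attempts
    )
    return summary


# Hypothetical outcomes for three tasks (illustrative only).
example_runs = {
    "blink_led": [BC, BC, BC, BC, BC],
    "i2c_sensor": [BF, CF, BC, BC, BF],
    "encoder_irq": [BF, BF, BF, CF, BF],
}
result = summarize(example_runs)
```

In this made-up example, one of three tasks passes on the first attempt and two of three pass within five attempts, which illustrates how pass@5 can mask first-attempt unreliability.
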
Empirical results
- No-skills: good on simple tasks for well-documented platforms (Arduino & ESP-IDF Level-1), but performance degrades on protocol/system-level tasks and on less-represented frameworks (Zephyr).
- LLM-generated skills: inconsistent — sometimes neutral, sometimes harmful (reinforcing incorrect platform assumptions); highest token consumption.
- Human-expert skills: consistently best performance and moderate token overhead; failures that remain are due to irreducible hardware ambiguities (e.g., voltage incompatibility, nonstandard encoder behavior).
Token usage (per-task averages reported)
- No-skills: ~300 input tokens, ~1,200 output tokens.
- LLM-generated skills: ~8,500–9,500 input tokens, ~1,500–2,000 output tokens.
- Human-expert skills: ~650–2,900 input tokens, ~1,700–4,600 output tokens.
- LLM-generated skills increased token cost substantially and did not reliably improve correctness.
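
To make the relative overhead concrete, the reported averages can be expressed as multiples of the no-skills baseline. This is a rough sketch using only the approximate figures listed above; the dictionary layout and function name are assumptions made for illustration, and ranges are kept as (low, high) pairs.

```python
# Approximate per-task token averages as listed above, as (low, high) pairs.
TOKENS = {
    "no-skills": {"input": (300, 300), "output": (1200, 1200)},
    "llm-generated": {"input": (8500, 9500), "output": (1500, 2000)},
    "human-expert": {"input": (650, 2900), "output": (1700, 4600)},
}


def ratio_to_baseline(condition, kind, baseline="no-skills"):
    """Return the (low, high) multiple of baseline tokens for `kind`."""
    base = TOKENS[baseline][kind][0]
    low, high = TOKENS[condition][kind]
    return low / base, high / base


llm_input = ratio_to_baseline("llm-generated", "input")
expert_input = ratio_to_baseline("human-expert", "input")
print(f"LLM-generated input: {llm_input[0]:.1f}x to {llm_input[1]:.1f}x baseline")
print(f"Human-expert input:  {expert_input[0]:.1f}x to {expert_input[1]:.1f}x baseline")
```

On these figures, LLM-generated skills use roughly 28 to 32 times the baseline input tokens, while human-expert skills use roughly 2 to 10 times.
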

Data & Methods

Platforms & toolchains
- Arduino Mega 2560 Rev3 (ATmega2560) with Arduino CLI v1.4.1; ESP32-S3-BOX-3 with ESP-IDF v5.1.2; Arduino Nano 33 BLE Rev2 (nRF52840) with Zephyr via nRF Connect SDK v2.7.0.
Tasks and peripherals
- 42 tasks hand-designed and hardware-validated; tasks include exact pin mappings to avoid extra mapping ambiguity.
- Peripherals cover simple actuators/sensors and complex buses (I2C, SPI, UART, ADC, GPIO, encoders, RTCs, etc.).
Skills generation and format
- LLM-produced skills: generated by prompting a modern LLM (Claude Sonnet 4.6) to summarize embedded knowledge from its latent parametric memory.
- Human-expert skills: authored by embedded engineers who were given LLM outputs, error logs, and runtime observations to ground skill content; each skill kept concise, focused, and in a YAML+Markdown format for efficient retrieval.
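
The summary does not reproduce an actual skill file, so the following is only a sketch of what a YAML+Markdown skill and a header parser might look like. The field names (`name`, `platform`, `peripheral`, `summary`), the example skill, and the `parse_skill` helper are all hypothetical. The parser handles flat `key: value` headers only; a full YAML library would be needed for anything nested.

```python
# Hypothetical skill file: a YAML-style header followed by a Markdown body.
SKILL_TEXT = """---
name: ds18b20-onewire
platform: zephyr
peripheral: DS18B20
summary: Timing notes for 1-Wire temperature reads
---
## Known failure modes
- Reading the scratchpad before conversion completes returns stale data.
  A 12-bit conversion can take up to 750 ms.
"""


def parse_skill(text):
    """Split a skill file into (header dict, Markdown body).

    Only flat `key: value` header lines are supported (an assumption
    made to keep this sketch free of third-party dependencies).
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("missing front matter")
    try:
        end = lines.index("---", 1)  # closing delimiter of the header
    except ValueError:
        raise ValueError("unterminated front matter") from None
    header = {}
    for line in lines[1:end]:
        if not line.strip():
            continue
        key, _, value = line.partition(":")
        header[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:]).strip()
    return header, body


header, body = parse_skill(SKILL_TEXT)
```

Keeping the header small and separate from the body is what would let an agent decide whether a skill is relevant without loading its full text.
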
Agent pipeline and evaluation
- LangGraph-based 3-node agent: manager (select skills), coder (generate firmware), assembler (project scaffolding and build files).
- Five independent generation attempts per task; identical compilation commands and toolchains used; flashed to real hardware; human-in-the-loop validation of hardware behavior.
- Human validation consumed ~100 hours.
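
The control flow of the three-node agent can be sketched as follows. The paper's agent is built on LangGraph; this example deliberately uses plain Python functions as a stand-in so that it runs without dependencies, and it is not the authors' implementation. The `State` fields, the skill dictionary keys, the `src/main.c` path, and the injected `llm` callable are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class State:
    task: str
    platform: str
    skills: list = field(default_factory=list)
    firmware: str = ""
    files: dict = field(default_factory=dict)


def manager(state, library):
    """Select skills whose platform matches and whose peripheral is named in the task."""
    state.skills = [
        skill for skill in library
        if skill["platform"] == state.platform
        and skill["peripheral"].lower() in state.task.lower()
    ]
    return state


def coder(state, llm):
    """Build a prompt from the selected skills plus the task and call the model."""
    prompt = "\n\n".join([skill["body"] for skill in state.skills] + [state.task])
    state.firmware = llm(prompt)
    return state


def assembler(state):
    """Place the generated source into a project layout (hypothetical path)."""
    state.files = {"src/main.c": state.firmware}
    return state


def run_pipeline(task, platform, library, llm):
    state = State(task=task, platform=platform)
    state = manager(state, library)
    state = coder(state, llm)
    return assembler(state)


# Illustrative run with a stub in place of a real model call.
library = [
    {"platform": "zephyr", "peripheral": "DS18B20", "body": "Wait for conversion before reading."},
    {"platform": "arduino", "peripheral": "DS18B20", "body": "Arduino-specific note."},
]
fake_llm = lambda prompt: f"/* generated from {len(prompt)} prompt chars */"
final = run_pipeline("Read temperature from a DS18B20 on pin D2", "zephyr", library, fake_llm)
```

In the evaluated setup, compilation, flashing, and human validation of hardware behavior would follow the assembler step; those stages are outside this sketch.
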
Measurements recorded
- CF/BF/BC per attempt, pass@1, pass@5; input/output token counts per model call; qualitative notes on failure causes.

Implications for AI Economics

Cost vs. quality trade-off: tokens, expertise, and deployment risk
- Token/compute costs are nontrivial and can rise sharply when the agent must process large, noisy documents or LLM-synthesized skill sets (LLM-generated skills used an order of magnitude more input tokens than no-skills). Higher token use increases per-task monetary compute cost and latency.
- Human-expert skill creation is an upfront labor cost but yields much higher reliability and lower overall wasted cycles (fewer failed flashes, less debugging, fewer hardware cycles). Economically, investing in curated domain knowledge can reduce downstream operational costs and failure risks.
Scalability and marginal cost of adding domains
- Skills-based architecture scales: adding a new platform or peripheral requires authoring a focused skill rather than retraining models or expanding prompts with long documents. The marginal cost is the expert time to author and validate a small skill file; marginal benefits are typically high if the skill is grounded.
- This suggests a favorable ROI for centralized or marketplace-style repositories of curated skills (one-time curation cost, many reuses across projects).
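
The one-time-cost, many-reuses argument can be framed as a break-even calculation. The sketch below is a back-of-the-envelope model, not something taken from the paper: the function, its parameters, and every number in the example are hypothetical placeholders.

```python
import math


def break_even_reuses(authoring_hours, hourly_rate, saved_hours_per_project,
                      extra_token_cost_per_project=0.0):
    """Number of projects after which a curated skill pays for itself.

    Returns None when a reuse produces no net saving.
    """
    upfront = authoring_hours * hourly_rate
    saving = saved_hours_per_project * hourly_rate - extra_token_cost_per_project
    if saving <= 0:
        return None
    return math.ceil(upfront / saving)


# Hypothetical inputs: 4 h of expert authoring at 100 per hour,
# 0.5 h of debugging saved per project, 0.05 extra token spend per project.
reuses = break_even_reuses(4, 100, 0.5, 0.05)
```

With these placeholder inputs the skill pays for itself after nine reuses; the point of the model is the shape of the trade-off, not the specific figure.
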
Labor dynamics and specialization
- The value-added work shifts: routine firmware coding for well-documented tasks may be automated, but expert engineers remain crucial for creating, validating, and maintaining skills, and for resolving hardware ambiguities that cannot be fixed in software alone.
- Firms may observe a reallocation of engineer time from routine implementation to knowledge engineering, validation, integration, and resolving hardware-edge cases.
Product & service opportunities
- Market for curated skills: companies can monetize high-quality skill libraries (peripheral/platform bundles) or offer subscription services that supply validated skills and HIL testing integration.
- Tooling and auditing services: HIL-validated agent outputs and assurance services (e.g., guaranteed reductions in BF/CF rates) can command premium pricing in safety-critical or production deployments.
Design implications for AI tooling and pricing
- Pricing models for AI-assisted firmware development should account for:
  - Token costs (higher when skills are large/noisy).
  - Expert curation costs (upfront).
  - HIL testing costs (hardware usage, human validation).
  - Liability/risk premiums for deployments where hardware failures are costly.
- Token-frugal architectures (skill retrieval by header, concise prose, modular skills) provide good economics: lower compute spend while preserving correctness.
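
One way to read "skill retrieval by header" is that the agent filters on small header fields and loads skill bodies only while a token budget allows. The sketch below illustrates that idea under stated assumptions: the skill dictionary keys, the budget policy (smallest matching skill first), and the four-characters-per-token heuristic are all hypothetical and are not taken from the paper.

```python
def estimate_tokens(text):
    """Rough token estimate (assumption: about 4 characters per token)."""
    return max(1, len(text) // 4)


def select_skills(task, platform, skills, budget_tokens):
    """Pick matching skills using header fields only, within a token budget.

    Returns (list of chosen skill names, estimated tokens used).
    """
    task_lower = task.lower()
    matches = [
        skill for skill in skills
        if skill["platform"] == platform
        and skill["peripheral"].lower() in task_lower
    ]
    # Prefer the most concise skills so the budget covers as many as possible.
    matches.sort(key=lambda skill: estimate_tokens(skill["body"]))
    chosen, used = [], 0
    for skill in matches:
        cost = estimate_tokens(skill["body"])
        if used + cost > budget_tokens:
            break
        chosen.append(skill["name"])
        used += cost
    return chosen, used


# Hypothetical skill library with bodies of different lengths.
skills = [
    {"name": "a", "platform": "zephyr", "peripheral": "bme280", "body": "x" * 400},
    {"name": "b", "platform": "zephyr", "peripheral": "bme280", "body": "x" * 2000},
    {"name": "c", "platform": "arduino", "peripheral": "bme280", "body": "x" * 40},
]
chosen, used = select_skills("Read BME280 over I2C", "zephyr", skills, 300)
```

Here only the concise Zephyr skill fits the 300-token budget; the longer one is skipped and the Arduino one is never considered.
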
Policy and investment considerations
- Standards and open ecosystems: public, vetted skill repositories (open-source) could reduce overall industry verification costs and speed adoption — but require incentives for experts to contribute.
- R&D priorities: investing in methods to compress/verify procedural knowledge, human-in-the-loop HIL feedback loops, and standardized skill formats can amplify economic value by reducing per-deployment verification time.
Limits that affect economic value
- Hardware-intrinsic failures (voltage incompatibilities, non-standard sensor behavior) are not solvable by firmware alone; automated agents cannot fully substitute for physical testing and domain expertise in some cases.
- Over-reliance on LLM-generated procedural knowledge without grounding risks wasted compute and increased debugging costs.

Practical takeaways for decision-makers
- If you plan to deploy AI agents for embedded development, budget for expert skill curation and HIL validation; these are cost-effective investments that substantially reduce downstream failures.
- Favor modular, concise skill representations and retrieval-first architectures to control token costs.
- Consider business models offering curated skill libraries plus HIL validation as a premium offering to industrial IoT customers where failures are costly.


Assessment

Paper Type: quasi_experimental

Evidence Strength: medium. The study reports a large number of hardware-validated trials (378) across multiple platforms and peripherals, providing strong internal evidence that human-expert skills materially improve success rates; however, external validity is limited by the selected platforms, tasks, and possibly bespoke human skill engineering, and the setup does not establish broader economic impacts or randomized causal identification beyond controlled comparisons.

Methods Rigor: medium. Methodological strengths include systematic task design across difficulty levels, hardware-in-the-loop validation, and multiple agent configurations; weaknesses include potential selection bias in tasks and platforms, unclear details about LLM models and tuning, reliance on human-engineered skills (which may be idiosyncratic), and limited information on replication protocols and statistical testing.

Sample: Benchmark consists of 3 representative embedded platforms, 23 peripherals, and 42 tasks spanning three difficulty levels; each task is evaluated under three agent configurations (no-skills, LLM-generated skills, human-expert skills), yielding 378 hardware-validated experiments; paper does not report large-scale field deployment or workforce-level data.

Themes: human_ai_collab, productivity

Identification: Controlled benchmark experiments that compare three agent configurations (no-skills, LLM-generated skills, human-expert skills) on the same set of tasks and hardware; causal inference comes from within-task, cross-configuration comparisons with real hardware validation (no randomization or external instrumental variation reported).

Generalizability
- Limited to the three specific embedded platforms and 23 peripherals tested; results may not hold for other hardware architectures or industrial controllers.
- 42 tasks may not cover the full range of real-world embedded programming problems (complex OEM stacks, safety-critical systems, legacy hardware).
- Human-expert skills were hand-crafted and may not scale or be reproducible across teams or organizations.
- Performance depends on the specific LLM(s) and agent implementation used; results may change with different models or toolchains.
- Benchmarks focus on correctness/success in task execution, not downstream economic outcomes like productivity at scale or time-to-market.

Claims (6)

Claim	Category	Direction	Confidence	Outcome	Details
Large language models (LLMs) and agentic systems have shown promise for automated software development.	Developer Productivity	positive	high	automation-assisted software development capability	0.08
Applying them to hardware-in-the-loop (HIL) embedded and Internet-of-Things (IoT) systems remains challenging due to the tight coupling between software logic and physical hardware behavior; code that compiles successfully may still fail when deployed on real devices because of timing constraints, peripheral initialization requirements, or hardware-specific behaviors.	Error Rate	negative	high	code failure / runtime correctness when deployed to hardware	0.24
We introduce a skills-based agentic framework for HIL embedded development together with IoT-SkillsBench, a benchmark designed to systematically evaluate AI agents in real embedded programming environments.	Research Productivity	null_result	high	availability of a skills-based agentic framework and benchmark	0.48
IoT-SkillsBench spans three representative embedded platforms, 23 peripherals, and 42 tasks across three difficulty levels.	Research Productivity	null_result	high	benchmark scope (platforms, peripherals, tasks, difficulty levels)	n=42 3 platforms; 23 peripherals; 42 tasks; three difficulty levels 0.8
Each task is evaluated under three agent configurations (no-skills, LLM-generated skills, and human-expert skills) and validated through real hardware execution.	Research Productivity	null_result	high	evaluation configuration and validation modality	n=3 three agent configurations (no-skills, LLM-generated skills, human-expert skills); hardware validation 0.8
Across 378 hardware validated experiments, concise human-expert skills with structured expert knowledge enable near-perfect success rates across platforms.	Output Quality	positive	high	task success rate (hardware-validated)	n=378 near-perfect success rates across platforms 0.48