Autonomous coding agents speed some development tasks but do not replace engineering discipline; firms need stricter process controls, verification, and human gates. The paper proposes Agentic Agile-V and a SCOPE-V task loop to convert conversational prompts into verifiable engineering artifacts.

Agentic Agile-V: From Vibe Coding to Verified Engineering in Software and Hardware Development

Christopher Koch · May 19, 2026

Source: arXiv · Paper type: review_meta · Evidence strength: medium · Relevance: 8/10

Agentic coding systems can speed some engineering tasks but do not automatically improve outcomes; effective gains require process controls, explicit constraints, traceability, independent verification, and human approval, embodied in the proposed Agentic Agile-V and SCOPE-V frameworks.

Agentic AI coding systems can inspect repositories, plan implementation steps, edit files, call tools, run tests, and submit pull requests. These capabilities make software and hardware development faster in some settings, but current evidence does not support the simple claim that autonomous code generation automatically improves engineering outcomes. Controlled studies report productivity gains in some enterprise tasks, slowdowns in mature open-source work, moderate but heterogeneous meta-analytic effects, and persistent failures in repository setup, dependency handling, permission gating, and hardware verification. This paper argues that the central problem is no longer prompt engineering; it is engineering process control. It synthesizes evidence from agentic software engineering, GitHub-scale adoption studies, repository-level agent configuration, productivity trials, issue-resolution benchmarks, and hardware/RTL verification research. It proposes Agentic Agile-V, a process framework that uses Agile-V as the lifecycle backbone and a task-level SCOPE-V loop - Specify, Constrain, Orchestrate, Prove, Evolve, and Verify - to convert conversational intent into structured engineering artifacts and acceptance evidence. The paper contributes: (i) a taxonomy of minimum input artifacts for agentic software, firmware, and hardware work; (ii) a conversation-to-contract gate that separates exploratory dialogue from implementation; (iii) risk-adaptive feature, bug-fix, testing, and hardware workflows; and (iv) an evidence-bundle acceptance model for agent-generated artifacts. The paper concludes that agentic AI does not eliminate engineering discipline; it increases the value of requirements, constraints, traceability, independent verification, and human approval.

Summary

Main Finding

Agentic AI (coding agents that read repos, plan, run tools, and propose changes) can accelerate some engineering tasks but does not automatically improve outcomes. The primary bottleneck has shifted from prompts to process control: converting conversational intent into verifiable engineering artifacts and acceptance evidence. The paper proposes Agentic Agile-V (Agile-V lifecycle + SCOPE-V task loop) and an evidence-bundle acceptance model to close this gap.
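To make the task loop concrete, here is a minimal Python sketch of SCOPE-V as an ordered sequence of stages. Only the six stage names and their order come from the paper. The `TaskLoop` class, and the rule that a stage closes only once it has recorded an artifact, are assumptions made for illustration, not the author's implementation.

```python
from enum import Enum
from typing import Dict, Optional


class Stage(Enum):
    """The six SCOPE-V stages, in the order the paper names them."""
    SPECIFY = 1
    CONSTRAIN = 2
    ORCHESTRATE = 3
    PROVE = 4
    EVOLVE = 5
    VERIFY = 6


class TaskLoop:
    """Hypothetical tracker: a stage closes only when it leaves an artifact."""

    def __init__(self) -> None:
        self.artifacts: Dict[Stage, str] = {}

    def next_stage(self) -> Optional[Stage]:
        # First stage without a recorded artifact, or None when all are done.
        for stage in Stage:
            if stage not in self.artifacts:
                return stage
        return None

    def complete(self, stage: Stage, artifact: str) -> None:
        # Refuse to skip ahead or to close a stage with an empty artifact.
        expected = self.next_stage()
        if stage is not expected:
            raise ValueError(f"expected {expected}, got {stage}")
        if not artifact.strip():
            raise ValueError(f"{stage.name} needs a non-empty artifact")
        self.artifacts[stage] = artifact
```

Under these assumptions, calling `complete(Stage.PROVE, ...)` before the Specify, Constrain, and Orchestrate stages have recorded artifacts raises an error, which is one simple way to keep conversational intent from jumping straight to acceptance.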

Key Points

Empirical pattern: adoption is already large-scale (AIDev reports 932,791 agent-authored PRs across 116,211 repos and 72,189 developers), but productivity effects are heterogeneous.
- Google RCT: ~21% reduction in time on a complex enterprise task with AI assistance.
- METR RCT: ~19% increase in completion time for mature open-source tasks.
- 2026 meta-analysis: statistically significant but moderate average productivity effect with high heterogeneity.
Main thesis: conversational discovery (chat) is useful for eliciting intent; structured artifacts, constraints, and evidence are required for safe, auditable implementation and acceptance.
Proposed framework: Agentic Agile-V
- Macro: Agile-V lifecycle (Agile iteration + V-model verification / traceability).
- Micro: SCOPE-V task loop — Specify, Constrain, Orchestrate, Prove, Evolve, Verify.
Concrete contributions:
- Taxonomy of minimum input artifacts (intent/scope, acceptance criteria, architecture context, constraints, execution context, evidence requirements, risk class) for software, firmware, and hardware tasks.
- Conversation-to-contract gate: convert exploratory chat into a reviewed execution brief before implementation.
- Risk-adaptive workflows and acceptance gates (R0 exploratory → R3 high-assurance), with required evidence bundles for higher-risk tasks.
- Practical task workflows for features, bug fixing, tests, firmware, embedded, and hardware/RTL work.
Hardware/firmware limitations: benchmarks show weak system-level performance (e.g., low pass rates on RealBench, with pass@1 = 0% on some system-level Verilog tasks), highlighting the need for simulation, formal checks, hardware-in-the-loop (HIL) tests, and strong traceability.
Process and tooling recommendations: keep repository instructions minimal and decision-relevant (e.g., a concise AGENTS.md), require planning and tests inside agent loops, separate exploratory agents from implementation/verification agents, and build tooling for evidence generation and auditability.
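The input-artifact taxonomy, the conversation-to-contract gate, and the evidence-bundle acceptance check can be sketched together in a few lines of Python. The seven field names follow the taxonomy listed above, and R0 and R3 are the endpoints the paper names. Everything else is an assumption for illustration: the class and function names, the intermediate classes R1 and R2, and the specific evidence items tied to each risk class.

```python
from dataclasses import dataclass, fields
from typing import Dict, List, Set

# Hypothetical mapping from risk class to required evidence. The evidence
# names and the per-class sets are illustrative, not taken from the paper.
REQUIRED_EVIDENCE: Dict[str, Set[str]] = {
    "R0": set(),
    "R1": {"unit_tests"},
    "R2": {"unit_tests", "review"},
    "R3": {"unit_tests", "review", "trace_matrix", "human_approval"},
}


@dataclass
class ExecutionBrief:
    """One field per minimum input artifact in the taxonomy."""
    intent_scope: str = ""
    acceptance_criteria: str = ""
    architecture_context: str = ""
    constraints: str = ""
    execution_context: str = ""
    evidence_requirements: str = ""
    risk_class: str = ""


def missing_artifacts(brief: ExecutionBrief) -> List[str]:
    """Conversation-to-contract gate: names of fields that are still empty."""
    return [f.name for f in fields(brief) if not getattr(brief, f.name).strip()]


def missing_evidence(brief: ExecutionBrief, bundle: Set[str]) -> Set[str]:
    """Acceptance check: evidence required for the risk class but not supplied."""
    return REQUIRED_EVIDENCE[brief.risk_class] - bundle
```

In this sketch an agent would start implementation only when `missing_artifacts` returns an empty list, and a change would be accepted only when `missing_evidence` returns an empty set. Keeping the two checks separate mirrors the split between the pre-implementation gate and the post-implementation acceptance model.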

Data & Methods

Method: bounded evidence synthesis (qualitative synthesis across heterogeneous sources), not a pooled meta-analysis.
Four evidence streams included:
- Agentic software-engineering surveys and tool papers (OpenHands, GitHub agent workflows, etc.).
- Empirical studies of developer productivity and agent-authored pull requests (Google RCT, METR RCT, AIDev, Claude Code analyses).
- Repository-configuration and execution-environment studies (AGENTS.md and repo config empirical studies, RepoMaster, Git-TaskBench).
- Hardware/firmware and verification benchmarks (RealBench, FIXME, EDA/hardware surveys).
Inclusion criteria: source must address agentic code generation, repository-aware execution, productivity, configuration/state gating, hardware generation/verification, traceability, or acceptance evidence.
Representative empirical findings cited in the synthesis:
- AIDev: 932,791 agent-authored PRs (116k repos, 72k developers).
- Google RCT: ~21% reduction in completion time on a complex enterprise task.
- METR RCT: ~19% increase in completion time on mature open-source tasks.
- OpenHands + Claude 3.7: solved 48.15% of repository tasks in Git-TaskBench.
- Configuration-file studies: mixed effects. Configuration files can reduce runtime and token use but can also reduce task success when mismatched.
- Hardware benchmarks (RealBench/FIXME): very low pass rates on real-world Verilog/system-level tasks.
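A short calculation helps put the headline figures in perspective. The PR, repository, and developer counts and the two percentage changes are those cited above. The 10-hour baseline is a hypothetical value chosen only to show the sign and size of each effect. The two trials measured different tasks and populations, so the resulting hours are not directly comparable.

```python
# Adoption scale reported by AIDev.
prs, repos, developers = 932_791, 116_211, 72_189
prs_per_repo = prs / repos             # about 8.0 on average
prs_per_developer = prs / developers   # about 12.9 on average

# Relative changes in completion time from the two RCTs (approximate figures).
baseline_hours = 10.0                  # hypothetical baseline, not from the paper
enterprise_hours = baseline_hours * (1 - 0.21)   # Google RCT: ~21% less time
open_source_hours = baseline_hours * (1 + 0.19)  # METR RCT: ~19% more time

print(f"{prs_per_repo:.2f} PRs/repo, {prs_per_developer:.2f} PRs/developer")
print(f"{enterprise_hours:.1f} h vs {open_source_hours:.1f} h")
```

The averages are simple means and say nothing about how agent-authored PRs are distributed across repositories or developers.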

Implications for AI Economics

Heterogeneous productivity effects → uncertain macro gains:
- Empirical heterogeneity (enterprise vs. mature open-source) implies that aggregate productivity/AI-driven GDP effects are likely context-dependent and may be smaller than optimistic forecasts that ignore process frictions.
Shifted bottleneck raises new economic margins:
- Value shifts from raw code synthesis to process design, verification capacity, and evidence generation. Firms that invest in these capabilities (test infrastructure, traceability, environment provisioning) capture more of the agentic productivity potential.
Labor demand and skill composition:
- Increased demand for verification, test, DevOps, and process engineers who can design acceptance gates, CI/HIL pipelines, and evidence bundles.
- Possible reallocation rather than pure substitution: junior/rote coding tasks may be automated in some contexts, but experienced engineers’ time may be reallocated to higher-complexity verification and architecture.
Transaction and coordination costs:
- Agentic workflows raise coordination costs (review load, verification debt). If verification capacity does not scale with generated output, maintenance and rework can erode gains—reducing net productivity improvement and increasing organizational frictions.
Risk and tail costs affect adoption economics:
- Hardware/firmware domains carry high tail risk and regulatory exposure; realizing the expected benefits requires larger investments in formal verification and HIL testing, which increases upfront costs and lengthens payback periods.
Product and tool-market implications:
- Market opportunity for tools that produce auditable evidence bundles, structured-brief editors, repository environment capture, and risk-classification/permission gating. Pricing can reflect value from reducing verification cost and audit risk.
ROI and adoption strategy:
- Firms should treat agentic tool adoption as a process-change investment, not a plug-and-play purchase. High ROI is likely where tasks are well-scoped, reproducible, and low to medium risk; ROI is likely lower or negative in mature, tightly coupled codebases or in safety-critical hardware without commensurate verification investment.
Policy and regulation:
- Traceability and audit artifacts become economically important for regulated industries; compliance costs may increase if tools do not natively produce verifiable evidence.

Overall: Agentic AI can shift productivity frontiers, but realizing sustained economic gains requires investments in process control, verification capacity, and tooling that generates traceable evidence. Failure to make those investments risks verification debt, rework, and lower-than-expected returns.
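The verification-debt argument can be illustrated with a toy backlog model. This is a sketch under strong assumptions, not a model from the paper: changes arrive at a constant weekly rate, reviewers clear a fixed number per week, and the function name and all rates are hypothetical.

```python
def verification_backlog(generated_per_week: int,
                         review_capacity_per_week: int,
                         weeks: int) -> list:
    """Return the unreviewed-change backlog at the end of each week."""
    backlog = 0
    history = []
    for _ in range(weeks):
        # New agent-generated changes arrive first.
        backlog += generated_per_week
        # Reviewers then clear as many as capacity allows, never below zero.
        backlog -= min(backlog, review_capacity_per_week)
        history.append(backlog)
    return history
```

With 50 changes generated and 40 reviewed per week, the backlog grows by 10 each week. With 30 generated and 40 reviewed, it stays at zero. Even in this simplified form, the model shows why output that outpaces verification capacity accumulates as debt rather than as delivered value.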

Assessment

Paper Type: review_meta
Evidence Strength: medium. The paper synthesizes results from controlled trials, GitHub-scale observational studies, benchmarks, and meta-analytic summaries that collectively show heterogeneous and task-dependent effects; evidence is neither uniformly strong nor absent, but mixed and context-specific rather than establishing broad causal regularities.
Methods Rigor: medium. The work appears to be a conceptual synthesis and framework-building exercise that draws on multiple empirical sources; however, the abstract does not report a registered, systematic review or formal meta-analytic methodology, and the contribution is primarily theoretical rather than new causal identification.
Sample: Aggregated evidence from controlled productivity trials in enterprise settings, large-scale GitHub adoption and repository analyses, repository-level agent configuration studies, issue-resolution and benchmark tasks, and hardware/RTL verification research and case studies.
Themes: productivity, org_design, human_ai_collab
Generalizability:
- Findings are specific to software, firmware, and hardware engineering and may not transfer to other sectors.
- Effects vary by task type (exploratory vs mature maintenance), repository maturity, and engineering discipline (software vs hardware), limiting broad extrapolation.
- Results depend on the specific agentic systems, toolchains, and integrations studied; different models or configurations may behave differently.
- Organizational factors (process maturity, developer skill, permission models) mediate outcomes and limit generalizability across firms and teams.
- Hardware verification and dependency/permission issues introduce domain-specific constraints that reduce applicability to purely code-only workflows.

Claims (11)

Claim	Category	Direction	Confidence	Outcome	Details
Agentic AI coding systems can inspect repositories, plan implementation steps, edit files, call tools, run tests, and submit pull requests.	Other	positive	high	agent capabilities (repository inspection, planning, editing, tool use, testing, PR submission)	0.24
These capabilities make software and hardware development faster in some settings.	Task Completion Time	positive	high	speed of software and hardware development	0.24
Current evidence does not support the simple claim that autonomous code generation automatically improves engineering outcomes.	Developer Productivity	null_result	high	engineering outcomes (overall improvement from autonomous code generation)	0.24
Controlled studies report productivity gains in some enterprise tasks.	Developer Productivity	positive	high	productivity on enterprise software tasks	0.24
Controlled studies report slowdowns in mature open-source work when using agentic/code-generation systems.	Developer Productivity	negative	high	productivity/performance in mature open-source development	0.24
Meta-analytic evidence shows moderate but heterogeneous effects of agentic/code-generation tools on productivity.	Developer Productivity	mixed	high	aggregate effect on productivity across studies	moderate but heterogeneous 0.24
Agentic systems show persistent failures in repository setup, dependency handling, permission gating, and hardware verification.	Error Rate	negative	high	failure modes/errors in repository and hardware-related tasks	0.24
The central problem for agentic engineering is no longer prompt engineering; it is engineering process control.	Organizational Efficiency	neutral	high	primary bottleneck affecting agentic engineering effectiveness (process control vs prompt engineering)	0.04
Agentic Agile-V and the task-level SCOPE-V loop (Specify, Constrain, Orchestrate, Prove, Evolve, Verify) convert conversational intent into structured engineering artifacts and acceptance evidence.	Organizational Efficiency	positive	high	ability to convert conversational intent into structured artifacts and acceptance evidence	0.04
The paper provides a taxonomy of minimum input artifacts for agentic software, firmware, and hardware work; a conversation-to-contract gate; risk-adaptive workflows; and an evidence-bundle acceptance model for agent-generated artifacts.	Organizational Efficiency	neutral	high	availability of process artifacts and workflow models for agentic engineering	0.04
Agentic AI does not eliminate engineering discipline; it increases the value of requirements, constraints, traceability, independent verification, and human approval.	Organizational Efficiency	positive	high	importance/value of engineering practices (requirements, traceability, verification, human approval) in presence of agentic AI	0.24