
Orchestrated human–AI workflows sharply cut effort and defects: across three real modernization projects, the Chiron platform reduced modeled effort from 1080 to 232.5 person-days and senior-equivalent effort from 1080 to 139.5 days, cut validation-stage issues by about 74% (8.03 to 2.09 per 100 tasks), and raised first-release coverage from 77% to 90.5%. The largest gains appeared when AI was embedded in the end-to-end delivery flow rather than used as an isolated coding assistant.

Orchestrating Human-AI Software Delivery: A Retrospective Longitudinal Field Study of Three Software Modernization Programs

Maximiliano Armesto, Christophe Kolb · March 20, 2026

arXiv · quasi_experimental · medium evidence · relevance 8/10

Embedding AI agents inside an orchestrated delivery platform (Chiron) across four workflow stages substantially reduced modeled effort and validation issues while increasing first-release coverage on three real software modernization programs compared with a traditional baseline.

Evidence on AI in software engineering still leans heavily toward individual task completion, while evidence on team-level delivery remains scarce. We report a retrospective longitudinal field study of Chiron, an industrial platform that coordinates humans and AI agents across four delivery stages: analysis, planning, implementation, and validation. The study covers three real software modernization programs -- a COBOL banking migration (~30k LOC), a large accounting modernization (~400k LOC), and a .NET/Angular mortgage modernization (~30k LOC) -- observed across five delivery configurations: a traditional baseline and four successive platform versions (V1--V4). The benchmark separates observed outcomes (stage durations, task volumes, validation-stage issues, first-release coverage) from modeled outcomes (person-days and senior-equivalent effort under explicit staffing scenarios). Under baseline staffing assumptions, portfolio totals move from 36.0 to 9.3 summed project-weeks; modeled raw effort falls from 1080.0 to 232.5 person-days; modeled senior-equivalent effort falls from 1080.0 to 139.5 SEE-days; validation-stage issue load falls from 8.03 to 2.09 issues per 100 tasks; and first-release coverage rises from 77.0% to 90.5%. V3 and V4 add acceptance-criteria validation, repository-native review, and hybrid human-agent execution, simultaneously improving speed, coverage, and issue load. The evidence supports a central thesis: the largest gains appear when AI is embedded in an orchestrated workflow rather than deployed as an isolated coding assistant.

Summary

Main Finding

In a retrospective longitudinal field study of an industrial delivery platform (Chiron), progressively more orchestrated human–AI workflows (V1 → V4) across three software modernization programs were associated with large portfolio-level gains: elapsed time fell from 36.0 to 9.3 summed project-weeks, modeled raw person-days fell from 1080.0 to 232.5, modeled senior-equivalent effort fell from 1080.0 to 139.5 SEE-days, validation-stage issues fell from 8.03 to 2.09 issues per 100 tasks, and task-weighted first-release coverage rose from 77.0% to 90.5%. The largest gains appeared only after acceptance-criteria validation, repository-native review, and hybrid human–agent execution were added (V3–V4), suggesting that embedding AI in an orchestrated workflow yields bigger team-level delivery improvements than isolated agentic coding assistance alone. The study is descriptive, single-organization, and retrospective — informative but not causal.

Key Points

Scope: three modernization programs (COBOL→Python bank app ~30k LOC; large accounting ACAS ~400k LOC; .NET→.NET8 mortgage app ~30k LOC) observed across five delivery configurations: Traditional baseline, V1–V4 (progressive platform evolution).
Versions:
- V1: tool-centric agent use (analysis, docs, autonomous task execution) — early speed gains but worse downstream quality.
- V2: CLI orchestration — modest improvement.
- V3: web workspace, task-centric orchestration, first-generation acceptance-criteria validation — sharp improvements in implementation & validation.
- V4: repository authentication, branches/PRs, repo-native review, doc ingestion, hybrid human–agent execution — best balance of speed, coverage, and quality.
Quantitative headline changes (portfolio-level): weeks 36.0 → 9.3 (3.87× speedup); raw person-days 1080.0 → 232.5 (−78.5%); SEE-days 1080.0 → 139.5 (−87.1%); issues/100 tasks 8.03 → 2.09 (−74.0%); coverage 77.0% → 90.5% (+13.4 percentage points).
V1→V4 (same nominal agentic staffing) shows a 3.08× speedup, 67.5% reductions in time and modeled effort, 75.8% fall in issue-load, and +37.9 pp coverage — indicating orchestration (not merely agent availability) drives later gains.
Stage-level effects: largest and most consistent reductions in Analysis and Planning; implementation/validation shorter but less uniformly so.
Review containment under V4: roughly 51% of pre-validation issues are caught in review (portfolio-weighted), shifting defect discovery earlier and lowering expensive rework.
Sensitivity: modeled SEE reductions robust across plausible junior-to-senior weighting assumptions.
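
The headline ratios listed in the key points can be re-derived from the reported portfolio endpoints. The Python sketch below is illustrative only: the function and variable names are hypothetical, and the inputs are the rounded values quoted in this summary, not the study's underlying data. With these rounded endpoints the coverage change comes out at 13.5 percentage points rather than the reported +13.4; the gap presumably comes from rounding of the endpoints, which is an assumption here.

```python
# Portfolio-level endpoints quoted in this summary (Traditional baseline -> V4).
# Values are rounded as reported; all names are illustrative, not from the paper.
BASELINE = {
    "weeks": 36.0,
    "raw_person_days": 1080.0,
    "see_days": 1080.0,
    "issues_per_100_tasks": 8.03,
    "coverage_pct": 77.0,
}
V4 = {
    "weeks": 9.3,
    "raw_person_days": 232.5,
    "see_days": 139.5,
    "issues_per_100_tasks": 2.09,
    "coverage_pct": 90.5,
}


def speedup(before: float, after: float) -> float:
    """Ratio of elapsed time before vs. after (3.87 means 3.87x faster)."""
    return before / after


def pct_change(before: float, after: float) -> float:
    """Signed percentage change relative to the starting value."""
    return 100.0 * (after - before) / before


def pp_change(before_pct: float, after_pct: float) -> float:
    """Difference between two percentages, in percentage points."""
    return after_pct - before_pct


if __name__ == "__main__":
    print(f"speedup: {speedup(BASELINE['weeks'], V4['weeks']):.2f}x")
    for key in ("raw_person_days", "see_days", "issues_per_100_tasks"):
        print(f"{key}: {pct_change(BASELINE[key], V4[key]):+.1f}%")
    delta = pp_change(BASELINE["coverage_pct"], V4["coverage_pct"])
    print(f"coverage: {delta:+.1f} pp")
```

Keeping percentage change and percentage-point change as separate helpers avoids mixing the two units when comparing coverage with the other metrics.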

Data & Methods

Design: retrospective longitudinal field study; observational unit = project × version cell (15 cells total: 3 projects × 5 configs). Descriptive analysis, no hypothesis tests.
Data provenance: assembled from engineering records and practitioner recall. The authors explicitly treat reconstructed values conservatively; not an instrumented, contemporaneous telemetry dataset.
Measured (observed) outcomes:
- Stage durations for Analysis, Planning, Implementation, Validation ($\tau_{p,v,s}$) and their summed total $T_{p,v} = \sum_s \tau_{p,v,s}$.
- Backlog task counts $N_{p,v}$.
- Issues reaching downstream validation $I_{p,v}$.
- First-release coverage $C_{p,v}$ (fraction of requirements completed and accepted at initial handoff).
Normalized quality metric: $L_{p,v} = 100 \times I_{p,v} / N_{p,v}$ (issues per 100 tasks), interpreted as a downstream escape rate, not intrinsic defect density.
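
As a minimal sketch of the normalized quality metric, the Python snippet below computes issues per 100 tasks for one project × version cell. The counts used in the example are hypothetical, since the per-cell values are not listed in this summary, and the pooled variant (total issues over total tasks) is one assumed way to aggregate across projects, not necessarily the authors' method.

```python
from typing import Iterable, Tuple


def issue_load(issues: int, tasks: int) -> float:
    """Issues reaching validation per 100 backlog tasks: 100 * I / N."""
    if tasks <= 0:
        raise ValueError("tasks must be positive")
    return 100.0 * issues / tasks


def pooled_issue_load(cells: Iterable[Tuple[int, int]]) -> float:
    """Assumed portfolio aggregation: pool issues and tasks, then normalize.

    Each cell is an (issues, tasks) pair for one project.
    """
    cells = list(cells)
    total_issues = sum(i for i, _ in cells)
    total_tasks = sum(n for _, n in cells)
    return issue_load(total_issues, total_tasks)


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(issue_load(12, 150))
    print(pooled_issue_load([(12, 150), (3, 50)]))
```

Pooling weights each project by its task volume, so a large program dominates the portfolio figure; averaging the per-project loads instead would weight every project equally.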
Modeled effort scenarios (used only for staffing-normalized estimates):
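- Traditional team: 6 people (1 architect, 2 backend, 1 frontend, 2 QA). Raw person-days: $E_{\mathrm{raw}} = 5 \cdot h \cdot T_{p,v}$, where $h$ is headcount, 5 is working days per week, and $T_{p,v}$ is in weeks. This matches the reported baseline total: 5 × 6 × 36.0 = 1080.0 person-days.
- Agentic team: 5 people (1 senior architect, 2 junior AI operators, 2 junior QA). Senior-equivalent effort: $E_{\mathrm{SEE}} = 5 \cdot \sigma \cdot T_{p,v}$, where $\sigma$ is the team's summed senior-equivalent weight. With the baseline junior weight of 0.5 senior-equivalent, $\sigma = 1 + 4 \times 0.5 = 3$, which matches the reported V4 total: 5 × 3 × 9.3 = 139.5 SEE-days.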
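
The two staffing scenarios can be sketched in a few lines of Python. This is a reconstruction under stated assumptions: the factor 5 is read as working days per week and h as headcount, both of which are consistent with the reported totals, and the traditional team is counted at full senior weight because its reported SEE-days equal its raw person-days. Function names are hypothetical.

```python
WORKING_DAYS_PER_WEEK = 5  # assumed meaning of the constant 5 in the formulas


def raw_person_days(headcount: int, weeks: float) -> float:
    """E_raw = 5 * h * T, with T in summed project-weeks."""
    return WORKING_DAYS_PER_WEEK * headcount * weeks


def senior_equivalent_weight(seniors: int, juniors: int,
                             junior_weight: float = 0.5) -> float:
    """sigma: each senior counts 1.0, each junior counts junior_weight."""
    return seniors + juniors * junior_weight


def see_days(sigma: float, weeks: float) -> float:
    """E_SEE = 5 * sigma * T, with T in summed project-weeks."""
    return WORKING_DAYS_PER_WEEK * sigma * weeks


if __name__ == "__main__":
    # Traditional baseline: 6 people over 36.0 summed project-weeks.
    print(f"baseline raw: {raw_person_days(6, 36.0):.1f} person-days")
    # Agentic team at V4: 5 people (1 senior, 4 juniors) over 9.3 weeks.
    print(f"V4 raw:       {raw_person_days(5, 9.3):.1f} person-days")
    sigma = senior_equivalent_weight(seniors=1, juniors=4)
    print(f"V4 SEE:       {see_days(sigma, 9.3):.1f} SEE-days")
```

Because both effort measures are linear in elapsed weeks, every modeled saving traces back to two inputs only: the observed durations and the assumed team composition.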
Robustness checks: project-level trajectories, V1 vs V4 contrast (controls for same nominal staffing), sensitivity to junior-to-senior weight, leave-one-project-out aggregation.
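
The junior-to-senior weight sensitivity check can be illustrated with a small sweep. The sketch below assumes the effort model described in this summary (5 working days per week, an agentic team of 1 senior and 4 juniors, 9.3 summed project-weeks at V4, 1080.0 SEE-days at baseline). The weights other than 0.5 are hypothetical choices for illustration, not the range the authors tested.

```python
BASELINE_SEE_DAYS = 1080.0   # reported Traditional baseline
V4_WEEKS = 9.3               # reported V4 summed project-weeks
WORKING_DAYS_PER_WEEK = 5    # assumed meaning of the constant 5
SENIORS, JUNIORS = 1, 4      # agentic team composition from the summary


def v4_see_days(junior_weight: float) -> float:
    """Modeled V4 senior-equivalent effort for a given junior weight."""
    sigma = SENIORS + JUNIORS * junior_weight
    return WORKING_DAYS_PER_WEEK * sigma * V4_WEEKS


def see_reduction_pct(junior_weight: float) -> float:
    """Percentage reduction in SEE-days at V4 relative to the baseline."""
    return 100.0 * (1.0 - v4_see_days(junior_weight) / BASELINE_SEE_DAYS)


if __name__ == "__main__":
    for w in (0.25, 0.5, 0.75, 1.0):  # illustrative weights
        print(f"junior weight {w:.2f}: "
              f"{v4_see_days(w):6.1f} SEE-days, "
              f"reduction {see_reduction_pct(w):.1f}%")
```

At a junior weight of 1.0 the senior-equivalent measure coincides with raw person-days, so the modeled reduction cannot fall below the roughly 78.5% raw-effort reduction under these assumptions.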
Limitations explicitly noted by authors: single organization, not randomized, versions serially dependent, some values reconstructed from recall, no post-release lifetime cost or reliability measurement, no isolation of specific model/prompt effects.

Implications for AI Economics

System-level complementarity matters: Productivity gains from LLMs in software are not purely additive at the task level; large team-level returns materialize when AI is embedded into orchestrated workflows (decomposition, acceptance-criteria validation, repo-native review). This implies complementarities between AI capabilities and organizational processes.
Reallocation of labor and roles: Modeled senior-equivalent effort fell by 87.1% (1080.0 to 139.5 SEE-days) across the platform evolution. If generalizable, firms could reduce elapsed time and senior labor hours, shifting demand toward roles that design and operate orchestration layers, validate acceptance criteria, and manage hybrid human–AI processes (i.e., coordination, review, and quality-containment roles).
Cost structure and speed vs. quality trade-offs: Early-stage tool adoption (V1–V2) produced speed gains but worsened downstream quality and coverage, illustrating a potential short-run productivity illusion. Investment in orchestration (V3–V4) appears necessary to convert local gains into durable, lower-cost delivery without higher late-stage defects.
Measurement challenges for AI-driven productivity: Traditional single-metric productivity measures can be misleading. This study highlights the need for multi-dimensional metrics (elapsed time, downstream escapes per unit work, first-release coverage, containment points) and workload-normalized measures when comparing human and hybrid teams.
Market and organizational implications: Faster delivery with improved first-release coverage reduces time-to-market and rework costs, altering pricing, contract structures (e.g., fixed-price vs. time-and-materials), and competitive dynamics in software services and modernization markets. Returns to scale may increase for firms that successfully build orchestration capabilities.
Caution on generalizability and causality: Results are descriptive for one platform and org. Economists and policymakers should avoid extrapolating precise magnitudes without multi-firm, randomized, or longitudinal causal studies that capture post-release reliability, maintenance costs, and lifecycle impacts.
Research priorities suggested:
- Multi-organization and randomized trials to disentangle component effects (model capability vs orchestration vs operator skill).
- Lifecycle studies measuring post-release reliability, maintenance costs, and long-run labor demand.
- Economic modeling of complementarities between AI, coordination capital, and skilled labor to forecast sectoral labor shifts and upskilling needs.

Overall, this study provides descriptive industrial evidence that orchestration and process integration amplify the productive and quality benefits of agentic tools in software delivery — an economically important complement to raw model capability that changes how AI impacts labor, costs, and organizational design in software production.

Assessment

Paper Type: quasi_experimental
Evidence Strength: medium. The study uses real-world projects and multiple measurable outcomes showing large, consistent improvements as the platform evolved, which supports a credible link between platform changes and performance; however, lack of randomization, small sample (three projects), potential temporal confounders (learning, parallel process improvements), and reliance partly on modeled effort limit causal certainty.
Methods Rigor: medium. Rigorous longitudinal measurement of stage-level outcomes and explicit modeled staffing scenarios increase transparency, and multiple platform versions allow dose–response-like assessment; nonetheless, the study lacks an experimental design or quasi-experimental controls, has limited sample heterogeneity, and depends on assumptions in the effort-modeling that are not shown to be validated externally.
Sample: Three industrial software modernization programs observed under five delivery configurations (traditional baseline plus platform versions V1–V4): a COBOL banking migration (~30k LOC), a large accounting modernization (~400k LOC), and a .NET/Angular mortgage modernization (~30k LOC); data include stage durations, task volumes, validation-stage issue counts, and first-release coverage, with additional modeled outcomes of person-days and senior-equivalent effort under standardized staffing scenarios.
Themes: productivity, human_ai_collab
Identification: Retrospective longitudinal within-project comparison across five delivery configurations (baseline, V1–V4) for three real software modernization programs; observed outcomes (stage durations, task volumes, validation issues, first-release coverage) are measured directly, while effort is estimated via explicit staffing scenarios to normalize across configurations; causal inference relies on before–after changes across successive platform versions and consistency of effects across projects rather than random assignment or instrumental variables.
Generalizability:
- Small sample (three projects) limits statistical generalization to broader populations of software projects.
- All cases are software modernizations; results may not transfer to greenfield development or non-modernization contexts.
- Platform-specific features (Chiron) and particular technical stacks (COBOL, .NET/Angular) may limit applicability to other tools or languages.
- Potential organizational- and team-specific practices (processes, skills, incentives) may be confounded with platform effects.
- Modeled effort reductions depend on staffing assumptions that may not hold in other organizational settings.

Claims (10)

Claim	Category	Direction	Confidence	Outcome	Details
Evidence on AI in software engineering still leans heavily toward individual task completion, while evidence on team-level delivery remains scarce.	Research Productivity	null_result	high	distribution of prior evidence (individual task vs team-level delivery) in the literature	0.24
Portfolio totals move from 36.0 to 9.3 summed project-weeks under baseline staffing assumptions (across the three studied programs and five delivery configurations).	Task Completion Time	positive	high	summed project-weeks (portfolio time)	n=3 36.0 to 9.3 summed project-weeks 0.48
Modeled raw effort falls from 1080.0 to 232.5 person-days under the platform configurations studied (baseline -> V4 aggregate).	Developer Productivity	positive	high	raw effort (person-days)	n=3 1080.0 to 232.5 person-days 0.48
Modeled senior-equivalent effort falls from 1080.0 to 139.5 SEE-days under the platform configurations studied.	Developer Productivity	positive	high	senior-equivalent effort (SEE-days)	n=3 1080.0 to 139.5 SEE-days 0.48
Validation-stage issue load falls from 8.03 to 2.09 issues per 100 tasks across the portfolio as platform versions progress.	Error Rate	positive	high	validation-stage issues per 100 tasks	n=3 8.03 to 2.09 issues per 100 tasks 0.48
First-release coverage rises from 77.0% to 90.5% across the portfolio as platform versions progress.	Output Quality	positive	high	first-release coverage (percent of tasks covered on first release)	n=3 77.0% to 90.5% 0.48
V3 and V4 add acceptance-criteria validation, repository-native review, and hybrid human-agent execution, simultaneously improving speed, coverage, and issue load.	Team Performance	positive	high	stage durations (speed), first-release coverage, validation-stage issue load	n=3 0.48
The largest gains appear when AI is embedded in an orchestrated workflow rather than deployed as an isolated coding assistant.	Organizational Efficiency	positive	high	aggregate team/organizational performance (speed, coverage, issue load) when AI is embedded in workflow vs used as isolated assistant	n=3 0.48
The study covers three real software modernization programs: a COBOL banking migration (~30k LOC), a large accounting modernization (~400k LOC), and a .NET/Angular mortgage modernization (~30k LOC).	Other	null_result	high	study programs and codebase sizes (lines of code)	n=3 0.48
The study observes five delivery configurations: a traditional baseline and four successive platform versions (V1–V4).	Adoption Rate	null_result	high	delivery configuration variations (baseline, V1–V4)	n=3 0.48