In a single industrial case study, AI coding models generated roughly 16,000 lines of reliable unit tests (382 test cases) in hours, and the authors report reduced regression risk during a major refactor; supervised, iterative test generation constrained model-driven code changes and cut manual labor on a codebase that began as a prototype.

AI-Assisted Unit Test Writing and Test-Driven Code Refactoring: A Case Study

Ema Smolic, Mario Brcic, Luka Hobor, Mihael Kovac · April 03, 2026

Source: arXiv · Paper type: descriptive · Evidence strength: low · Relevance: 7/10

In a single industrial case study, coding models were used to generate ~16,000 reliable unit-test lines in hours and to support supervised refactoring that reached up to 78% branch coverage in critical modules, materially reducing regression risk during large-scale changes.

Many software systems originate as prototypes or minimum viable products (MVPs), developed with an emphasis on delivery speed and responsiveness to changing requirements rather than long-term code maintainability. While effective for rapid delivery, this approach can result in codebases that are difficult to modify, presenting a significant opportunity cost in the era of AI-assisted or even AI-led programming. In this paper, we present a case study of using coding models for automated unit test generation and subsequent safe refactoring, with proposed code changes validated by passing tests. The study examines best practices for iteratively generating tests to capture existing system behavior, followed by model-assisted refactoring under developer supervision. We describe how this workflow constrained refactoring changes, the errors and limitations observed in both phases, the efficiency gains achieved, when manual intervention was necessary, and how we addressed the weak value misalignment we observed in models. Using this approach, we generated nearly 16,000 lines of reliable unit tests in hours rather than weeks, achieved up to 78% branch coverage in critical modules, and significantly reduced regression risk during large-scale refactoring. These results illustrate software engineering's shift toward an empirical science, emphasizing data collection and constraining mechanisms that support fast, safe iteration.

Summary

Main Finding

AI coding models can rapidly build large, reliable unit-test suites and then use those tests to enable safe, large-scale, model-assisted refactoring in a production frontend codebase. In this single-case study the authors generated ~16k lines of tests in hours (instead of weeks), achieved up to 78% branch coverage in targeted modules, and performed an invasive refactor that reduced coupling and average cyclomatic complexity while preserving behavior via test-guarding. Key success factors were a planner/executor multi-agent setup, persistent rule files, mutation testing to remove ineffective tests, and a human reviewer at the end of iterative loops.
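The planner/executor loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Planner`, `Executor`, and `Verifier` shapes, the `planActVerify` function, and the iteration cap are all hypothetical names introduced here. It shows only the control flow: plan, act, verify, feed failures back to the planner, and hand the result to a human reviewer whatever the automated outcome.

```typescript
// Hypothetical types; none of these names come from the paper.
type Task = { file: string; instruction: string };
type VerifyResult = { passed: boolean; log: string };

interface Planner {
  // Produces the next batch of file-level tasks from the previous round's feedback.
  plan(feedback: string[]): Task[];
}
interface Executor {
  // Applies one task (for example, writes one test file) and reports what it did.
  act(task: Task): string;
}
// Stands in for an automated check such as running the test suite.
type Verifier = () => VerifyResult;

type LoopOutcome = { iterations: number; passed: boolean; needsHumanReview: true };

function planActVerify(
  planner: Planner,
  executor: Executor,
  verify: Verifier,
  maxIterations: number,
): LoopOutcome {
  let feedback: string[] = [];
  let passed = false;
  let iterations = 0;
  while (iterations < maxIterations && !passed) {
    iterations += 1;
    const tasks = planner.plan(feedback);
    for (const task of tasks) executor.act(task);
    const result = verify();
    passed = result.passed;
    // Failures are fed back so the next plan can target them.
    feedback = passed ? [] : [result.log];
  }
  // An automated pass is not final: a human inspects the result either way.
  return { iterations, passed, needsHumanReview: true };
}

// Toy run with stub agents: the verifier fails once, then passes.
let runs = 0;
const demo = planActVerify(
  { plan: (fb) => [{ file: "a.spec.ts", instruction: fb.length ? "fix" : "write" }] },
  { act: (t) => `${t.instruction} ${t.file}` },
  () => {
    runs += 1;
    return { passed: runs >= 2, log: "1 failing test" };
  },
  5,
);
console.log(demo);
```

Capping the iterations is a design choice of this sketch: it keeps a loop that cannot converge from running unattended and forces escalation to the reviewer.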

Key Points

Problem: Many commercial systems start as MVPs with little testing; this limits the effectiveness of AI-assisted development and makes refactors risky.
Two-stage workflow:
- AI-assisted test-suite construction (Plan → Act → Verify loop with a strong planner and cheaper executors; persistent rule files; mutation testing to prune ineffective tests).
- AI-assisted, test-guarded refactor (models refactor code but must not alter tests; human-in-loop for approval/fix after iterations).
Models & tooling:
- Planner: Gemini 2.5 Pro (large context window) via CLI.
- Executors: Cursor integrated models (Auto mode).
- Persistent project rule files (e.g., GEMINI.md, .cursorrules) and iterative Markdown plans.
Test-generation outcomes:
- 87 test files, 382 individual test cases.
- ~11,000 LOC of test specifications; >16,000 LOC including mocks/fixtures.
- 78.12% branch coverage and 67.85% line coverage in targeted, logic-heavy modules.
- Test infrastructure: centralized mocks, API-mocking layer, global harness.
Refactor outcomes:
- Frontend/src: 18,619 → 21,624 LOC (+16.1%); file count 237 → 263 (+26).
- 120 files deleted and 146 added (net +26), indicating structural redistribution rather than simple growth.
- Routing layer (src/app) shrank 17,872 → 6,201 LOC (−65.3%).
- Internal imports in routing: 893 → 379 (−57.5% dependency density).
- Functions/components increased 806 → 1,022; average cyclomatic complexity decreased 2.24 → 2.13 (routing-layer average after the refactor: 1.97).
- Result: clearer layered architecture (features, shared, domains), lower coupling, modest LOC growth driven by modularization.
Failure modes & mitigation:
- Value misalignment: models often produced many low-quality or ineffective tests; addressed via explicit rules, mutation testing, and iterative pruning.
- Hallucinations and error introduction risk; mitigated by test-guarding and human review.
Validity limits: single React/Next.js frontend case; outcomes may not generalize to other stacks or legacy backends.
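The mutation-testing mitigation listed above can be illustrated with a small sketch. The summary does not state the exact pruning criterion the authors used, so this assumes the simplest one: a test that kills no mutants is a candidate for removal. The `KillMatrix` shape, the function names, and the sample data are hypothetical and do not correspond to any particular mutation-testing tool's output.

```typescript
// Hypothetical data shape: for each test name, the IDs of the mutants it killed.
type KillMatrix = Record<string, string[]>;

// Splits tests into those that killed at least one mutant and those that killed none.
function partitionTests(kills: KillMatrix): { effective: string[]; ineffective: string[] } {
  const effective: string[] = [];
  const ineffective: string[] = [];
  for (const test of Object.keys(kills)) {
    const killed = kills[test] ?? [];
    if (killed.length > 0) effective.push(test);
    else ineffective.push(test);
  }
  return { effective, ineffective };
}

// Fraction of all generated mutants killed by at least one test.
function mutationScore(kills: KillMatrix, totalMutants: number): number {
  const seen: Record<string, true> = {};
  for (const test of Object.keys(kills)) {
    for (const id of kills[test] ?? []) seen[id] = true;
  }
  return totalMutants === 0 ? 0 : Object.keys(seen).length / totalMutants;
}

// Illustrative data only: four mutants were generated, and m4 survived every test.
const sample: KillMatrix = {
  "parses valid date": ["m1", "m2"],
  "renders without crashing": [], // passes no matter how the code is mutated
  "rejects empty input": ["m2", "m3"],
};
const split = partitionTests(sample);
const score = mutationScore(sample, 4);
console.log(split.ineffective, score);
```

A test flagged this way is not necessarily wrong; it may cover code that produced no mutants. That is one reason to keep a human reviewer in the pruning step rather than deleting automatically.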

Data & Methods

System under study: commercial React/Next.js frontend (~19k LOC, mostly TypeScript/TSX).
Architecture of experiment:
- Hierarchical multi-agent pipeline: planner (large-context model) prepares iteration plans; executors implement file-level changes subject to persistent rules.
- Plan-Act-Verify loop with automated test runs as the primary verifier during generation; a human inspected each iteration after the automated checks passed.
- Mutation testing used to detect and remove ineffective tests.
- Rule/config files enforced naming, mocking, and no-modification-of-source-code constraints (except rare, approved renames).
Metrics used:
- Test metrics: number of test files/cases, test LOC, branch & line coverage.
- Code metrics: file counts, LOC, AST-derived measures (import counts; cyclomatic complexity), number of functions/components.
- Structural analysis via AST parsing to measure coupling and modularity changes.
Quantitative outcomes (selected):
- Tests: 87 spec files, 382 tests, ~11k spec LOC, >16k total test-related LOC.
- Coverage: up to 78.12% branch coverage in targeted modules.
- Refactor: routing layer LOC −65.3%; overall frontend/src LOC +3,005 (16.1%); internal imports in routing −57.5%; avg cyclomatic complexity 2.24 → 2.13.
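As a quick consistency check, the reported percentages can be recomputed from the raw before/after figures quoted in this summary. The sketch below uses only those numbers; the helper names `pctChange` and `round1` are introduced here for illustration.

```typescript
// Relative change from `before` to `after`, in percent.
function pctChange(before: number, after: number): number {
  return ((after - before) / before) * 100;
}

// Rounds to one decimal place.
const round1 = (x: number): number => Math.round(x * 10) / 10;

const deltas = {
  frontendLoc: round1(pctChange(18619, 21624)), // reported as +16.1%
  routingLoc: round1(pctChange(17872, 6201)), // reported as -65.3%
  routingImports: round1(pctChange(893, 379)), // reported as -57.5%
  locAdded: 21624 - 18619, // reported as +3,005
  fileCountChange: 263 - 237, // reported as +26
  netFilesAdded: 146 - 120, // files added minus files deleted
};
console.log(deltas);
```

The LOC and file-count figures reproduce exactly. The import reduction computes to about 57.56%, which rounds to 57.6% and truncates to the reported 57.5%, so the figures agree up to the rounding convention.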

Implications for AI Economics

Productivity gains and opportunity cost reduction
- Large time savings: producing test suites in hours rather than weeks reduces the calendar time and developer hours spent on safety infrastructure, enabling faster feature cycles and cheaper refactors.
- Firms that adopt similar pipelines may obtain first-mover advantages in maintaining and scaling previously brittle MVPs with lower overhead.
Reallocation of developer labor and task composition
- Routine, repetitive work (writing scaffolding tests, some refactor edits) can be automated; human roles shift toward rule/spec creation, supervision, validation, and higher-level design.
- Demand likely rises for engineers who can build evaluation harnesses, run mutation testing, write domain-specific rule files, and oversee LLM outputs, all skills that complement the models.
Quality, risk, and robustness economics
- Test-guarded model changes reduce regression risk, making otherwise costly refactors economically viable.
- However, superficial automation risks producing test suites that codify existing bugs or capture poor behaviors; deterministic evaluation frameworks (mutation testing, explicit metrics) are economically valuable public goods that increase trust.
Market and product implications
- Commercial opportunities for toolchains: multi-agent orchestration, model-harness rule repositories, automated mutation testing, and verification-as-a-service.
- Startups/firms with early adoption can reduce maintenance costs and accelerate product-market fit for legacy MVPs; incumbents may invest to retain advantages.
Labor market and distributional effects
- Potential downward pressure on demand for junior developers whose early work is heavily testing/refactoring; counterbalanced by new demand for higher-skilled roles (LLM engineers, verification engineers).
- Geographic and organizational distribution: firms that can invest in model/tooling capture productivity gains; smaller shops without such capabilities may lag or outsource.
Policy and governance considerations
- Need standards and reproducible evaluation metrics (coverage, mutation-resilience thresholds) to avoid “audit-washing” where tests exist but are ineffective.
- Liability and accountability: if AI-assisted refactors introduce defects that pass an AI-generated test suite, legal and contractual frameworks must address where responsibility lies.
Limits and caution
- Single-case evidence: generalization requires replication across stacks (backends, compiled languages, legacy code).
- Hidden costs: initial investment in rule files, planner/executor setup, and human supervisory effort is nontrivial; small projects may not realize net gains.
- Value misalignment and model behavior: without strong, codified objectives and deterministic checks, the economic value of automation can be eroded by low-quality outputs that require manual remediation.
Research and market gaps worth investing in
- Standardized benchmarks for AI-generated test quality (beyond coverage) and economically meaningful thresholds.
- Tooling for easy creation/versioning of persistent rule files and for mutation-test orchestration.
- Empirical studies across technology stacks to measure general equilibrium effects on developer productivity and labor demand.

Summary take: this case study illustrates that well-designed AI-assisted pipelines can materially lower the cost and risk of refactoring by producing reliable tests quickly and performing refactors under test-guarding. This creates measurable productivity gains and new economic opportunities, but it also raises questions about labor reallocation, governance, and the need for robust evaluation standards.

Assessment

Paper Type: descriptive
Evidence Strength: low. The paper is a single-case, observational study without a control group or randomized assignment; while it reports concrete metrics (lines of tests, coverage, time savings), there is no counterfactual or systematic comparison to alternative workflows, so causal claims about productivity or risk reduction are suggestive but not strongly supported.
Methods Rigor: medium. The study presents a clear, repeatable workflow, quantitative metrics (test lines, branch coverage, time), and documented error modes and human interventions, which shows reasonable methodological care for a case study; however, it lacks experimental controls, detailed replication materials (e.g., exact prompts, model versions, dataset snapshots), and formal statistical analysis, limiting rigor.
Sample: A single industrial case study on a prototype/MVP software system (several 'critical modules') undergoing large-scale refactoring; authors applied coding models to iteratively generate unit tests and assist refactoring under developer supervision, producing ~16,000 lines of tests in hours and achieving up to 78% branch coverage in targeted modules.
Themes: productivity, human_ai_collab, org_design
Generalizability:
- Single proprietary codebase: results may not hold across different projects or domains.
- MVP/prototype code characteristics (e.g., architecture, testability) may differ from mature systems.
- Outcomes depend on the specific coding models, prompts, and toolchain used (not universally reproducible).
- Developer expertise and supervision level influence success: results may not generalize to less experienced teams.
- May not apply to safety-critical, highly stateful, or non-deterministic systems where test oracles are hard to define.

Claims (7)

Claim	Category	Direction	Confidence	Outcome	Details
Using coding models, we generated nearly 16,000 lines of reliable unit tests in hours rather than weeks.	Developer Productivity	positive	high	unit test lines generated	nearly 16,000 lines 0.18
The generated tests achieved up to 78% branch coverage in critical modules.	Output Quality	positive	high	branch coverage	78% branch coverage 0.18
This approach significantly reduced regression risk during large-scale refactoring.	Error Rate	positive	medium	regression risk	0.05
The described workflow constrained refactoring changes and enabled model-assisted refactoring under developer supervision, with proposed code changes validated by passing tests.	Organizational Efficiency	positive	high	constrained refactoring changes (safety of refactoring)	0.18
The study observed errors and limitations in both phases (test generation and refactoring), and manual intervention was necessary at times.	Error Rate	negative	high	occurrence of errors and need for manual intervention	0.09
The paper documents best practices for iteratively generating tests to capture existing system behavior before model-assisted refactoring.	Training Effectiveness	positive	high	effectiveness of iterative test-generation workflow	0.18
The authors observed weak value misalignment in the coding models and describe how they addressed it.	AI Safety and Ethics	mixed	medium	model value alignment / alignment mitigation	0.02