TDAD: Test-Driven Agentic Development - Reducing Code Regressions in AI Coding Agents via Graph-Based Impact Analysis

AI coding agents can resolve real-world software issues, yet they frequently introduce regressions, breaking tests that previously passed. Current benchmarks focus almost exclusively on resolution rate, leaving regression behavior under-studied. This paper presents TDAD (Test-Driven Agentic Development), an open-source tool and benchmark methodology that combines abstract-syntax-tree (AST) based code-test graph construction with weighted impact analysis to surface the tests most likely affected by a proposed change. Evaluated on SWE-bench Verified with two local models (Qwen3-Coder 30B on 100 instances and Qwen3.5-35B-A3B on 25 instances), TDAD's GraphRAG workflow reduced test-level regressions by 70% (6.08% to 1.82%) and improved resolution from 24% to 32% when deployed as an agent skill. A surprising finding is that TDD prompting alone increased regressions (9.94%), revealing that smaller models benefit more from contextual information (which tests to verify) than from procedural instructions (how to do TDD). An autonomous auto-improvement loop raised resolution from 12% to 60% on a 10-instance subset with 0% regression. These findings suggest that for AI agent tool design, surfacing contextual information outperforms prescribing procedural workflows. All code, data, and logs are publicly available at https://github.com/pepealonso95/TDAD.

Summary

Main Finding

TDAD (Test-Driven Agentic Development) is an open-source tool and benchmark methodology that builds an AST-derived code–test dependency graph and performs weighted impact analysis to surface the tests an AI coding agent should run after making a change. Integrating TDAD as a lightweight “skill” (a grep-able test_map.txt plus a 20-line SKILL.md) substantially reduces regressions and, in the Phase 2 experiments, also raised resolution and patch-generation rates. Key empirical results: GraphRAG+TDD reduced test-level regressions by ~70% (562 → 155 PASS_TO_PASS (P2P) failures; 6.08% → 1.82%) on 100 SWE-bench Verified instances with Qwen3-Coder 30B, and an auto-improvement loop raised resolution from 12% to 60% on a 10-instance subset with 0% regression. A surprising counterexample: TDD prompting without graph-based localization increased regressions (9.94%).

Key Points

Problem addressed
- Benchmarks for AI coding agents emphasize resolution (fixing failing tests) but largely ignore regressions (breaking previously-passing tests). Regressions are operationally critical and common.
Core idea
- Build a code–test graph from repository ASTs and use weighted impact analysis to predict which tests are most likely affected by a proposed change; provide agents with a compact, static “test map” and a short skill instruction rather than long procedural prompts.
Graph & scoring
- Graph schema: File, Function, Class, Test nodes with edges CONTAINS, CALLS, IMPORTS, TESTS, INHERITS.
- Impact strategies (parallel): Direct (weight 0.95), Coverage (0.80), Transitive call hops (0.70), Imports (0.50). Scores merged with a confidence term (example formula provided in paper).
- Tests selected in tiers (high/medium/low) up to a max (default 50).
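As a rough illustration of the weighting and tiering described above, the following Python sketch merges per-strategy hits into one score per test. The four weights and the default cap of 50 come from this summary; the merge rule (strongest strategy weight scaled by a per-hit confidence), the tier thresholds, and every function and test name are assumptions made for illustration, not the paper's formula.

```python
from collections import defaultdict

# Strategy weights as reported in the summary.
STRATEGY_WEIGHTS = {
    "direct": 0.95,
    "coverage": 0.80,
    "transitive": 0.70,
    "imports": 0.50,
}

# Hypothetical tier cut-offs; the paper's actual thresholds are not given here.
TIER_THRESHOLDS = (("high", 0.80), ("medium", 0.55), ("low", 0.0))


def merge_scores(hits, max_tests=50):
    """Merge (test, strategy, confidence) hits into ranked, tiered tests.

    Assumed merge rule: a test's score is the maximum over its hits of
    weight * confidence, with confidence in [0, 1].
    """
    best = defaultdict(float)
    for test, strategy, confidence in hits:
        score = STRATEGY_WEIGHTS[strategy] * confidence
        best[test] = max(best[test], score)

    # Highest score first; ties broken by test name for stable output.
    ranked = sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))[:max_tests]
    result = []
    for test, score in ranked:
        tier = next(name for name, floor in TIER_THRESHOLDS if score >= floor)
        result.append((test, round(score, 3), tier))
    return result


hits = [
    ("tests/test_parser.py::test_parse", "direct", 1.0),
    ("tests/test_parser.py::test_parse", "imports", 1.0),
    ("tests/test_cli.py::test_main", "transitive", 0.9),
    ("tests/test_utils.py::test_helper", "imports", 0.8),
]
selected = merge_scores(hits)
for row in selected:
    print(row)
```

Taking the maximum rather than a sum keeps a test that is reached by several weak signals from outranking one that is reached directly; other merge rules are equally plausible.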
Agent integration
- Two static artifacts: test_map.txt (grep-able source→test mappings) and SKILL.md (20-line instruction: fix bug, grep test_map.txt, run related tests, fix failures). No runtime graph DB or external service required; uses grep and pytest.
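The lookup step an agent performs against test_map.txt can be sketched in a few lines of Python. The file format shown here (`source -> test`, one mapping per line) is an assumption, as are the paths and function names; the summary only states that the file is grep-able and maps source files to tests.

```python
import shlex

# Hypothetical test_map.txt contents; the real file format may differ.
SAMPLE_TEST_MAP = """\
src/pkg/parser.py -> tests/test_parser.py
src/pkg/parser.py -> tests/integration/test_pipeline.py
src/pkg/utils.py -> tests/test_utils.py
"""


def related_tests(test_map_text, changed_file):
    """Return test files mapped to changed_file, similar to grepping for it."""
    tests = []
    for line in test_map_text.splitlines():
        if changed_file not in line or "->" not in line:
            continue
        source, test = (part.strip() for part in line.split("->", 1))
        # Require an exact match on the source side to avoid partial hits.
        if source == changed_file and test not in tests:
            tests.append(test)
    return tests


def pytest_command(tests):
    """Build a pytest invocation for the selected test files."""
    return "pytest " + " ".join(shlex.quote(t) for t in tests)


tests = related_tests(SAMPLE_TEST_MAP, "src/pkg/parser.py")
print(pytest_command(tests))
```

Because the artifact is a flat text file, the same lookup works with plain `grep` in a shell, which is what keeps the integration free of runtime services.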
Empirical highlights
- Phase 1 (100 instances, Qwen3-Coder 30B):
  - Vanilla: 562 P2P failures (6.08% test-level regression); GraphRAG+TDD: 155 failures (1.82%).
  - TDD-only (procedural prompt without graph) increased P2P failures to 799 (9.94%), the “TDD prompting paradox.”
  - GraphRAG led agents to abstain more (more empty patches) but produced far fewer severe regressions.
- Phase 2 (25 instances, Qwen3.5-35B-A3B + OpenCode agent):
  - Resolution improved 24% → 32%; generation improved 40% → 68%; regression remained 0% on this subset.
- Auto-improvement loop:
  - Autonomous iterative refinement boosted resolution from 12% → 60% on a 10-instance subset with 0% regression.
Tooling & reproducibility
- TDAD is pip-installable (NetworkX in-memory backend by default, Neo4j optional) and requires no external runtime services; code and experiments are released under the MIT license.

Data & Methods

Dataset
- SWE-bench Verified: human-validated subset of 500 GitHub issues; experiments used:
  - Phase 1: first 100 instances (diverse Python repos).
  - Phase 2: 25 instances selected for diversity.
Models / Agents
- Phase 1: Qwen3-Coder 30B (4-bit quantized) on consumer hardware, deterministic decoding.
- Phase 2: Qwen3.5-35B-A3B (4-bit, MoE) with OpenCode v1.2.24 agent on Apple Silicon.
- Design choice: smaller/local models to show method works without frontier-scale models and to stress context constraints.
Indexing pipeline
- AST Parser (stdlib ast): extract functions, classes, imports, call targets.
- Graph Builder: create nodes/edges, resolve module-scoped calls and imports.
- Test Linker: three strategies, applied in priority order, link tests to code: naming conventions, stem prefix matching, and directory proximity; monolithic test modules receive special handling.
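The parsing stage of this pipeline can be approximated with the standard-library `ast` module. The sketch below is a minimal illustration, not TDAD's implementation: the sample source, the `extract_symbols` name, and the choice to record only the final name of each call target are all assumptions.

```python
import ast

# Hypothetical module used only to exercise the extractor.
SAMPLE_SOURCE = '''
import os
from pkg.utils import helper

class Parser:
    def parse(self, text):
        return helper(text.strip())

def main():
    return Parser().parse(os.getcwd())
'''


def extract_symbols(source):
    """Collect functions, classes, imports, and call targets from source."""
    tree = ast.parse(source)
    info = {"functions": [], "classes": [], "imports": [], "calls": []}
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            info["functions"].append(node.name)
        elif isinstance(node, ast.ClassDef):
            info["classes"].append(node.name)
        elif isinstance(node, ast.Import):
            info["imports"].extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            module = node.module or ""
            info["imports"].extend(f"{module}.{alias.name}" for alias in node.names)
        elif isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name):
                info["calls"].append(func.id)        # e.g. helper(...)
            elif isinstance(func, ast.Attribute):
                info["calls"].append(func.attr)      # e.g. obj.parse(...)
    # De-duplicate and sort for stable output.
    return {key: sorted(set(values)) for key, values in info.items()}


symbols = extract_symbols(SAMPLE_SOURCE)
for kind, names in symbols.items():
    print(kind, names)
```

A graph builder would then turn these names into nodes and CONTAINS, CALLS, and IMPORTS edges; resolving a bare call name such as `parse` to a specific definition is the harder step and is not attempted here.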
Impact analysis & selection
- Four strategies (direct, transitive, coverage, imports) produce scores merged with a confidence term; tests ranked and tiered. Profiles (conservative/balanced/aggressive) adjust weights.
Evaluation protocol & metrics
- SWE-bench Docker harness: apply patch to baseline repo, run FAIL_TO_PASS (resolution) then PASS_TO_PASS (regression) tests in isolation.
- Metrics: resolution rate (F2P pass), generation rate (non-empty patch), test-level regression rate (total P2P failures / total P2P tests), instance-level regression rate (fraction of generated patches with ≥1 P2P failure). Test-level regression is emphasized as the primary metric.
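The four metrics can be computed from per-instance results as in the Python sketch below. The record fields and sample data are hypothetical, and restricting the test-level denominator to instances with a generated patch is an assumption; the summary does not spell out that detail. The last function checks the reported relative reduction from 6.08% to 1.82%.

```python
from dataclasses import dataclass


@dataclass
class InstanceResult:
    """Outcome of one benchmark instance (field names are hypothetical)."""
    patch_generated: bool  # a non-empty patch was produced
    f2p_passed: bool       # FAIL_TO_PASS tests pass after the patch
    p2p_total: int         # PASS_TO_PASS tests run
    p2p_failed: int        # PASS_TO_PASS tests that fail after the patch


def compute_metrics(results):
    """Return resolution, generation, and both regression rates."""
    n = len(results)
    generated = [r for r in results if r.patch_generated]
    p2p_total = sum(r.p2p_total for r in generated)
    p2p_failed = sum(r.p2p_failed for r in generated)
    return {
        "resolution_rate": sum(r.f2p_passed for r in results) / n,
        "generation_rate": len(generated) / n,
        "test_level_regression": p2p_failed / p2p_total if p2p_total else 0.0,
        "instance_level_regression": (
            sum(r.p2p_failed > 0 for r in generated) / len(generated)
            if generated else 0.0
        ),
    }


def relative_reduction(before, after):
    """Fractional drop from before to after, e.g. 0.70 for a 70% reduction."""
    return (before - after) / before


results = [
    InstanceResult(True, True, 100, 0),
    InstanceResult(True, False, 50, 5),
    InstanceResult(False, False, 0, 0),
    InstanceResult(True, True, 50, 0),
]
metrics = compute_metrics(results)
print(metrics)
print(round(relative_reduction(6.08, 1.82), 3))  # reported rates from the summary
```

In this toy data one instance breaks 5 of 200 tests, so the test-level rate (2.5%) and the instance-level rate (one of three patches) diverge, which is the severity distinction the test-level metric is meant to capture.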
Notable methodological choices
- Minimal runtime requirements (grep + pytest) to ease agent integration.
- Emphasis on test-level regression rate to capture severity of regressions (e.g., breaking 1 vs. 322 tests).

Implications for AI Economics

Evaluation & incentives
- Benchmarks should make regression rates first-class alongside resolution. Current leaderboard focus on resolution incentivizes aggressive but unsafe patches; incorporating regression penalties aligns agent development with maintainer priorities and reduces downstream costs from CI failures and rejected PRs.
Operational cost reductions
- Running full test suites at CI scale is expensive. TDAD’s targeted test selection can reduce the number of tests agents run while catching most regressions, lowering compute/time costs for validation and enabling cheaper automated patch vetting.
Productization & tooling market
- Lightweight, repo-local skills like TDAD (test_map + SKILL.md) are easily packaged and deployed, creating product opportunities for “safety” or “regression-aware” agent add-ons. Firms or open-source providers could commercialize refined impact analyzers, test-mapping services, or skill marketplaces.
Model selection & value of context
- For smaller or constrained models, providing targeted contextual signals (which tests to verify) is more valuable than long procedural prompts. This affects cost-benefit tradeoffs: investing in repository-aware tooling can be more cost-effective than upgrading to larger (costly) models.
Labor & maintenance economics
- Reducing regressions increases the likelihood that maintainers accept agent-authored patches, improving productivity gains from coding agents and reducing time spent on code review/rollbacks. Conversely, high regression rates increase maintenance overhead and reduce trust.
Incentives for autonomous improvement
- The demonstrated auto-improvement loop raises the prospect of agents that iteratively tune their own safety tooling. That could lower operational costs for evolving agents but raises governance questions (validation, versioning, liability) that affect deployment economics.
Externalities & adoption risk
- If toolkits or agents optimize only for resolution, systemic costs (more CI failures, rework, slower deployment) follow. Introducing regression-aware metrics internalizes some externalities, making economic outcomes (faster merges, fewer rollbacks) more favorable.
Limitations and generalizability (economic caveat)
- Results are from Python repos in SWE-bench Verified and on specific Qwen models/agent frameworks; economic benefits will vary by codebase size, test suite structure, and model capabilities. Nevertheless, the approach is low-cost to adopt (pip-installable, no runtime services) and thus has high potential return-on-investment in many settings.


Assessment

Paper Type: quasi_experimental

Evidence Strength: medium. The paper reports large relative improvements (e.g., 70% reduction in regressions, resolution gains) from controlled experiments on a public benchmark and uses multiple ablations (TDD prompting, GraphRAG, auto-improvement). However, sample sizes are modest (100, 25, and a 10-instance subset), experiments are limited to two local Qwen models and a single benchmark, and the paper does not report extensive statistical testing or external validation on diverse codebases, which weakens causal generality.

Methods Rigor: medium. The methodology combines AST-based code-test graph construction with weighted impact analysis and evaluates the approach in ablation-style comparisons; code and logs are public. Strengths include an explicit algorithmic workflow and multiple experimental conditions. Limitations include small and uneven sample sizes across conditions, limited model diversity, unclear randomization/holding-out protocol, sparse reporting of statistical confidence intervals or hypothesis tests, and potential tuning on the same benchmark used for evaluation.

Sample: Evaluations are run on SWE-bench Verified benchmark instances; main results use Qwen3-Coder 30B on 100 instances and Qwen3.5-35B-A3B on 25 instances, with an additional 10-instance subset used for the autonomous auto-improvement loop; experiments use local model deployments and the project's open-source code, data, and logs.

Themes: productivity, human_ai_collab

Identification: Controlled comparative experiments on a public benchmark (SWE-bench Verified) that apply different agent workflows (baseline, TDD prompting, TDAD GraphRAG, and an auto-improvement loop) to the same problem instances and measure outcomes (test-level regressions, resolution rate); causal inference rests on within-instance comparisons under different treatments rather than randomization or external instruments.

Generalizability
- Limited number and variety of LLM models (only two Qwen variants); results may not hold for other architectures or hosted APIs.
- Benchmarks may not represent production-scale or large, complex codebases (SWE-bench Verified is a research benchmark).
- Small sample sizes for some experiments (25 and 10-instance conditions) limit robustness and statistical power.
- Potential overfitting or calibration to the same benchmark used for method development.
- Quality and structure of test suites in the benchmark may not mirror real-world tests (affects regression measurement).
- Unclear how agent integration and workflow overhead would affect developer productivity in real teams.

Claims (11)

Claim	Category	Direction	Confidence	Outcome	Details
AI coding agents can resolve real-world software issues, yet they frequently introduce regressions, breaking tests that previously passed.	Error Rate	mixed	medium	ability to resolve issues (resolution rate) and regression rate (tests broken)	0.29
Current benchmarks focus almost exclusively on resolution rate, leaving regression behavior under-studied.	Output Quality	negative	medium	coverage of regression measurement in existing benchmarks	0.29
TDAD (Test-Driven Agentic Development) combines abstract-syntax-tree (AST) based code-test graph construction with weighted impact analysis to surface the tests most likely affected by a proposed change.	Developer Productivity	positive	high	identification/surfacing of tests likely impacted by code changes (test prioritization)	0.48
Evaluation was performed on SWE-bench Verified with two local models: Qwen3-Coder 30B on 100 instances and Qwen3.5-35B-A3B on 25 instances.	Other	null_result	high	evaluation sample size / benchmark coverage (number of instances per model)	n=125 0.48
TDAD's GraphRAG workflow reduced test-level regressions by 70% (from 6.08% to 1.82%).	Error Rate	positive	medium	test-level regression rate (percentage of tests that regressed)	n=125 reduced 70% (from 6.08% to 1.82%) 0.29
When deployed as an agent skill, GraphRAG improved resolution from 24% to 32%.	Developer Productivity	positive	medium	resolution rate (percentage of issues/problems resolved)	n=125 improved from 24% to 32% 0.29
TDD (test-driven development) prompting alone increased regressions to 9.94%.	Error Rate	negative	medium	regression rate (percentage of tests that regressed) under TDD prompting	n=125 9.94% regression rate 0.29
Smaller models benefit more from contextual information (which tests to verify) than from procedural instructions (how to do TDD).	Developer Productivity	positive	medium	relative improvement in regression rate and resolution when providing contextual test information vs procedural TDD prompting for smaller models	n=125 0.29
An autonomous auto-improvement loop raised resolution from 12% to 60% on a 10-instance subset with 0% regression.	Developer Productivity	positive	medium	resolution rate (increase from 12% to 60%) and regression rate (reported as 0%) on the 10-instance subset	n=10 resolution from 12% to 60%, 0% regression 0.29
For AI agent tool design, surfacing contextual information outperforms prescribing procedural workflows.	Developer Productivity	positive	medium	effectiveness in reducing regressions and improving resolution when using contextual information vs procedural instructions	n=125 0.29
All code, data, and logs are publicly available at https://github.com/pepealonso95/TDAD.	Other	null_result	high	availability of code, data, and logs (public repository)	0.48