A test-driven agent workflow that surfaces which tests to run cuts AI-induced regressions by about 70% and raises bug-resolution rates from 24% to 32% on a standard coding benchmark; merely instructing agents to follow TDD increased regressions, while an autonomous improvement loop achieved 60% resolution with zero regressions on a small subset.
AI coding agents can resolve real-world software issues, yet they frequently introduce regressions, breaking tests that previously passed. Current benchmarks focus almost exclusively on resolution rate, leaving regression behavior under-studied. This paper presents TDAD (Test-Driven Agentic Development), an open-source tool and benchmark methodology that combines abstract-syntax-tree (AST) based code-test graph construction with weighted impact analysis to surface the tests most likely affected by a proposed change. Evaluated on SWE-bench Verified with two local models (Qwen3-Coder 30B on 100 instances and Qwen3.5-35B-A3B on 25 instances), TDAD's GraphRAG workflow reduced test-level regressions by 70% (6.08% to 1.82%) and improved resolution from 24% to 32% when deployed as an agent skill. A surprising finding is that TDD prompting alone increased regressions (9.94%), revealing that smaller models benefit more from contextual information (which tests to verify) than from procedural instructions (how to do TDD). An autonomous auto-improvement loop raised resolution from 12% to 60% on a 10-instance subset with 0% regression. These findings suggest that for AI agent tool design, surfacing contextual information outperforms prescribing procedural workflows. All code, data, and logs are publicly available at https://github.com/pepealonso95/TDAD.
Summary
Main Finding
TDAD (Test-Driven Agentic Development) is an open-source tool and benchmark methodology that builds an AST-derived code–test dependency graph and performs weighted impact analysis to surface which tests an AI coding agent should run after making a change. Integrating TDAD as a lightweight “skill” (a grep-able test_map.txt plus a 20-line SKILL.md) dramatically reduces regressions and can increase effective resolution/generation in agent workflows. Key empirical results: GraphRAG+TDD reduced test-level regressions by ~70% (562 → 155 P2P failures; 6.08% → 1.82%) on 100 SWE-bench Verified instances with Qwen3-Coder 30B, while an auto-improvement loop raised resolution from 12% → 60% (10-instance subset) with 0% regression. A surprising counterfactual: TDD prompting without localization increased regressions (9.94%).
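The "~70%" headline figure is a relative reduction in the test-level regression rate, which is not quite the same as the reduction in raw failure counts. The short check below uses only the numbers quoted above; the helper name `relative_change` is ours, introduced purely for illustration.

```python
def relative_change(before: float, after: float) -> float:
    """Signed relative change from `before` to `after` (negative means a reduction)."""
    return (after - before) / before


# Figures quoted in the summary above (100 instances, Qwen3-Coder 30B).
rate_change = relative_change(6.08, 1.82)      # test-level regression rate, in percent
count_change = relative_change(562, 155)       # raw P2P failure counts
tdd_only_change = relative_change(6.08, 9.94)  # TDD prompting without the graph

print(f"regression rate, GraphRAG+TDD vs vanilla: {rate_change:+.1%}")
print(f"P2P failures,    GraphRAG+TDD vs vanilla: {count_change:+.1%}")
print(f"regression rate, TDD-only vs vanilla:     {tdd_only_change:+.1%}")
```

The rate and the count shrink by slightly different amounts, which suggests the two runs did not share the same total number of P2P tests; this would be consistent with the workflows producing patches for different sets of instances.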
Key Points
- Problem addressed
- Benchmarks for AI coding agents emphasize resolution (fixing failing tests) but largely ignore regressions (breaking previously-passing tests). Regressions are operationally critical and common.
- Core idea
- Build a code–test graph from repository ASTs and use weighted impact analysis to predict which tests are most likely affected by a proposed change; provide agents with a compact, static “test map” and a short skill instruction rather than long procedural prompts.
- Graph & scoring
- Graph schema: File, Function, Class, Test nodes with edges CONTAINS, CALLS, IMPORTS, TESTS, INHERITS.
- Impact strategies (parallel): Direct (weight 0.95), Coverage (0.80), Transitive call hops (0.70), Imports (0.50). Scores merged with a confidence term (example formula provided in paper).
- Tests selected in tiers (high/medium/low) up to a max (default 50).
- Agent integration
- Two static artifacts: test_map.txt (grep-able source→test mappings) and SKILL.md (20-line instruction: fix bug, grep test_map.txt, run related tests, fix failures). No runtime graph DB or external service required; uses grep and pytest.
- Empirical highlights
- Phase 1 (100 instances, Qwen3-Coder 30B):
- Vanilla: 562 P2P failures (6.08% test-level regression); GraphRAG+TDD: 155 failures (1.82%).
- TDD-only (procedural prompt without graph) increased P2P failures to 799 (9.94%): the “TDD prompting paradox.”
- GraphRAG led agents to abstain more (more empty patches) but produced far fewer severe regressions.
- Phase 2 (25 instances, Qwen3.5-35B-A3B + OpenCode agent):
- Resolution improved 24% → 32%; generation improved 40% → 68%; regression remained 0% on this subset.
- Auto-improvement loop:
- Autonomous iterative refinement boosted resolution from 12% → 60% on a 10-instance subset with 0% regression.
- Tooling & reproducibility
- TDAD is pip-installable (in-memory NetworkX backend by default, Neo4j optional) with zero external runtime dependencies; code and experiments are released under the MIT license.
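The weighted impact analysis described under "Graph & scoring" can be sketched as follows. This is a minimal illustration, not TDAD's implementation: the strategy weights and the cap of 50 come from the summary above, while the noisy-OR merge rule, the tier cut-offs, and the names `Hit`, `merge_scores`, and `select_tests` are assumptions made for the example (the paper's actual formula may differ).

```python
from dataclasses import dataclass

# Strategy weights as listed in the summary above.
WEIGHTS = {"direct": 0.95, "coverage": 0.80, "transitive": 0.70, "imports": 0.50}

# ASSUMPTION: tier cut-offs are illustrative; the summary does not state them.
TIERS = (("high", 0.8), ("medium", 0.5), ("low", 0.0))
MAX_TESTS = 50  # default cap mentioned in the summary


@dataclass(frozen=True)
class Hit:
    test: str          # test identifier, e.g. a pytest node id
    strategy: str      # one of the keys in WEIGHTS
    confidence: float  # 0..1, how sure the strategy is about this link


def merge_scores(hits: list[Hit]) -> dict[str, float]:
    """Combine per-strategy evidence into one score per test.

    ASSUMPTION: noisy-OR merge, so independent strategies reinforce each other.
    """
    miss: dict[str, float] = {}
    for h in hits:
        evidence = WEIGHTS[h.strategy] * h.confidence
        miss[h.test] = miss.get(h.test, 1.0) * (1.0 - evidence)
    return {test: 1.0 - m for test, m in miss.items()}


def tier_of(score: float) -> str:
    """Map a merged score onto the first tier whose floor it reaches."""
    for name, floor in TIERS:
        if score >= floor:
            return name
    return "low"


def select_tests(hits: list[Hit], max_tests: int = MAX_TESTS) -> list[tuple[str, float, str]]:
    """Rank tests by merged score (ties broken by name) and keep the top `max_tests`."""
    scores = merge_scores(hits)
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:max_tests]
    return [(test, score, tier_of(score)) for test, score in ranked]


# Hypothetical evidence for a change to a parser module.
hits = [
    Hit("tests/test_parser.py::test_parse", "direct", 1.0),
    Hit("tests/test_parser.py::test_parse", "imports", 1.0),
    Hit("tests/test_cli.py::test_main", "transitive", 0.8),
    Hit("tests/test_utils.py::test_helper", "imports", 0.6),
]
selected = select_tests(hits)
for test, score, tier in selected:
    print(f"{tier:<6} {score:.3f} {test}")
```

A noisy-OR merge was chosen here only because it keeps scores in the range 0 to 1 and rewards agreement between strategies; a simple maximum over strategies would be an equally plausible reading of "merged with a confidence term".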
Data & Methods
- Dataset
- SWE-bench Verified: human-validated subset of 500 GitHub issues; experiments used:
- Phase 1: first 100 instances (diverse Python repos).
- Phase 2: 25 instances selected for diversity.
- Models / Agents
- Phase 1: Qwen3-Coder 30B (4-bit quantized) on consumer hardware, deterministic decoding.
- Phase 2: Qwen3.5-35B-A3B (4-bit, MoE) with OpenCode v1.2.24 agent on Apple Silicon.
- Design choice: smaller/local models to show method works without frontier-scale models and to stress context constraints.
- Indexing pipeline
- AST Parser (stdlib ast): extract functions, classes, imports, call targets.
- Graph Builder: create nodes/edges, resolve module-scoped calls and imports.
- Test Linker: three prioritized strategies link tests to code (naming conventions, stem prefix matching, directory proximity), with special handling for monolithic test modules.
- Impact analysis & selection
- Four strategies (direct, transitive, coverage, imports) produce scores merged with a confidence term; tests ranked and tiered. Profiles (conservative/balanced/aggressive) adjust weights.
- Evaluation protocol & metrics
- SWE-bench Docker harness: apply patch to baseline repo, run FAIL_TO_PASS (resolution) then PASS_TO_PASS (regression) tests in isolation.
- Metrics: resolution rate (F2P pass), generation rate (non-empty patch), test-level regression rate (total P2P failures / total P2P tests), instance-level regression rate (fraction of generated patches with ≥1 P2P failure). Test-level regression is emphasized as the primary metric.
- Notable methodological choices
- Minimal runtime requirements (grep + pytest) to ease agent integration.
- Emphasis on test-level regression rate to capture severity of regressions (e.g., breaking 1 vs. 322 tests).
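The four metrics listed above can be computed from per-instance results roughly as follows. This is a sketch under stated assumptions: the `InstanceResult` record and `summarize` function are hypothetical names, the toy data is invented for illustration (only the "1 vs. 322" contrast is taken from the text), and we assume P2P tests are counted only for instances where a patch was generated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceResult:
    patch_generated: bool  # agent produced a non-empty patch
    f2p_all_pass: bool     # every FAIL_TO_PASS test passes after the patch
    p2p_total: int         # PASS_TO_PASS tests run for this instance
    p2p_failed: int        # PASS_TO_PASS tests that fail after the patch


def summarize(results: list[InstanceResult]) -> dict[str, float]:
    """Compute the four metrics following the definitions given above."""
    n = len(results)
    generated = [r for r in results if r.patch_generated]
    # ASSUMPTION: only generated patches contribute P2P tests to the denominator.
    p2p_total = sum(r.p2p_total for r in generated)
    p2p_failed = sum(r.p2p_failed for r in generated)
    regressed = sum(1 for r in generated if r.p2p_failed > 0)
    return {
        "resolution_rate": sum(1 for r in results if r.f2p_all_pass) / n,
        "generation_rate": len(generated) / n,
        "test_level_regression": p2p_failed / p2p_total if p2p_total else 0.0,
        "instance_level_regression": regressed / len(generated) if generated else 0.0,
    }


def toy_run(failures_in_bad_patch: int) -> list[InstanceResult]:
    """Illustrative run: four instances, exactly one regressing patch."""
    return [
        InstanceResult(True, True, 400, failures_in_bad_patch),
        InstanceResult(True, False, 400, 0),
        InstanceResult(True, True, 400, 0),
        InstanceResult(False, False, 0, 0),  # empty patch, nothing to test
    ]


mild = summarize(toy_run(1))      # the bad patch breaks 1 test
severe = summarize(toy_run(322))  # the bad patch breaks 322 tests

for name, metrics in (("mild", mild), ("severe", severe)):
    print(name, {k: round(v, 4) for k, v in metrics.items()})
```

Both toy runs have the same instance-level regression rate, because each contains exactly one regressing patch, but their test-level rates differ by more than two orders of magnitude. That gap is the severity signal the test-level metric is meant to capture.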
Implications for AI Economics
- Evaluation & incentives
- Benchmarks should make regression rates first-class alongside resolution. Current leaderboard focus on resolution incentivizes aggressive but unsafe patches; incorporating regression penalties aligns agent development with maintainer priorities and reduces downstream costs from CI failures and rejected PRs.
- Operational cost reductions
- Running full test suites at CI scale is expensive. TDAD’s targeted test selection can reduce the number of tests agents run while catching most regressions, lowering compute/time costs for validation and enabling cheaper automated patch vetting.
- Productization & tooling market
- Lightweight, repo-local skills like TDAD (test_map + SKILL.md) are easily packaged and deployed, creating product opportunities for “safety” or “regression-aware” agent add-ons. Firms or open-source providers could commercialize refined impact analyzers, test-mapping services, or skill marketplaces.
- Model selection & value of context
- For smaller or constrained models, providing targeted contextual signals (which tests to verify) is more valuable than long procedural prompts. This affects cost-benefit tradeoffs: investing in repository-aware tooling can be more cost-effective than upgrading to larger (costly) models.
- Labor & maintenance economics
- Reducing regressions increases the likelihood that maintainers accept agent-authored patches, improving productivity gains from coding agents and reducing time spent on code review/rollbacks. Conversely, high regression rates increase maintenance overhead and reduce trust.
- Incentives for autonomous improvement
- The demonstrated auto-improvement loop raises the prospect of agents that iteratively tune their own safety tooling. That could lower operational costs for evolving agents but raises governance questions (validation, versioning, liability) that affect deployment economics.
- Externalities & adoption risk
- If toolkits or agents optimize only for resolution, systemic costs (more CI failures, rework, slower deployment) follow. Introducing regression-aware metrics internalizes some externalities, making economic outcomes (faster merges, fewer rollbacks) more favorable.
- Limitations and generalizability (economic caveat)
- Results are from Python repos in SWE-bench Verified and on specific Qwen models/agent frameworks; economic benefits will vary by codebase size, test suite structure, and model capabilities. Nevertheless, the approach is low-cost to adopt (pip-installable, no runtime services) and thus has high potential return-on-investment in many settings.
Assessment
Claims (11)
| Claim | Category | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|---|
| AI coding agents can resolve real-world software issues, yet they frequently introduce regressions, breaking tests that previously passed. | Error Rate | mixed | medium | ability to resolve issues (resolution rate) and regression rate (tests broken) | 0.29 |
| Current benchmarks focus almost exclusively on resolution rate, leaving regression behavior under-studied. | Output Quality | negative | medium | coverage of regression measurement in existing benchmarks | 0.29 |
| TDAD (Test-Driven Agentic Development) combines abstract-syntax-tree (AST) based code-test graph construction with weighted impact analysis to surface the tests most likely affected by a proposed change. | Developer Productivity | positive | high | identification/surfacing of tests likely impacted by code changes (test prioritization) | 0.48 |
| Evaluation was performed on SWE-bench Verified with two local models: Qwen3-Coder 30B on 100 instances and Qwen3.5-35B-A3B on 25 instances. | Other | null_result | high | evaluation sample size / benchmark coverage (number of instances per model) | n=125; 0.48 |
| TDAD's GraphRAG workflow reduced test-level regressions by 70% (from 6.08% to 1.82%). | Error Rate | positive | medium | test-level regression rate (percentage of tests that regressed) | n=125; reduced 70% (from 6.08% to 1.82%); 0.29 |
| When deployed as an agent skill, GraphRAG improved resolution from 24% to 32%. | Developer Productivity | positive | medium | resolution rate (percentage of issues/problems resolved) | n=125; improved from 24% to 32%; 0.29 |
| TDD (test-driven development) prompting alone increased regressions to 9.94%. | Error Rate | negative | medium | regression rate (percentage of tests that regressed) under TDD prompting | n=125; 9.94% regression rate; 0.29 |
| Smaller models benefit more from contextual information (which tests to verify) than from procedural instructions (how to do TDD). | Developer Productivity | positive | medium | relative improvement in regression rate and resolution when providing contextual test information vs procedural TDD prompting for smaller models | n=125; 0.29 |
| An autonomous auto-improvement loop raised resolution from 12% to 60% on a 10-instance subset with 0% regression. | Developer Productivity | positive | medium | resolution rate (increase from 12% to 60%) and regression rate (reported as 0%) on the 10-instance subset | n=10; resolution from 12% to 60%, 0% regression; 0.29 |
| For AI agent tool design, surfacing contextual information outperforms prescribing procedural workflows. | Developer Productivity | positive | medium | effectiveness in reducing regressions and improving resolution when using contextual information vs procedural instructions | n=125; 0.29 |
| All code, data, and logs are publicly available at https://github.com/pepealonso95/TDAD. | Other | null_result | high | availability of code, data, and logs (public repository) | 0.48 |