Cleaner code does not make autonomous coding agents more likely to pass tests, but it makes them notably more efficient: repositories with fewer static-analysis violations reduce an agent's token use by about 7–8% and slash file revisitations by roughly one-third, implying cleaner code lowers compute and navigational overhead even when correctness is unchanged.
As autonomous coding agents see rapid adoption, their evaluation has primarily focused on task completion rates while holding the target codebase fixed. This leaves a critical question unanswered: does the structural and stylistic quality, or “cleanliness,” of the underlying code affect an agent's ability to navigate and modify it? To isolate the effect of code cleanliness from agent capability, we introduce an evaluation protocol built around minimal pairs: repositories that match on architecture, dependencies, and external behaviour, but differ on static-analysis rule violations and cognitive complexity. The pairs are constructed in both directions, by agent pipelines that either degrade a clean repository or clean a messy one. We author 33 tasks across six such pairs, evaluated through hidden tests at the application's public surface. Across 660 trials with Claude Code, code cleanliness does not change the agent's pass rate. However, it substantially alters the agent's operational footprint: agents working on cleaner code use 7 to 8% fewer tokens and reduce file revisitations by 34%. Our findings suggest that traditional maintainability principles remain highly relevant in the era of AI-driven development, shaping the computational cost and navigational efficiency of coding agents. Code cleanliness joins model choice, harness, and prompting as a factor that materially affects agent behaviours.
Summary
Main Finding
Cleaning a codebase does not meaningfully change a coding agent’s success rate on tasks, but it materially reduces the agent’s operational footprint. In experiments with Claude Sonnet 4.6 across 33 agentic tasks and six minimal-pair repositories, cleaner code reduced token-equivalent metrics by ~7–8% and file revisitations by ~34%, while pass rate was effectively unchanged (−0.9 percentage points aggregated).
Key Points
- Research question: Do structural/stylistic properties of code (“cleanliness”) affect agent correctness (RQ1) or operational footprint (RQ2), and does topology (single dense region vs. multi-module) matter (RQ3)?
- Cleanliness proxy: SonarQube static-analysis rule violations and cognitive-complexity density.
- Minimal-pair design: For each app, two behaviorally equivalent repositories were produced that differ mainly in cleanliness:
  - Slopify: degrades a clean repo by introducing SonarQube-flagged issues while preserving tests.
  - Vibeclean: cleans an unclean repo by fixing SonarQube-flagged issues while preserving tests.
- Task design: 33 tasks across 6 pairs (13 cognitive-hotspot tasks, 14 multi-module tasks, 6 calibration tasks) that:
  - Route agents through regions with the largest cleanliness differences,
  - Describe required changes in externally observable terms (no internal names),
  - Validate via hidden tests at the application’s public surface.
- Experimental protocol:
  - Agent: Claude Code (Claude Sonnet 4.6), default toolset, blind to repo side.
  - Trials: 10 runs per side per task → 660 trials; outlier filter removed ~9.7% of trials.
  - Metrics recorded: pass rate; input & output tokens; reasoning characters; conversation turns and pre-edit exploration measures; files read; file revisits; lines edited.
- Aggregate quantitative results (micro-averaged across tasks):
  - Pass rate: −0.9 percentage points (no practical change).
  - Input tokens: −7.1%.
  - Output tokens: −8.5%.
  - Reasoning characters: −11.1%.
  - Conversation turns: −7.0%.
  - File revisitations: −33.8%.
  - Lines edited: −3.2% (small change).
- Heterogeneity: Per-repo effects vary substantially (e.g., input-token reductions ranged from modest on some repos to very large on others). Some repos showed small positive or negative pass-rate changes, but there was no consistent correctness advantage.
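The percentages above are relative changes of the clean side against the messy side (the pass rate is the exception, being a difference in percentage points). A minimal sketch of that arithmetic follows. It assumes that micro-averaging means pooling all trials before taking the ratio; the function name and the sample revisit counts are illustrative and not taken from the study.

```python
def relative_change(clean_values, messy_values):
    """Percent change of the clean side relative to the messy side.

    Assumption: values are pooled (summed) across all trials first,
    which is one common reading of "micro-averaged".
    """
    clean_total = sum(clean_values)
    messy_total = sum(messy_values)
    if messy_total == 0:
        raise ValueError("messy-side total is zero; relative change is undefined")
    return 100.0 * (clean_total - messy_total) / messy_total


# Trial count from the protocol: 33 tasks x 2 sides x 10 runs per side.
TASKS, SIDES, RUNS_PER_SIDE = 33, 2, 10
total_trials = TASKS * SIDES * RUNS_PER_SIDE

# Made-up file-revisit counts, for illustration only.
clean_revisits = [2, 1, 3, 2]
messy_revisits = [3, 2, 4, 3]
change = relative_change(clean_revisits, messy_revisits)

print(f"total trials: {total_trials}")
print(f"file revisits, clean vs. messy: {change:.1f}%")
```

A negative result means the clean side consumed less of the metric than the messy side.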
Data & Methods
- Construction:
  - Six minimal pairs (3 primarily Java, 3 primarily Python), three public and three private (to reduce memorization risk).
  - Clean/messy sides verified to be behaviorally equivalent via tests.
- Pipelines:
  - Slopify (3 phases: build, explore, transform) introduces plausible organic mess (inlining helpers, duplication, dead code) but rejects edits that break tests.
  - Vibeclean (2 phases: build, clean) mechanically resolves analyzer issues and may extract helpers, remove dead code, etc.
- Task generation:
  - Multi-agent assisted pipeline to map differences, propose task outlines, curate and convert to task + hidden tests + reference implementation; tasks that failed sanity checks were rewritten/dropped.
- Evaluation:
  - Containerized sandboxes with bounded resources and identical environments except the source tree.
  - Metrics collected per-trial and micro-averaged across tasks. Outlier trials (>50% off per-task median) were removed before averaging.
- Limitations called out by authors:
  - Single-model focus (Claude Sonnet 4.6); other models may react differently.
  - SonarQube is a proxy, not a full formalization of “cleanliness.”
  - Slopify and Vibeclean produce asymmetrical edits; the two directions are not exact inverses.
  - Task set and repo choices limit generalizability; private repos protect against memorization but reduce reproducibility.
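The outlier rule described under Evaluation can be sketched as follows. This is one possible reading, not the authors' implementation: it assumes the filter is applied to a single numeric metric per task, that "more than 50% off" means an absolute deviation greater than half the median, and that the boundary is inclusive. The function name and sample values are hypothetical.

```python
from statistics import median


def filter_outliers(values, tolerance=0.5):
    """Keep values whose distance from the median is at most tolerance * median.

    Assumption: the summary does not say which metric the filter is applied
    to or how boundary cases are treated; both choices here are illustrative.
    """
    if not values:
        return []
    mid = median(values)
    return [v for v in values if abs(v - mid) <= tolerance * abs(mid)]


# Made-up input-token counts for the trials of one task.
trials = [100, 110, 90, 400, 95]
kept = filter_outliers(trials)
removed_share = 100.0 * (len(trials) - len(kept)) / len(trials)

print(kept)
print(f"removed {removed_share:.0f}% of trials")
```

With these sample values the median is 100, so only the 400-token trial falls outside the 50 to 150 band.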
Implications for AI Economics
- Direct operational cost impact:
  - Token usage is a first-order driver of LLM-run costs. A consistent ~7–8% reduction in token consumption per run translates nearly linearly into cost savings per agent invocation (since input tokens dominated total token footprints in prior work).
  - Example framing: if input-dominated runs are common, a 7% token reduction reduces per-run token bills by roughly 7% (before accounting for provider pricing granularity, caching, or billing policy).
- Latency and tooling costs:
  - Fewer file revisitations (~34% reduction) implies fewer tool calls, less I/O and less back-and-forth, which can reduce wall-clock time and orchestration overhead. That influences engineering throughput and infrastructure costs (fewer API/tool calls, less compute time).
- Scale multiplies small per-run savings:
  - For organizations running many agent tasks (CI, automated refactoring, large-scale code generation), modest per-run savings compound to meaningful monthly/annual OPEX reductions.
- Investment decision tradeoffs:
  - Correctness unaffected: cleaning does not increase the probability an agent completes a task correctly, so cleanliness investments are not justified by higher agent correctness alone.
  - Efficiency-driven ROI: organizations should evaluate cleanliness investments primarily through an efficiency lens (token/latency/operational costs) and possibly developer productivity (less agent churn, clearer diffs).
  - Prioritize cleaning hotspots: since tasks routed through messy hotspots incur the observed overheads, focus cleanup efforts on regions frequently visited by agents (e.g., shared services, orchestrators, core libraries).
- Tooling and governance:
  - Code-quality gates, linters, and automated cleaning (Vibeclean-style tools) can be an economic lever alongside model selection, prompting, and caching strategies.
  - Incorporate codebase cleanliness as a variable in cost-forecasting models for agent-driven dev workflows and MLOps budgets.
- Research & policy implications:
  - Firms and benchmarkers should include codebase properties (cleanliness) when estimating operating costs for agentic workflows.
  - Further studies should quantify monetary savings under realistic pricing, evaluate other models and agent toolchains, and compare cleanup costs vs. expected savings to inform optimal investment levels.
- Short, actionable takeaways:
  - If you run many automated agent tasks, cleaning the codebases agents touch is likely to lower your LLM token bill and reduce time per task, even if it won’t make agents more correct.
  - Begin by identifying agent-visited hotspots and adding gating/linting or automated fixes for those modules to maximize cost-effectiveness.
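To make the cost framing concrete, the sketch below applies the reported aggregate reductions (input tokens −7.1%, output tokens −8.5%) to a single run and estimates how many runs it takes to recoup a cleanup investment. Everything else is an assumption made purely for illustration: the baseline token counts, the per-million-token prices, and the cleanup cost are hypothetical, and caching, pricing tiers, and billing policy are ignored.

```python
import math

# Aggregate reductions reported in the summary.
INPUT_CHANGE = -0.071
OUTPUT_CHANGE = -0.085


def run_cost(input_tokens, output_tokens, price_in_per_mtok, price_out_per_mtok):
    """Token bill for one run, given prices per million tokens."""
    return (input_tokens * price_in_per_mtok
            + output_tokens * price_out_per_mtok) / 1_000_000


def break_even_runs(cleanup_cost, saving_per_run):
    """Number of runs needed before cumulative savings cover the cleanup cost."""
    if saving_per_run <= 0:
        raise ValueError("no per-run saving; cleanup never breaks even")
    return math.ceil(cleanup_cost / saving_per_run)


# Hypothetical baseline for a run on the messy side (not from the study).
base_in, base_out = 2_000_000, 20_000
# Hypothetical prices in dollars per million tokens (not a real price list).
price_in, price_out = 3.0, 15.0

messy_cost = run_cost(base_in, base_out, price_in, price_out)
clean_cost = run_cost(base_in * (1 + INPUT_CHANGE),
                      base_out * (1 + OUTPUT_CHANGE),
                      price_in, price_out)
saving = messy_cost - clean_cost
saving_pct = 100.0 * saving / messy_cost
runs_needed = break_even_runs(500.0, saving)  # hypothetical $500 cleanup cost

print(f"per-run saving: ${saving:.4f} ({saving_pct:.2f}%)")
print(f"runs to break even: {runs_needed}")
```

Because the hypothetical run is input-dominated, the blended saving sits close to the input-token reduction, which is the "nearly linear" relationship described above.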
Assessment
Claims (8)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| We introduce an evaluation protocol built around minimal pairs: repositories that match on architecture, dependencies, and external behaviour, but differ on static-analysis rule violations and cognitive complexity. (category: Other) | positive | high | evaluation protocol (minimal-pair control of repository cleanliness) | 0.8 |
| The pairs are constructed in both directions, by agent pipelines that either degrade a clean repository or clean a messy one. (category: Other) | positive | high | directional construction of repository pairs (degrade or clean) | 0.8 |
| We author 33 tasks across six such pairs, evaluated through hidden tests at the application's public surface. (category: Other) | positive | high | number of tasks and pairs used in evaluation | n=33; six pairs; 0.8 |
| Across 660 trials with Claude Code, code cleanliness does not change the agent's pass rate. (category: Developer Productivity) | null_result | high | pass rate (task success on hidden tests) | n=660; 0.48 |
| Agents working on cleaner code use 7 to 8% fewer tokens. (category: Organizational Efficiency) | positive | high | token usage (number of tokens consumed by agent pipelines) | n=660; 7 to 8% fewer tokens; 0.48 |
| Agents working on cleaner code reduce file revisitations by 34%. (category: Organizational Efficiency) | positive | high | file revisitations (number of times agents revisit files) | n=660; 34% reduction; 0.48 |
| Traditional maintainability principles remain highly relevant in the era of AI-driven development, shaping the computational cost and navigational efficiency of coding agents. (category: Developer Productivity) | positive | high | relevance of maintainability principles to agent computational cost and navigation | 0.48 |
| Code cleanliness joins model choice, harness, and prompting as a factor that materially affects agent behaviours. (category: Organizational Efficiency) | positive | high | factors materially affecting agent behaviour (operational footprint/navigation) | 0.48 |