The Commonplace

Cleaner code does not make autonomous coding agents more likely to pass tests, but it makes them notably more efficient: repositories with fewer static-analysis violations reduce an agent's token use by about 7–8% and slash file revisitations by roughly one-third, implying cleaner code lowers compute and navigational overhead even when correctness is unchanged.

Does Code Cleanliness Affect Coding Agents? A Controlled Minimal-Pair Study
Priyansh Trivedi, Olivier Schmitt · May 19, 2026
arXiv · quasi_experimental · medium evidence · 7/10 relevance
While code cleanliness does not change success rates for an autonomous coding agent, it meaningfully reduces computational cost and navigation friction—cutting token usage by 7–8% and file revisitations by 34%.

As autonomous coding agents see rapid adoption, their evaluation has primarily focused on task completion rates holding the target codebase fixed. This leaves a critical question unanswered: does the structural and stylistic quality, or “cleanliness”, of the underlying code affect an agent's ability to navigate and modify it? To isolate the effect of code cleanliness from agent capability, we introduce an evaluation protocol built around minimal pairs: repositories that match on architecture, dependencies, and external behaviour, but differ on static-analysis rule violations and cognitive complexity. The pairs are constructed in both directions, by agent pipelines that either degrade a clean repository or clean a messy one. We author 33 tasks across six such pairs, evaluated through hidden tests at the application's public surface. Across 660 trials with Claude Code, code cleanliness does not change the agent's pass rate. However, it substantially alters the agent's operational footprint: agents working on cleaner code use 7 to 8% fewer tokens and reduce file revisitations by 34%. Our findings suggest that traditional maintainability principles remain highly relevant in the era of AI-driven development, shaping the computational cost and navigational efficiency of coding agents. Code cleanliness joins model choice, harness, and prompting as a factor that materially affects agent behaviours.

Summary

Main Finding

Cleaning a codebase does not meaningfully change a coding agent’s success rate on tasks, but it materially reduces the agent’s operational footprint. In experiments with Claude Sonnet 4.6 across 33 agentic tasks and six minimal-pair repositories, cleaner code reduced token-equivalent metrics by ~7–8% and file revisitations by ~34%, while pass rate was effectively unchanged (−0.9 percentage points aggregated).

Key Points

  • Research question: Do structural/stylistic properties of code (“cleanliness”) affect agent correctness (RQ1) or operational footprint (RQ2), and does topology (single dense region vs. multi-module) matter (RQ3)?
  • Cleanliness proxy: SonarQube static-analysis rule violations and cognitive-complexity density.
  • Minimal-pair design: For each app, two behaviorally equivalent repositories were produced that differ mainly in cleanliness:
    • Slopify: degrades a clean repo by introducing SonarQube-flagged issues while preserving tests.
    • Vibeclean: cleans an unclean repo by fixing SonarQube-flagged issues while preserving tests.
  • Task design: 33 tasks across 6 pairs (13 cognitive-hotspot tasks, 14 multi-module tasks, 6 calibration tasks) that:
    • Route agents through regions with the largest cleanliness differences,
    • Describe required changes in externally observable terms (no internal names),
    • Validate via hidden tests at the application’s public surface.
  • Experimental protocol:
    • Agent: Claude Code (Claude Sonnet 4.6), default toolset, blind to repo side.
    • Trials: 10 runs per side per task → 660 trials; outlier filter removed ~9.7% of trials.
    • Metrics recorded: pass rate; input & output tokens; reasoning characters; conversation turns and pre-edit exploration measures; files read; file revisits; lines edited.
  • Aggregate quantitative results (micro-averaged across tasks):
    • Pass rate: −0.9 percentage points (no practical change).
    • Input tokens: −7.1%.
    • Output tokens: −8.5%.
    • Reasoning characters: −11.1%.
    • Conversation turns: −7.0%.
    • File revisitations: −33.8%.
    • Lines edited: −3.2% (small change).
  • Heterogeneity: Per-repo effects vary substantially (e.g., input-token reductions ranged from modest to very large), and individual repos showed small positive or negative pass-rate changes, but there was no consistent correctness advantage.
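
The trial count and the sign convention of the aggregate deltas above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, the sample values are made up, and it assumes that the messy side is the baseline and that "micro-averaged" means pooling trial values before taking the mean. The summary does not spell out either convention.

```python
from statistics import mean


def total_trials(tasks: int, sides: int, runs_per_side: int) -> int:
    """Number of trials implied by the protocol: tasks x sides x runs per side."""
    return tasks * sides * runs_per_side


def percent_change(clean_values, messy_values) -> float:
    """Percent change of the clean-side mean relative to the messy-side mean.

    Assumption: the messy side is the baseline, so a negative result means
    the agent used less of the metric on the clean side.
    """
    clean_mean = mean(clean_values)
    messy_mean = mean(messy_values)
    return 100.0 * (clean_mean - messy_mean) / messy_mean


# 33 tasks, 2 sides (clean and messy), 10 runs per side per task.
print(total_trials(33, 2, 10))  # 660

# Made-up per-trial token counts, used only to show the sign convention.
print(percent_change([93, 93], [100, 100]))
```

Under this convention, a reported figure such as −7.1% for input tokens means the clean side consumed less than the messy side.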

Data & Methods

  • Construction:
    • Six minimal pairs (3 primarily Java, 3 primarily Python), three public and three private (to reduce memorization risk).
    • Clean/messy sides verified to be behaviorally equivalent via tests.
  • Pipelines:
    • Slopify (3 phases: build, explore, transform) introduces plausible organic mess (inlining helpers, duplication, dead code) but rejects edits that break tests.
    • Vibeclean (2 phases: build, clean) mechanically resolves analyzer issues and may extract helpers, remove dead code, etc.
  • Task generation:
    • A multi-agent-assisted pipeline maps differences, proposes task outlines, and curates and converts them into a task, hidden tests, and a reference implementation; tasks that failed sanity checks were rewritten or dropped.
  • Evaluation:
    • Containerized sandboxes with bounded resources and identical environments except the source tree.
    • Metrics collected per-trial and micro-averaged across tasks. Outlier trials (>50% off per-task median) were removed before averaging.
  • Limitations called out by authors:
    • Single-model focus (Claude Sonnet 4.6); other models may react differently.
    • SonarQube is a proxy, not a full formalization of “cleanliness.”
    • Slopify and Vibeclean produce asymmetrical edits; the two directions are not exact inverses.
    • Task set and repo choices limit generalizability; private repos protect against memorization but reduce reproducibility.
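
The outlier rule described under Evaluation (drop trials more than 50% off the per-task median) could look roughly like the sketch below. Several details are assumptions rather than facts from the paper: the trial record layout, the `filter_outliers` name, the choice of `input_tokens` as the filtered metric, and computing the median over both sides of a task together.

```python
from collections import defaultdict
from statistics import median


def filter_outliers(trials, metric="input_tokens", tolerance=0.5):
    """Keep trials whose metric lies within `tolerance` of the per-task median.

    `trials` is a list of dicts with at least a "task" key and the metric key
    (a hypothetical layout). With tolerance=0.5, a trial is dropped when it
    deviates from its task's median by more than 50% of that median.
    """
    by_task = defaultdict(list)
    for trial in trials:
        by_task[trial["task"]].append(trial)

    kept = []
    for task_trials in by_task.values():
        task_median = median(t[metric] for t in task_trials)
        for t in task_trials:
            if abs(t[metric] - task_median) <= tolerance * task_median:
                kept.append(t)
    return kept


# Made-up example: the 300-token trial is far from the median of 110.
sample = [
    {"task": "a", "input_tokens": 100},
    {"task": "a", "input_tokens": 110},
    {"task": "a", "input_tokens": 300},
]
print(len(filter_outliers(sample)))  # 2
```

Whether the real filter was applied per metric, per side, or on a single reference metric is not stated in the summary, so the grouping here should be read as one plausible interpretation.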

Implications for AI Economics

  • Direct operational cost impact:
    • Token usage is a first-order driver of LLM run costs. A consistent ~7–8% reduction in token consumption per run should translate roughly linearly into cost savings per agent invocation, particularly because input tokens have dominated total token footprints in prior work.
    • Example framing: if input-dominated runs are common, a 7% token reduction reduces per-run token bills by roughly 7% (before accounting for provider pricing granularity, caching, or billing policy).
  • Latency and tooling costs:
    • Fewer file revisitations (~34% reduction) imply fewer tool calls, less I/O, and less back-and-forth, which can reduce wall-clock time and orchestration overhead. That influences engineering throughput and infrastructure costs (fewer API/tool calls, less compute time).
  • Scale multiplies small per-run savings:
    • For organizations running many agent tasks (CI, automated refactoring, large-scale code generation), modest per-run savings compound to meaningful monthly/annual OPEX reductions.
  • Investment decision tradeoffs:
    • Correctness unaffected: cleaning does not increase the probability an agent completes a task correctly, so cleanliness investments are not justified by higher agent correctness alone.
    • Efficiency-driven ROI: organizations should evaluate cleanliness investments primarily through an efficiency lens (token/latency/operational costs) and possibly developer productivity (less agent churn, clearer diffs).
    • Prioritize cleaning hotspots: since tasks routed through messy hotspots incur the observed overheads, focus cleanup efforts on regions frequently visited by agents (e.g., shared services, orchestrators, core libraries).
  • Tooling and governance:
    • Code-quality gates, linters, and automated cleaning (Vibeclean-style tools) can be an economic lever alongside model selection, prompting, and caching strategies.
    • Incorporate codebase cleanliness as a variable in cost-forecasting models for agent-driven dev workflows and MLOps budgets.
  • Research & policy implications:
    • Firms and benchmarkers should include codebase properties (cleanliness) when estimating operating costs for agentic workflows.
    • Further studies should quantify monetary savings under realistic pricing, evaluate other models and agent toolchains, and compare cleanup costs vs. expected savings to inform optimal investment levels.
  • Short, actionable takeaways:
    • If you run many automated agent tasks, cleaning the codebases agents touch is likely to lower your LLM token bill and reduce time per task—even if it won’t make agents more correct.
    • Begin by identifying agent-visited hotspots and adding gating/linting or automated fixes for those modules to maximize cost-effectiveness.
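
To make the cost argument above concrete, here is a minimal sketch that converts the reported token reductions (−7.1% input, −8.5% output) into a per-run saving. The token counts and per-million-token prices are hypothetical placeholders, not figures from the paper or from any provider, and the calculation ignores caching and billing granularity.

```python
def per_run_cost(input_tokens, output_tokens, price_in_per_mtok, price_out_per_mtok):
    """Cost of one run given token counts and prices per million tokens."""
    return (input_tokens * price_in_per_mtok
            + output_tokens * price_out_per_mtok) / 1_000_000


def cleanliness_saving(input_tokens, output_tokens, price_in, price_out,
                       input_reduction=0.071, output_reduction=0.085):
    """Return (absolute saving, fractional saving) per run on cleaner code.

    The default reductions are the aggregate deltas reported in the summary;
    everything else is supplied by the caller.
    """
    baseline = per_run_cost(input_tokens, output_tokens, price_in, price_out)
    cleaned = per_run_cost(input_tokens * (1 - input_reduction),
                           output_tokens * (1 - output_reduction),
                           price_in, price_out)
    saving = baseline - cleaned
    return saving, saving / baseline


# Hypothetical run: 1M input tokens, 50k output tokens, placeholder prices.
absolute, fraction = cleanliness_saving(1_000_000, 50_000, 3.0, 15.0)
print(f"saving per run: {absolute:.4f} ({fraction:.1%} of baseline)")
```

Because the fractional saving is a cost-weighted blend of the two reductions, it always falls between 7.1% and 8.5% under these assumptions, and it sits closer to 7.1% when input tokens dominate the bill.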


Assessment

Paper Type: quasi_experimental

Evidence Strength: medium. The minimal-pair design gives strong internal validity for isolating the effect of code cleanliness on agent behaviour, and the study runs a substantial number of trials (660) with hidden tests to avoid overfitting. However, evidence is limited to a single agent (Claude Code), a small set of repository architectures and 33 tasks, and uses engineered transformations rather than a broad sample of real-world codebases, which constrains external validity and the strength of causal generalization.

Methods Rigor: medium. The paper uses a careful controlled protocol (matched repositories, bidirectional manipulations, hidden test suites) and reports both correctness and operational metrics (tokens, file revisits). Rigor is tempered by reliance on one model/harness configuration, possible subjectivity in how 'cleanliness' and cognitive complexity are operationalized, and a modest set of repo pairs/tasks that limit robustness checks across languages, scales, and agent designs.

Sample: Six repository pairs constructed into minimal-pair contrasts (clean vs. messy) yielding 33 authored tasks, evaluated in both directions (clean->messy and messy->clean) across 660 trials using the Claude Code agent; repositories are matched on architecture, dependencies, and external behaviour but differ in static-analysis rule violations and cognitive complexity; outcomes measured via hidden tests at the application's public surface, plus token consumption and file-revisit counts.

Themes: productivity, human_ai_collab, adoption

Identification: Constructed minimal pairs of repositories that match on architecture, dependencies, and external behaviour but differ in static-analysis violations and cognitive complexity; pairs are created in both directions (degrade clean repos or clean messy ones) so that cleanliness is the manipulated factor; evaluation uses hidden tests at the public surface to measure agent performance and resource usage.

Generalizability:
  • Results come from a single LLM agent (Claude Code); other models may respond differently.
  • Only six repository architectures and 33 tasks, which may not represent the diversity of real-world codebases.
  • Cleanliness manipulations are engineered and may not capture all forms of real-world technical debt or messy legacy code.
  • Languages, frameworks, and enterprise-scale monoliths/microservices may behave differently (language bias if not diverse).
  • Agent pipeline/harness, prompting, and tool integrations vary across deployments and can affect outcomes.

Claims (8)

  • Claim: We introduce an evaluation protocol built around minimal pairs: repositories that match on architecture, dependencies, and external behaviour, but differ on static-analysis rule violations and cognitive complexity.
    • Other · Direction: positive · Confidence: high
    • Outcome: evaluation protocol (minimal-pair control of repository cleanliness)
    • Details: 0.8
  • Claim: The pairs are constructed in both directions, by agent pipelines that either degrade a clean repository or clean a messy one.
    • Other · Direction: positive · Confidence: high
    • Outcome: directional construction of repository pairs (degrade or clean)
    • Details: 0.8
  • Claim: We author 33 tasks across six such pairs, evaluated through hidden tests at the application's public surface.
    • Other · Direction: positive · Confidence: high
    • Outcome: number of tasks and pairs used in evaluation
    • Details: n=33; six pairs; 0.8
  • Claim: Across 660 trials with Claude Code, code cleanliness does not change the agent's pass rate.
    • Developer Productivity · Direction: null_result · Confidence: high
    • Outcome: pass rate (task success on hidden tests)
    • Details: n=660; 0.48
  • Claim: Agents working on cleaner code use 7 to 8% fewer tokens.
    • Organizational Efficiency · Direction: positive · Confidence: high
    • Outcome: token usage (number of tokens consumed by agent pipelines)
    • Details: n=660; 7 to 8% fewer tokens; 0.48
  • Claim: Agents working on cleaner code reduce file revisitations by 34%.
    • Organizational Efficiency · Direction: positive · Confidence: high
    • Outcome: file revisitations (number of times agents revisit files)
    • Details: n=660; 34% reduction; 0.48
  • Claim: Traditional maintainability principles remain highly relevant in the era of AI-driven development, shaping the computational cost and navigational efficiency of coding agents.
    • Developer Productivity · Direction: positive · Confidence: high
    • Outcome: relevance of maintainability principles to agent computational cost and navigation
    • Details: 0.48
  • Claim: Code cleanliness joins model choice, harness, and prompting as a factor that materially affects agent behaviours.
    • Organizational Efficiency · Direction: positive · Confidence: high
    • Outcome: factors materially affecting agent behaviour (operational footprint/navigation)
    • Details: 0.48

Notes