More-capable language models produce bloated, tightly coupled code that accrues architectural debt, and code volume reliably predicts this structural decay; functional correctness and detailed prompting do not stop the rot, implying AI-assisted software engineering must add architectural foresight to avoid mounting maintenance costs.

AI-Generated Smells: An Analysis of Code and Architecture in LLM and Agent-Driven Development

Yuecai Zhu, Nikolaos Tsantalis, Peter C. Rigby · May 04, 2026

arXiv · descriptive · medium evidence · 7/10 relevance

AI-generated code accumulates a machine-specific form of technical debt: as model capability increases, generated code becomes larger and more coupled, with code volume serving as a near-perfect predictor of structural degradation, and neither correctness nor prompt engineering prevents this decay.

The promise of Large Language Models in automated software engineering is often measured by functional correctness, overlooking the critical issue of long-term maintainability. This paper presents a systematic audit of technical debt in AI-generated software, revealing that AI does not eliminate flaws but rather introduces a distinct machine signature of defects. Our multi-scale analysis, spanning single-file algorithmic tasks and complex, agent-generated systems, identifies a fundamental Reasoning-Complexity Trade-off: as models become more capable, they generate increasingly bloated and coupled code. This architectural decay is so pronounced that we establish a Volume-Quality Inverse Law, where code volume is a near-perfect predictor of structural degradation. Crucially, we demonstrate that neither functional correctness nor detailed prompting mitigates this decay. These findings challenge the current paradigm of prompt-driven generation, reframing the central problem of AI-based software engineering from one of code generation to one of architectural complexity management. We conclude that future progress depends on equipping agents with explicit architectural foresight to ensure the software they build is not just functional, but also maintainable.

Summary

Main Finding

AI code-generation agents (both single LLMs and multi-agent systems) routinely produce functionally correct software that nevertheless accumulates a distinct form of technical debt, one that differs systematically from the debt found in human-written code. As model capability and system complexity increase, generated code becomes larger, more coupled, and architecturally degraded. The authors identify a Reasoning–Complexity trade-off and a near-deterministic "Volume–Quality Inverse Law": total lines of code (TLoC) is a near-perfect predictor of structural decay. Functional correctness and detailed prompting do not reliably prevent this architectural rot.
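One way to read "near-perfect predictor" is as a correlation coefficient close to 1 between repository size and smell count. The sketch below computes Pearson's r in plain Python. The function name `pearson_r` and the sample numbers are illustrative placeholders, not data from the paper, and the authors may use a different statistic.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Placeholder values for illustration only (NOT measurements from the paper).
tloc = [200, 450, 900, 1600, 3100]   # total lines of code per repository
smells = [3, 7, 15, 24, 49]          # detected smells per repository
r = pearson_r(tloc, smells)
print(f"r = {r:.3f}")
```

A value of r near 1 on real measurements would be consistent with the law as summarized here; correlation alone would not show that volume causes the decay.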

Key Points

Reasoning–Complexity Trade-off: More capable models (and agentic approaches) solve the underlying logic more reliably, but they tend to reach correctness by concentrating complexity (e.g., long methods) or by creating tightly coupled modules.
Volume–Quality Inverse Law: Code volume (TLoC) strongly predicts architectural degradation; bigger repositories produced by agents correlate with more and worse architectural smells.
Machine signature of technical debt: Agents show characteristic smells distinct from those in human-written code:
- Algorithmic tasks: Long Method, Temporal Fields (poor state encapsulation).
- Multi-file systems: God Class / Too Many Branches, Unstable Dependency (tight coupling), Redundant Implementations, Potential Improper API Usage.
Modular Mirage: Agents often create superficial modularity (files/packages) without semantic cohesion — the structure looks modular but components remain tightly coupled.
Prompting and correctness decoupled from maintainability: Few-shot / more detailed prompts and achieving runnable, correct code do not reliably reduce structural/architectural smells.
Taxonomy and longitudinal view: The paper offers a taxonomy of AI-specific smells and shows architectural rot increases with system scale over agent-produced development lifecycles.
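The code-level smells named above (Long Method, Too Many Branches) can be approximated with Python's standard `ast` module. This is a minimal sketch, not PyExamine's implementation: the function name `find_function_smells`, the smell labels, and both thresholds are assumptions chosen for illustration.

```python
import ast

# Thresholds are arbitrary illustrative defaults, not values from the paper or PyExamine.
MAX_FUNCTION_LINES = 50
MAX_BRANCHES = 12

# Node types counted as branching constructs in this sketch.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.Try, ast.IfExp)


def find_function_smells(source, max_lines=MAX_FUNCTION_LINES, max_branches=MAX_BRANCHES):
    """Return (function_name, smell, value) tuples for one Python source string."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        # Physical span of the function, from its def line to its last line.
        length = node.end_lineno - node.lineno + 1
        # Branches inside nested functions also count toward the outer function.
        branches = sum(isinstance(child, BRANCH_NODES) for child in ast.walk(node))
        if length > max_lines:
            findings.append((node.name, "long-method", length))
        if branches > max_branches:
            findings.append((node.name, "too-many-branches", branches))
    return findings


# Usage: a seven-line function checked against a deliberately low line limit.
sample = "def f(x):\n" + "".join(f"    x += {i}\n" for i in range(5)) + "    return x\n"
print(find_function_smells(sample, max_lines=3))
```

Counting physical lines and branch nodes is a coarse proxy; a production analyzer would typically also exclude docstrings and blank lines.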

Data & Methods

Two experiments:
Experiment I (algorithmic tasks)
- Dataset: 90 problems sampled from CodeContest.
- Models evaluated: Gemini 2.5 Pro, Llama 3.3:70b, deepseek-coder-v2:16b, qwen3-coder:30b, and qwen3-coder:480b.
- Prompting: Zero-shot baseline vs. structured few-shot prompt with coding specs and boilerplate.
- Baseline: Human-authored solutions (one per problem) analyzed similarly.
- Analysis tool: PyExamine static analyzer to detect code-level, structural, and architectural smells; aggregated counts and distributions per solution.
Experiment II (system-level, multi-agent)
- Framework: MetaGPT multi-agent orchestration (with Qwen-Coder 480B as the underlying LLM).
- Task: Generate full multi-file Python repositories from product requirements; prompts required OOP and modular structure.
- Analysis: Collected full repositories and applied PyExamine to detect higher-level architectural anomalies (e.g., cyclic dependencies, God objects).
- Replication: Generated repositories and prompts packaged for release (per paper).
Key empirical observations: systematic increase in smell density with TLoC and model capability; prompting specificity and functional correctness did not mitigate smell accumulation.
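To make one of the architectural anomalies from Experiment II concrete, the sketch below finds a cyclic dependency in a module import graph using depth-first search. It does not reproduce PyExamine's detection logic; the function name `find_import_cycle`, the graph representation, and the module names are hypothetical.

```python
def find_import_cycle(graph):
    """Return one dependency cycle as a list of module names, or None.

    `graph` maps each module name to the modules it imports.
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on current DFS path, fully explored
    color = {}
    stack = []

    def visit(module):
        color[module] = GRAY
        stack.append(module)
        for dep in graph.get(module, ()):
            state = color.get(dep, WHITE)
            if state == GRAY:
                # dep is already on the current path, so the path closes a cycle.
                return stack[stack.index(dep):] + [dep]
            if state == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        stack.pop()
        color[module] = BLACK
        return None

    for module in graph:
        if color.get(module, WHITE) == WHITE:
            cycle = visit(module)
            if cycle:
                return cycle
    return None


# Hypothetical import graph: "app" and "services" import each other.
imports = {
    "app": ["models", "services"],
    "services": ["models", "app"],
    "models": [],
}
print(find_import_cycle(imports))
```

The recursive search is adequate for small repositories; very deep import chains would need an iterative version to stay within Python's recursion limit.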

Implications for AI Economics

Hidden maintenance costs undermine simple productivity gains:
- Short-term gains from faster code generation can be offset by increased long-term maintenance effort and technical debt remediation. Firms that evaluate AI benefits only on delivery speed or functional correctness risk underestimating total cost of ownership.
Reallocation of labor and changing skills premium:
- Demand is likely to shift from routine coding toward system architects, maintainers, and engineering managers capable of diagnosing and reversing agent-induced architectural decay. This may raise wages for architectural skills and refactoring expertise.
New markets and product opportunities:
- There is commercial value in tools and services that (a) detect AI-specific architectural smells, (b) automate architectural foresight and high-level planning, and (c) perform large-scale refactorings. Investors should prioritize startups and R&D that target architectural management for agent-generated code.
Investment and R&D priorities:
- Research that equips agents with explicit architectural foresight (planning, constraints, modularity enforcement, long-term state modeling) is crucial. Funding and corporate R&D should tilt toward agents that optimize maintainability metrics, not just correctness/performance.
Risk management, procurement, and contracting:
- Procurement processes and contracting for AI-assisted development should incorporate measures beyond functionality (e.g., TLoC budgets, smell thresholds, SLAs for maintainability, scheduled refactoring costs).
Competitive dynamics and strategic positioning:
- Companies that integrate architectural quality controls and specialized maintainability workflows will have durable advantages. Conversely, rapid proliferation of agent-generated but poorly maintainable code could raise systemic operational risks across software-dependent industries.
Measurement and valuation challenges:
- Traditional productivity metrics (lines of delivered code, time-to-first-working-version) become misleading when architectural decay is not accounted for. Firms and economists should develop better metrics for “sustainable productivity” that include technical debt discounting.
Policy and regulatory considerations:
- If agent-driven production materially increases systemic fragility (hard-to-maintain critical software), regulators or industry bodies may require standards for maintainability in certain domains (finance, healthcare, infrastructure), affecting compliance costs and adoption rates.

Actionable near-term recommendations for economic actors:
- For firms: include technical-debt forecasts and TLoC-based risk checks in ROI models for agentic development pilots.
- For investors/R&D managers: prioritize tooling and agent designs that explicitly model long-term architecture and enforce cohesion/coupling constraints.
- For teams adopting agents: pair agents with human architects and invest in continuous architectural QA (smell detection, scheduled refactors) rather than treating generated code as final deliverables.
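A TLoC-based risk check of the kind recommended above could run as a simple gate in continuous integration. The sketch below is one possible shape under stated assumptions: the class `MaintainabilityBudget`, the function `check_budget`, and the default limits are all hypothetical and would need calibrating against a team's own codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MaintainabilityBudget:
    """Hypothetical budget terms; field names and defaults are illustrative."""
    max_tloc: int = 5000
    max_smells_per_kloc: float = 10.0


def check_budget(tloc, smell_count, budget=MaintainabilityBudget()):
    """Return a list of violation messages; an empty list means the gate passes."""
    if tloc <= 0:
        raise ValueError("tloc must be positive")
    violations = []
    density = smell_count / (tloc / 1000)  # smells per thousand lines of code
    if tloc > budget.max_tloc:
        violations.append(f"TLoC {tloc} exceeds budget {budget.max_tloc}")
    if density > budget.max_smells_per_kloc:
        violations.append(
            f"smell density {density:.1f}/KLoC exceeds {budget.max_smells_per_kloc}"
        )
    return violations


# Usage: a repository that is over budget on both size and smell density.
for message in check_budget(tloc=6000, smell_count=120):
    print(message)
```

Reporting density alongside raw size keeps the gate from penalizing a large but clean repository solely for its volume.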

Summary takeaway: LLM agents materially change how code is produced but do not (yet) eliminate software engineering costs. The economic value of agentic coding will depend less on raw generation capacity and more on managing the architectural complexity and long-term maintainability that agents systematically introduce.

Assessment

Paper Type: descriptive
Evidence Strength: medium. The paper presents a systematic, multi-scale empirical audit showing consistent correlations (a near-perfect predictive relationship between code volume and structural degradation) across tasks and agent systems, which provides substantive descriptive evidence; however, it does not establish causal mechanisms, relies on observational comparisons across models/tasks, and may be sensitive to dataset, metric, and model selection.
Methods Rigor: medium. The study is methodical in scope (single-file algorithmic tasks to complex agent systems) and introduces quantitative metrics (volume, coupling, structural degradation) and cross-model comparisons, but the description lacks details on sampling, model and prompt selection, statistical controls, robustness checks, and potential confounders such as task difficulty or style preferences, limiting reproducibility and internal validity.
Sample: A multi-scale corpus of AI-generated software including single-file algorithmic task solutions and larger multi-file/agent-produced systems generated by contemporary large language models and agent frameworks; exact models, counts, programming languages, and dataset sizes are not specified in the provided summary.
Themes: productivity, human_ai_collab, adoption
Generalizability:
- May depend on the specific LLM families, versions, or prompting styles evaluated and not generalize to other or future models.
- Tasks considered (algorithmic single-file and agent-generated systems) may not represent the full diversity of real-world software projects and domains.
- Metrics of structural degradation and code volume may not fully capture real-world maintainability or developer workflows (human review, refactoring, CI/CD practices).
- Results might differ under human-in-the-loop development, team-based codebases, or when using specialized code-focused models or toolchains.

Claims (7)

Claim	Category	Direction	Confidence	Outcome	Details
The promise of Large Language Models in automated software engineering is often measured by functional correctness, overlooking the critical issue of long term maintainability.	Output Quality	negative	high	emphasis of evaluation metrics (functional correctness vs maintainability)	0.03
AI does not eliminate software flaws but rather introduces a distinct 'machine signature' of defects in generated code.	Output Quality	negative	high	presence and patterning of defects in AI-generated code (machine signature of defects)	0.18
There exists a fundamental Reasoning-Complexity Trade-off: as models become more capable, they generate increasingly bloated and coupled code.	Output Quality	negative	high	code volume and coupling (architectural complexity)	0.18
We establish a Volume-Quality Inverse Law: code volume is a near perfect predictor of structural degradation.	Output Quality	negative	high	structural degradation (predicted by code volume)	near perfect predictor 0.18
Neither functional correctness nor detailed prompting mitigates this architectural decay in AI-generated code.	Output Quality	null_result	high	degree of architectural decay (despite functional correctness and prompt engineering)	0.18
Prompt-driven generation (even with detailed prompting) fails to address the central problem of architectural complexity management in AI-based software engineering.	Output Quality	null_result	high	effectiveness of prompting on reducing architectural complexity	0.18
Future progress in AI-based software engineering depends on equipping agents with explicit architectural foresight so generated software is maintainable, not just functional.	Organizational Efficiency	positive	medium	need for architectural foresight in agent design to improve maintainability	0.02