Large language models produce code whose readability is on par with human solutions overall, yet they show consistent, distinct readability weaknesses and respond only modestly to prompt tweaks; function signatures, constraints and style descriptions are the most influential prompt factors.
As Large Language Models (LLMs) transform software development, the functional quality of generated code has become a central focus, leaving readability, one of the critical non-functional attributes, understudied. Given that LLM-generated code still needs human review before adoption, it is important to understand its readability, especially compared with human-written code, and the role of prompt design in shaping it. We therefore conduct a systematic investigation into the readability of LLM-generated code. To quantify code readability, we establish a comprehensive readability model that synthesizes textual, structural, program, and visual features of code. Based on the model, we evaluate the readability of code generated by mainstream LLMs under 5,869 scenarios extracted from large code bases including World of Code (WoC) and LeetCode. We find that current LLMs produce code with overall readability comparable to human-written code, but with distinct readability issue patterns. We further examine how different prompt dimensions affect the readability of LLM-generated code, and find that function signatures, constraints and style descriptions emerge as the most influential factors, while the overall impact of prompt design remains limited. Our findings indicate that, on the one hand, LLM-generated code is at least comparable to human-written code in readability, validating its potential for systematic integration into software workflows from a non-functional perspective; on the other hand, distinct readability issue patterns and limited effectiveness of prompt engineering reveal a latent technical debt, highlighting the need for future research to improve the readability of LLM-generated code and thus ensure long-term maintainability.
Summary
Main Finding
Across 5,869 real-world prompts from World of Code and LeetCode, LLM-generated Python code has overall readability scores at least comparable to human-written code. However, LLM outputs exhibit distinct, recurring readability issue patterns (e.g., unnecessarily complex structures, low-information comments, unknown or opaque API usage) that create a form of “hidden technical debt.” Single-turn prompt engineering can influence readability, with function signatures, constraints, and style descriptions as the most impactful dimensions, but the total effect of prompting is limited and insufficient to eliminate the readability issues.
Key Points
- Dataset and scale
- 5,869 prompt–human-implementation pairs: 3,000 from World of Code (WoC) and 2,869 from LeetCode.
- Temporal filter: human baselines taken from code written before 2022 to reduce contamination by LLM-assisted code.
- Readability model
- A comprehensive, quantitative readability assessment integrating textual, structural, program, and visual features (building on prior work: Buse & Weimer, Posnett, Scalabrino, etc.).
- LLMs evaluated
- GPT-4o, Grok-3, Claude-3.7, DeepSeek-v3, Llama 3.1; Claude-3.7 performed best on readability for the controlled prompt experiments.
- Empirical results
- Overall readability: LLM-generated code ≳ human-written code on aggregate scores.
- Distinct issue patterns: LLM outputs more likely to show certain systematic problems (e.g., gratuitous control-flow complexity, low-value comments that do not aid comprehension, opaque use of APIs or non-idiomatic libraries).
- Thematic analysis
- Manual comparative labeling on sampled data: two annotators read pairs and labeled weaker dimensions and issue patterns.
- Sampling: 500 WoC and 500 LeetCode pairs used in thematic analysis; within each set LLMs outperformed humans in most cases (WoC: 405 LLM>human, 95 human>LLM; LeetCode: 328 LLM>human, 172 human>LLM).
- Inter-rater reliability: Cohen’s kappa 0.87 for dimensionality assessment, 0.81 for issue-pattern identification.
- Prompt engineering (controlled experiments)
- Controlled prompt set (set B): 328 base tasks (164 HumanEval + 164 MBPP) expanded into 16 prompt variants each (5,248 operational prompt vectors).
- Prompt dimensions considered: style, function signature, IO contract, few-shot examples, task category, persona, constraints.
- Findings: function signatures, constraints, and style descriptions most influence readability, but aggregate improvements are modest—prompting alone is insufficient for reliably removing readability defects.
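The counts reported above can be cross-checked with a few lines of arithmetic, and the agreement statistic can be illustrated on toy data. The sketch below is a minimal illustration, not the authors' analysis code: the count constants are copied from this summary, while `llm_win_share`, `cohen_kappa`, and the two toy label lists are hypothetical names and data introduced only for the example.

```python
from collections import Counter

# Counts copied from the summary above.
WOC_PAIRS, LEETCODE_PAIRS = 3000, 2869
BASE_TASKS = 164 + 164  # HumanEval + MBPP
VARIANTS_PER_TASK = 16
# (LLM judged more readable, human judged more readable) per 500-pair sample.
THEMATIC_WINS = {"WoC": (405, 95), "LeetCode": (328, 172)}


def llm_win_share(llm_wins, human_wins):
    """Fraction of sampled pairs in which the LLM version was preferred."""
    return llm_wins / (llm_wins + human_wins)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


total_pairs = WOC_PAIRS + LEETCODE_PAIRS            # 5869
prompt_vectors = BASE_TASKS * VARIANTS_PER_TASK     # 5248
win_shares = {name: llm_win_share(*counts) for name, counts in THEMATIC_WINS.items()}

# TOY labels (hypothetical, not from the study): which side each annotator preferred.
annotator_1 = ["llm", "llm", "human", "human"]
annotator_2 = ["llm", "human", "human", "human"]
toy_kappa = cohen_kappa(annotator_1, annotator_2)   # 0.5 on this toy data
```

The totals reproduce the 5,869 pairs and 5,248 prompt vectors stated above, and the win shares come out at 81% (WoC) and 65.6% (LeetCode). The toy kappa only shows how the statistic is computed; the reported values of 0.87 and 0.81 come from the study's own annotations, which are not reproduced here.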
Data & Methods
- Data sources
- World of Code (WoC) vU (Oct 2021) — function-level extraction from Python files; docstrings converted into prompts; function bodies used as human baseline (3,000 pairs).
- LeetCode — problem descriptions and pre-2022 solutions in Python; problem text as prompt and core function implementation as baseline (2,869 pairs).
- Readability metric
- Constructed a unified readability model combining: textual/semantic cues, structural features (nesting, line length, indentation), program features (control-flow, API usage), and visual features.
- Grounded in prior validated feature sets (Buse & Weimer, Posnett, Scalabrino, Dorn).
- Experimental design
- RQ1 & RQ2: prompt set A (the 5,869 real-world prompts); generate code with multiple LLMs; compare readability scores and perform qualitative thematic analysis.
- RQ3: prompt set B (controlled): 328 tasks × 16 prompt variants; used Claude-3.7 for the large controlled sweep; vectorized prompt features recorded.
- Analysis
- Quantitative: readability score comparisons across LLM outputs and human baselines; statistical comparison across prompt variants.
- Qualitative: manual comparative annotation to identify readability-deteriorating dimensions and synthesize common issue patterns (nine patterns documented; annotators reconciled discrepancies).
- Triangulation across sources (WoC and LeetCode).
- Key methodological safeguards
- Pre-2022 baselines to reduce LLM contamination; cross-validation of docstring→prompt transformations; inter-annotator agreement reported.
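To make the structural feature family (nesting, line length, indentation) more concrete, here is a minimal sketch of how such signals could be extracted from a Python function. It is an illustration under stated assumptions, not the paper's readability model: the feature names, the choice of which AST nodes count as nesting, and the `structural_features` helper are all hypothetical, and the textual, program, and visual features are not covered.

```python
import ast

# Assumption: these statement types open a nested block for counting purposes.
_BLOCK_NODES = (ast.If, ast.For, ast.While, ast.Try, ast.With)


def max_nesting_depth(node, depth=0):
    """Deepest level of nested control-flow blocks below `node`.

    Note: an `elif` is parsed as an `If` inside `orelse`, so long
    elif chains count as deeper nesting in this simple sketch.
    """
    deepest = depth
    for child in ast.iter_child_nodes(node):
        child_depth = depth + 1 if isinstance(child, _BLOCK_NODES) else depth
        deepest = max(deepest, max_nesting_depth(child, child_depth))
    return deepest


def structural_features(source):
    """Return a few simple structural readability signals for `source`."""
    lines = [ln for ln in source.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("source contains no non-blank lines")
    lengths = [len(ln) for ln in lines]
    indents = [len(ln) - len(ln.lstrip(" ")) for ln in lines]
    comment_lines = sum(ln.lstrip().startswith("#") for ln in lines)
    return {
        "avg_line_length": sum(lengths) / len(lines),
        "max_line_length": max(lengths),
        "max_indent": max(indents),
        "comment_ratio": comment_lines / len(lines),
        "max_nesting_depth": max_nesting_depth(ast.parse(source)),
    }


# Hypothetical sample function used only to exercise the helper.
SAMPLE = '''def first_even(values):
    # Return the first even value, or None.
    for v in values:
        if v % 2 == 0:
            return v
    return None
'''

features = structural_features(SAMPLE)
```

On the sample, the `for` and the `if` inside it give a nesting depth of 2 and the deepest line is indented 12 spaces. A real model would combine many more features and calibrate them against human judgments, as the feature sets cited above do.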
Implications for AI Economics
- Productivity vs. hidden costs
- LLMs can produce code that is, on average, as readable as human code, implying potential near-term gains in developer productivity and reduced time-to-first-draft.
- Distinct readability issues that LLMs systematically introduce create hidden technical debt: extra human review, refactoring, or longer onboarding/maintenance time will be required, raising downstream labor and operational costs. Economic gains may therefore be smaller than naive productivity estimates that ignore maintainability.
- Labor demand and task composition
- Short-term: developers may shift away from boilerplate implementation toward oversight, review, specification, and maintenance tasks—demanding higher-skilled labor for auditing and refactoring.
- Long-term: if readability defects persist, market demand may grow for specialized code-review services, tools, or human-in-the-loop roles that manage LLM outputs.
- Cost-effectiveness of interventions
- Prompt engineering yields modest improvements; therefore, firms should weigh investments in prompt engineering vs. alternatives (model fine-tuning for readability, integrated linting/refactoring tools, automated readability-aware post-processing).
- Economic decisions should consider the marginal cost per unit readability improvement across options (prompts, fine-tuning, toolchains).
- Vendor incentives and product market
- Readability (a non-functional quality) is a measurable buyer concern; purchasers and enterprise customers may demand readability guarantees, audit tools, or contractual SLA elements about maintainability—this can shift vendor priorities toward optimizing for readability.
- A market for readability-specialized model fine-tuning, plugins, and third-party auditing services is likely to expand.
- Measurement & procurement
- Procurement frameworks and ROI models should include non-functional metrics (readability, maintainability) and expected lifecycle costs, not just time-to-delivery or unit-cost of generated code.
- Public-sector procurement or regulated industries should consider certification or standardized readability benchmarks to manage systemic risk from AI-generated code.
- Regulatory and policy considerations
- If LLMs introduce systematic, reproducible readability issues that increase maintenance burdens, regulators may require documentation, provenance, or minimum maintainability standards for AI-assisted code used in critical systems.
- Research & macroeconomic priorities
- To properly value LLMs in economic models of software production, future work should quantify the net productivity effect: balance reductions in authoring time against increased auditing/maintenance costs due to readability defects.
- Empirical estimates of lifecycle cost (initial generation + review + maintenance) across different degrees of LLM assistance would improve cost–benefit analyses for adoption decisions.
- Practical recommendations for firms
- Treat LLM-generated code as requiring mandatory readability audits and refactoring budgets in project planning.
- Prioritize investments in automated readability-check tools, model fine-tuning targeted at idiomatic generation, and standards/linters to reduce hidden technical debt.
- Use prompt design where low-cost and fast, but do not rely on prompting alone to ensure maintainability—combine with post-hoc verification/refactoring pipelines.
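The trade-off between faster authoring and added review and maintenance effort can be written down as a small lifecycle cost comparison. The sketch below is only an illustration: the `LifecycleCost` class, both helper functions, and every number in it are hypothetical placeholders chosen for the example, not estimates from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LifecycleCost:
    """Developer hours spent on one unit of work over its lifetime."""
    authoring_hours: float
    review_hours: float
    maintenance_hours: float

    def total(self) -> float:
        return self.authoring_hours + self.review_hours + self.maintenance_hours


def net_saving(human: LifecycleCost, assisted: LifecycleCost) -> float:
    """Lifecycle hours saved by the assisted workflow (negative means a loss)."""
    return human.total() - assisted.total()


def cost_per_readability_point(intervention_cost: float, readability_gain: float) -> float:
    """Marginal cost of one unit of readability improvement for an intervention."""
    if readability_gain <= 0:
        return float("inf")  # the intervention buys no improvement
    return intervention_cost / readability_gain


# HYPOTHETICAL placeholder values, in hours; not measured data.
human_only = LifecycleCost(authoring_hours=10.0, review_hours=2.0, maintenance_hours=8.0)
llm_assisted = LifecycleCost(authoring_hours=3.0, review_hours=4.0, maintenance_hours=10.0)

# Naive estimate looks only at authoring time; lifecycle estimate includes
# the extra review and maintenance attributed to readability defects.
naive_saving = human_only.authoring_hours - llm_assisted.authoring_hours
lifecycle_saving = net_saving(human_only, llm_assisted)
```

With these placeholder inputs the naive saving is 7 hours while the lifecycle saving is 3 hours, which is the kind of gap the points above warn about. The same `cost_per_readability_point` helper can be applied to prompting, fine-tuning, or tooling once a firm has its own cost and readability-gain estimates.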
Assessment
Claims (7)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| We establish a comprehensive readability model that synthesizes textual, structural, program, and visual features of code. [Output Quality] | positive | high | code_readability (measured via the proposed readability model) | 0.18 |
| We evaluate the readability of code generated by mainstream LLMs under 5,869 scenarios extracted from large code bases including World of Code (WoC) and LeetCode. [Other] | positive | high | coverage of evaluation / dataset size for readability assessment | n=5869; 0.3 |
| Current LLMs produce code with overall readability comparable to human-written code. [Output Quality] | null_result | high | code_readability (overall/readability score) | n=5869; 0.18 |
| LLM-generated code displays distinct readability issue patterns compared to human-written code. [Output Quality] | negative | high | readability_issue_patterns (feature-level readability problems) | n=5869; 0.18 |
| Function signatures, constraints and style descriptions emerge as the most influential prompt dimensions affecting the readability of LLM-generated code. [Output Quality] | positive | high | impact_of_prompt_dimensions_on_readability | n=5869; 0.18 |
| The overall impact of prompt design on readability remains limited. [Output Quality] | null_result | high | overall_effect_of_prompt_design_on_readability | n=5869; 0.18 |
| Distinct readability issue patterns and limited effectiveness of prompt engineering reveal a latent technical debt in LLM-generated code that could affect long-term maintainability. [Other] | negative | high | maintainability_risk / technical_debt_inferred_from_readability | n=5869; 0.03 |