Large language models produce code whose readability is on par with human solutions overall, yet they show consistent, distinct readability weaknesses and respond only modestly to prompt tweaks; function signatures, constraints and style descriptions are the most influential prompt factors.

The Readability Spectrum: Patterns, Issues, and Prompt Effects in LLM-Generated Code

Hengzhi Ye, Fengyuan Ran, Weiwei Xu, Minghui Zhou · May 13, 2026

arXiv · descriptive · medium evidence · 7/10 relevance

Across 5,869 coding scenarios, mainstream LLMs produce code whose overall readability measures are comparable to human-written solutions, but they exhibit distinct patterns of readability issues and only limited improvement from prompt engineering.

As Large Language Models (LLMs) are transforming software development, the functional quality of generated code has become a central focus, leaving readability, one of the critical non-functional attributes, understudied. Given that LLM-generated code still needs human review before adoption, it is important to understand its readability, especially compared with human-written code, and the role of prompt design in shaping it. We therefore set out to conduct a systematic investigation into the code readability of LLM-generated code. To systematically quantify code readability, we establish a comprehensive readability model that synthesizes textual, structural, program, and visual features of code. Based on the model, we evaluate the readability of code generated by the mainstream LLMs under 5,869 scenarios extracted from large code bases including World of Code (WoC) and LeetCode. We find that current LLMs produce code with overall readability comparable to human-written code, but that it displays distinct readability issue patterns. We further examine how different prompt dimensions affect the readability of LLM-generated code, and find that function signatures, constraints and style descriptions emerge as the most influential factors, while the overall impact of prompt design remains limited. Our findings indicate that, on the one hand, LLM-generated code is at least comparable to human-written code in readability, validating its potential for systematic integration into software workflows from a non-functional perspective; on the other hand, distinct readability issue patterns and limited effectiveness of prompt engineering reveal a latent technical debt, highlighting the need for future research to improve the readability of LLM-generated code and thus ensure long-term maintainability.

Summary

Main Finding

LLM-generated Python code has overall readability scores at least comparable to human-written code (5,869 real-world prompts across World of Code and LeetCode). However, LLM outputs exhibit distinct, recurring readability issue patterns (e.g., unnecessarily complex structures, low-information comments, unknown/opaque API usage) that create a form of “hidden technical debt.” Single-turn prompt engineering can influence readability, with function signatures, constraints, and style descriptions as the most impactful dimensions, but the total effect of prompting is limited and insufficient to eliminate the readability issues.

Key Points

Dataset and scale
- 5,869 prompt–human-implementation pairs: 3,000 from World of Code (WoC) and 2,869 from LeetCode.
- Temporal filter: human baselines taken from code written before 2022 to reduce contamination by LLM-assisted code.
Readability model
- A comprehensive, quantitative readability assessment integrating textual, structural, program, and visual features (building on prior work: Buse & Weimer, Posnett, Scalabrino, etc.).
LLMs evaluated
- GPT-4o, Grok-3, Claude-3.7, DeepSeek-v3, Llama 3.1; Claude-3.7 performed best on readability for the controlled prompt experiments.
Empirical results
- Overall readability: LLM-generated code ≳ human-written code on aggregate scores.
- Distinct issue patterns: LLM outputs more likely to show certain systematic problems (e.g., gratuitous control-flow complexity, low-value comments that do not aid comprehension, opaque use of APIs or non-idiomatic libraries).
Thematic analysis
- Manual comparative labeling on sampled data: two annotators read pairs and labeled weaker dimensions and issue patterns.
- Sampling: 500 WoC and 500 LeetCode pairs used in thematic analysis; within each set LLMs outperformed humans in most pairs (WoC: 405 LLM>human, 95 human>LLM; LeetCode: 328 LLM>human, 172 human>LLM).
- Inter-rater reliability: Cohen’s kappa 0.87 for dimensionality assessment, 0.81 for issue-pattern identification.
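The agreement statistic reported above can be illustrated with a minimal sketch of Cohen's kappa for two annotators, using the standard definition kappa = (p_o - p_e) / (1 - p_e). The function name and the toy labels below are assumptions for illustration only; they are not the paper's annotation data or tooling.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Observed agreement: share of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each annotator's marginal label rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)

    if expected == 1.0:  # both annotators used one identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical toy labels: which side of a pair each annotator judged weaker.
annotator_1 = ["llm", "llm", "human", "human"]
annotator_2 = ["llm", "llm", "human", "llm"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Values near 1 indicate agreement well beyond chance, which is how the reported 0.87 and 0.81 are usually read.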
Prompt engineering (controlled experiments)
- Controlled prompt set (set B): 328 base tasks (164 HumanEval + 164 MBPP) expanded into 16 prompt variants each (5,248 operational prompt vectors).
- Prompt dimensions considered: style, function signature, IO contract, few-shot examples, task category, persona, constraints.
- Findings: function signatures, constraints, and style descriptions most influence readability, but aggregate improvements are modest—prompting alone is insufficient for reliably removing readability defects.
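One way to picture a "prompt vector" is as a record of which optional dimensions a prompt variant includes. The sketch below is a hypothetical illustration covering only the three most influential dimensions named above; the class, field names, and rendering format are assumptions, and it does not reproduce how the paper actually constructed its 16 variants per task.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PromptVariant:
    """Hypothetical prompt variant: a task plus optional prompt dimensions."""
    task: str
    signature: Optional[str] = None
    constraints: Optional[str] = None
    style: Optional[str] = None

    def vector(self) -> tuple:
        # 1 where a dimension is present, 0 where it is omitted.
        return tuple(
            int(value is not None)
            for value in (self.signature, self.constraints, self.style)
        )

    def render(self) -> str:
        # Assemble the prompt text from the dimensions that are present.
        parts = [self.task]
        if self.signature is not None:
            parts.append(f"Function signature: {self.signature}")
        if self.constraints is not None:
            parts.append(f"Constraints: {self.constraints}")
        if self.style is not None:
            parts.append(f"Style: {self.style}")
        return "\n".join(parts)


variant = PromptVariant(
    task="Return the sum of the positive numbers in a list.",
    signature="def sum_positive(values: list[int]) -> int",
    style="Follow PEP 8 and use descriptive names.",
)

# Size of the controlled prompt set reported above: 328 tasks x 16 variants.
total_prompts = 328 * 16
```

Recording the vector alongside each generated solution is what makes it possible to compare readability scores across variants that differ in a single dimension.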

Data & Methods

Data sources
- World of Code (WoC) vU (Oct 2021) — function-level extraction from Python files; docstrings converted into prompts; function bodies used as human baseline (3,000 pairs).
- LeetCode — problem descriptions and pre-2022 solutions in Python; problem text as prompt and core function implementation as baseline (2,869 pairs).
Readability metric
- Constructed a unified readability model combining: textual/semantic cues, structural features (nesting, line length, indentation), program features (control-flow, API usage), and visual features.
- Grounded in prior validated feature sets (Buse & Weimer, Posnett, Scalabrino, Dorn).
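To make the feature families above concrete, here is a minimal sketch that extracts a handful of structural and textual signals (line length, comment lines, nesting depth) from Python source using only the standard library. The feature names, the set of node types counted as nesting, and the sample function are assumptions for illustration; this is not the paper's readability model, which combines many more features into a single score.

```python
import ast
import io
import tokenize

# Block statements counted as one nesting level each (an illustrative choice).
NESTING_NODES = (ast.If, ast.For, ast.While, ast.With, ast.Try)


def max_nesting_depth(node: ast.AST, depth: int = 0) -> int:
    """Return the deepest level of nested control-flow blocks under node."""
    deepest = depth
    for child in ast.iter_child_nodes(node):
        child_depth = depth + 1 if isinstance(child, NESTING_NODES) else depth
        deepest = max(deepest, max_nesting_depth(child, child_depth))
    return deepest


def readability_features(source: str) -> dict:
    """Compute a few hypothetical structural/textual features of Python code."""
    lines = [ln for ln in source.splitlines() if ln.strip()]
    lengths = [len(ln) for ln in lines]

    # Rows containing a comment token, found with the standard tokenizer.
    comment_rows = {
        tok.start[0]
        for tok in tokenize.generate_tokens(io.StringIO(source).readline)
        if tok.type == tokenize.COMMENT
    }

    return {
        "non_blank_lines": len(lines),
        "avg_line_length": sum(lengths) / len(lines) if lines else 0.0,
        "max_line_length": max(lengths, default=0),
        "comment_lines": len(comment_rows),
        "max_nesting_depth": max_nesting_depth(ast.parse(source)),
    }


SAMPLE = '''def total(xs):
    # sum positive values
    acc = 0
    for x in xs:
        if x > 0:
            acc += x
    return acc
'''

features = readability_features(SAMPLE)
```

Running the same extractor over an LLM-generated solution and its human baseline yields comparable feature vectors, which is the kind of paired comparison the study performs at scale.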
Experimental design
- RQ1 & RQ2: prompt set A (the 5,869 real-world prompts); generate code with multiple LLMs; compare readability scores and perform qualitative thematic analysis.
- RQ3: prompt set B (controlled): 328 tasks × 16 prompt variants; used Claude-3.7 for the large controlled sweep; vectorized prompt features recorded.
Analysis
- Quantitative: readability score comparisons across LLM outputs and human baselines; statistical comparison across prompt variants.
- Qualitative: manual comparative annotation to identify readability-deteriorating dimensions and synthesize common issue patterns (nine patterns documented; annotators reconciled discrepancies).
- Triangulation across sources (WoC and LeetCode).
Key methodological safeguards
- Pre-2022 baselines to reduce LLM contamination; cross-validation of docstring→prompt transformations; inter-annotator agreement reported.

Implications for AI Economics

Productivity vs. hidden costs
- LLMs can produce code that is, on average, as readable as human code, implying potential near-term gains in developer productivity and reduced time-to-first-draft.
- Distinct readability issues that LLMs systematically introduce create hidden technical debt: extra human review, refactoring, or longer onboarding/maintenance time will be required, raising downstream labor and operational costs. Economic gains may therefore be smaller than naive productivity estimates that ignore maintainability.
Labor demand and task composition
- Short-term: developers may shift away from boilerplate implementation toward oversight, review, specification, and maintenance tasks—demanding higher-skilled labor for auditing and refactoring.
- Long-term: if readability defects persist, market demand may grow for specialized code-review services, tools, or human-in-the-loop roles that manage LLM outputs.
Cost-effectiveness of interventions
- Prompt engineering yields modest improvements; therefore, firms should weigh investments in prompt engineering vs. alternatives (model fine-tuning for readability, integrated linting/refactoring tools, automated readability-aware post-processing).
- Economic decisions should consider the marginal cost per unit readability improvement across options (prompts, fine-tuning, toolchains).
Vendor incentives and product market
- Readability (a non-functional quality) is a measurable buyer concern; purchasers and enterprise customers may demand readability guarantees, audit tools, or contractual SLA elements about maintainability—this can shift vendor priorities toward optimizing for readability.
- A market for readability-specialized model fine-tuning, plugins, and third-party auditing services is likely to expand.
Measurement & procurement
- Procurement frameworks and ROI models should include non-functional metrics (readability, maintainability) and expected lifecycle costs, not just time-to-delivery or unit-cost of generated code.
- Public-sector procurement or regulated industries should consider certification or standardized readability benchmarks to manage systemic risk from AI-generated code.
Regulatory and policy considerations
- If LLMs introduce systematic, reproducible readability issues that increase maintenance burdens, regulators may require documentation, provenance, or minimum maintainability standards for AI-assisted code used in critical systems.
Research & macroeconomic priorities
- To properly value LLMs in economic models of software production, future work should quantify the net productivity effect: balance reductions in authoring time against increased auditing/maintenance costs due to readability defects.
- Empirical estimates of lifecycle cost (initial generation + review + maintenance) across different degrees of LLM assistance would improve cost–benefit analyses for adoption decisions.
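The lifecycle accounting described above can be sketched as a small calculation: total cost is authoring plus review plus maintenance, and the net effect of LLM assistance is the difference between the two totals. Every number below is a made-up placeholder chosen only to show the arithmetic; none comes from the paper, and the function names are hypothetical.

```python
def lifecycle_hours(authoring: float, review: float, maintenance: float) -> float:
    """Total effort over a component's life: write it, review it, maintain it."""
    return authoring + review + maintenance


def net_saving(manual: dict, assisted: dict) -> float:
    """Hours saved by LLM assistance; negative means assistance costs more."""
    return lifecycle_hours(**manual) - lifecycle_hours(**assisted)


# Placeholder figures (assumptions): assistance cuts authoring time but adds
# review and maintenance effort attributable to readability defects.
manual = {"authoring": 10.0, "review": 2.0, "maintenance": 8.0}
assisted = {"authoring": 3.0, "review": 4.0, "maintenance": 10.0}

saving = net_saving(manual, assisted)
```

With these placeholders, a 7-hour authoring gain shrinks to a 3-hour net gain once the extra review and maintenance hours are counted, which is the gap between naive and lifecycle-based productivity estimates.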
Practical recommendations for firms
- Treat LLM-generated code as requiring mandatory readability audits and refactoring budgets in project planning.
- Prioritize investments in automated readability-check tools, model fine-tuning targeted at idiomatic generation, and standards/linters to reduce hidden technical debt.
- Use prompt design where low-cost and fast, but do not rely on prompting alone to ensure maintainability—combine with post-hoc verification/refactoring pipelines.

Assessment

Paper Type: descriptive
Evidence Strength: medium. The paper analyzes a large sample (5,869 coding scenarios) and uses a comprehensive, multi-dimensional readability model to compare LLM-generated and human code, providing systematic empirical evidence; however, findings rely on an automated readability metric (potential measurement/construct validity concerns), limited description of LLM versions and languages/tasks, and do not establish causal effects on downstream economic outcomes like productivity or maintenance costs.
Methods Rigor: medium. The study is methodical in feature design (textual, structural, program, visual), uses large-scale data from World of Code and LeetCode, and systematically varies prompt dimensions, but it depends on an induced readability model whose external validity and inter-rater calibration versus human reviewers are not fully documented, and the prompt and model scope/selection choices could introduce selection biases.
Sample: Code examples drawn from 5,869 scenarios collected from large code repositories/benchmarks including World of Code (WoC) and LeetCode; for each scenario, code was generated by multiple mainstream LLMs under varied prompt configurations and compared against human-written solutions from the same sources; readability evaluated via a synthesized model combining textual, structural, program, and visual features.
Themes: human_ai_collab, productivity
Generalizability
- May not generalize beyond the tasks represented in WoC and LeetCode (benchmark/problem-solving code vs. large production codebases).
- Potentially limited to the programming languages and LLM versions tested (not specified; results may change with newer models).
- Automated readability model may not capture all facets of human judgments or team-specific style conventions.
- Prompt space explored may be a subset of real-world prompt-engineering practices; developer workflows and code-review practices vary across organizations.

Claims (7)

Claim	Category	Direction	Confidence	Outcome	Details
We establish a comprehensive readability model that synthesizes textual, structural, program, and visual features of code.	Output Quality	positive	high	code_readability (measured via the proposed readability model)	0.18
We evaluate the readability of code generated by mainstream LLMs under 5,869 scenarios extracted from large code bases including World of Code (WoC) and LeetCode.	Other	positive	high	coverage of evaluation / dataset size for readability assessment	n=5869 0.3
Current LLMs produce code with overall readability comparable to human-written code.	Output Quality	null_result	high	code_readability (overall/readability score)	n=5869 0.18
LLM-generated code displays distinct readability issue patterns compared to human-written code.	Output Quality	negative	high	readability_issue_patterns (feature-level readability problems)	n=5869 0.18
Function signatures, constraints and style descriptions emerge as the most influential prompt dimensions affecting the readability of LLM-generated code.	Output Quality	positive	high	impact_of_prompt_dimensions_on_readability	n=5869 0.18
The overall impact of prompt design on readability remains limited.	Output Quality	null_result	high	overall_effect_of_prompt_design_on_readability	n=5869 0.18
Distinct readability issue patterns and limited effectiveness of prompt engineering reveal a latent technical debt in LLM-generated code that could affect long-term maintainability.	Other	negative	high	maintainability_risk / technical_debt_inferred_from_readability	n=5869 0.03