The Commonplace

AI agents produce modest refactoring gains but also introduce lint and security issues: 22.5% of agent commits raise measured code quality (usability up most), yet 24% of files add Pylint issues and 4.7% add Bandit findings — nonetheless, 73.5% of such PRs are merged.

Quality and Security Signals in AI-Generated Python Refactoring Pull Requests
Mohamed Almukhtar, Anwar Ghammam, Hua Ming · May 20, 2026
arXiv · descriptive · medium evidence · 7/10 relevance
In a study of agent-authored Python refactoring PRs from the AIDev dataset, agent commits improved a given quality attribute in 22.5% of changes on average across five attributes (usability improved most often, at 36.5%). Meanwhile, 24.17% of modified files introduced new Pylint issues, 4.7% introduced new Bandit findings, and 73.5% of PRs were merged.

As AI agents increasingly contribute to code development and maintenance, there is still limited empirical evidence on the quality and risk characteristics of their changes in real-world projects, particularly for refactoring-oriented contributions. It remains unclear how agent-authored refactoring edits affect maintainability, code quality, and security once merged into GitHub repositories. To address this gap, we conduct an empirical study of Python refactoring pull requests (PRs) from the AIDev dataset. We analyze agentic refactoring PRs using PyQu, an ML-based quality assessment tool for Python, to quantify changes across five quality attributes, and we complement PyQu with domain-independent static analysis (Pylint and Bandit) to measure code quality and security issues before and after each change. Our results show that, on average, agentic commits improve a quality attribute in 22.5% of the studied changes, with usability improving most frequently (36.5%). At the same time, 24.17% of modified files introduce new Pylint issues (predominantly convention-level violations such as long lines), while 4.7% introduce new Bandit findings. From the observed diffs, we derive a taxonomy of 24 recurring change operations and map them to the lint and security findings they most commonly affect. Despite these mixed outcomes, developer acceptance is high: 73.5% of the analyzed PRs are merged, including cases that introduce new lint or security findings, often alongside the removal of existing issues. Overall, these findings highlight both the promise and current limitations of agentic refactoring, and motivate stronger tool-in-the-loop quality and security gating for AI-driven development workflows.

Summary

Main Finding

Agent-generated Python refactoring PRs produce measurable but modest quality gains: averaged over the five PyQu quality attributes, 22.5% of agentic commits improve a given attribute (usability improves most often, at 36.5%), while a non-trivial fraction of edits introduce new lint or security findings (24.17% of analyzed files for Pylint, 4.7% for Bandit). Despite these mixed automated-check outcomes, maintainers merge most of these PRs (73.5%), which suggests high practical acceptance and underlines the need for stronger tool-in-the-loop quality and security gating.

Key Points

  • Dataset and scope

    • Source: AIDev dataset filtered to popular (≥100 stars) Python repositories and PRs labeled as refactor.
    • Final sample: 438 refactoring PRs → 1,171 commits extracted → cleaned to 870 commits; 747 commits analyzed with PyQu (the rest were excluded, some because of Python 2 conversion issues).
    • File-level static analysis: 4,922 files present, 2,722 Python files, 2,528 files with usable paired Pylint/Bandit outputs.
  • Quality (PyQu) results (747 commits)

    • Average enhancement rate across five quality attributes: 22.5% of commits.
    • By attribute:
      • Usability: 36.5% (273/747)
      • Reliability: 27.6% (206/747)
      • Understandability: 24.0% (179/747)
      • Maintainability: 14.9% (111/747)
      • Modularity: 9.5% (71/747)
    • Confirmatory statistical tests (Mann–Whitney U; Cliff’s δ; BH FDR) show PyQu labels align with the mapped low-level metric deltas (strongest signals for understandability/reliability; weakest for modularity).
  • Static-analysis and security (file-level)

    • Pylint: 24.17% of analyzed files (611/2,528) introduced new Pylint findings after agent edits (mostly convention-level/stylistic violations like long lines).
    • Bandit: 4.7% of analyzed files (119/2,528) introduced new Bandit findings (largely risky practices; few high-severity vulnerabilities).
    • The authors derived a taxonomy of 24 recurring change operations from diffs and mapped operations to the lint/security identifiers they most often affected.
  • Developer acceptance and outcomes

    • PR-level: 438 refactor PRs; 322 merged (73.5%), 84 closed without merge; many closed PRs lacked explicit rationales in discussion logs.
    • Merges occurred even when automated checks regressed: 129 merged PRs introduced new Pylint issues; 27 merged PRs introduced new Bandit issues.
    • Manual analyses: two independent reviewers inspected “Not Enhanced” commits and high-impact issue removals (inter-rater κ ≈ 0.81–0.82).
  • Contributions released: replication package with extracted measurements and analysis scripts.
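
The percentages above can be re-derived from the reported counts. The sketch below is only an arithmetic check, not the authors' analysis script; the variable and function names are illustrative. It also shows that 22.5% is the mean of the five per-attribute enhancement rates, not the share of commits that improve at least one attribute.

```python
# Counts as reported in the Key Points above; names are illustrative.
PYQU_COMMITS = 747
ENHANCED = {
    "usability": 273,
    "reliability": 206,
    "understandability": 179,
    "maintainability": 111,
    "modularity": 71,
}
FILES_ANALYZED = 2528
FILES_NEW_PYLINT = 611
FILES_NEW_BANDIT = 119
PRS_TOTAL = 438
PRS_MERGED = 322


def pct(part: int, whole: int) -> float:
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


# Enhancement rate per quality attribute, over the 747 PyQu-analyzed commits.
per_attribute = {name: pct(n, PYQU_COMMITS) for name, n in ENHANCED.items()}

# The headline figure is the mean of those five rates.
mean_rate = sum(per_attribute.values()) / len(per_attribute)

if __name__ == "__main__":
    for name, rate in per_attribute.items():
        print(f"{name:18s} {rate:5.1f}%")
    print(f"mean across attributes       {mean_rate:.1f}%")
    print(f"files with new Pylint issues {pct(FILES_NEW_PYLINT, FILES_ANALYZED):.2f}%")
    print(f"files with new Bandit issues {pct(FILES_NEW_BANDIT, FILES_ANALYZED):.1f}%")
    print(f"merged PRs                   {pct(PRS_MERGED, PRS_TOTAL):.1f}%")
```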

Data & Methods

  • Data selection: AIDev → filter to repositories with >100 stars, primary language Python, PRs labeled refactor.
  • Agents represented: OpenAI Codex (majority), GitHub Copilot, Devin, Cursor, Claude Code.
  • Quality measurement: PyQu — computes low-level code metrics (cyclomatic complexity, docs density, coupling/cohesion, etc.) and maps them via ML classifiers to five high-level quality attributes; labels a QA as “Enhanced” vs “Not Enhanced”.
  • Static analysis: Pylint (code-quality messages; filtered to exclude pure stylistic and unreliable import warnings) and Bandit (security findings; excluded test files and B101).
  • Statistical analysis: Mann–Whitney U tests to compare metric-delta distributions between Enhanced vs Not Enhanced; Cliff’s δ for effect size; Benjamini–Hochberg correction for multiple comparisons.
  • Manual validation: independent dual coding of samples (Not Enhanced subset, and top issue-removal cases); adjudication after measuring inter-rater agreement.
  • Limitations called out by authors: PyQu classifiers were trained on Python ML projects (possible domain-transfer limits); scope is restricted to refactor PRs in popular Python repos; some commits and files were excluded due to tooling failures or Python 2 code.
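
The statistical step can be illustrated with a small pure-Python sketch. It is not the authors' code: the function names are hypothetical and the sample values in the usage note are made up. It computes the Mann–Whitney U statistic, derives Cliff's δ from it through the identity δ = 2U/(mn) − 1, and applies the Benjamini–Hochberg adjustment to a list of p-values. In practice the p-values themselves would come from a statistics library, for example `scipy.stats.mannwhitneyu`.

```python
from typing import List, Sequence


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> float:
    """U statistic for x versus y: each pair with x > y counts 1, ties count 0.5."""
    u = 0.0
    for a in x:
        for b in y:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


def cliffs_delta(x: Sequence[float], y: Sequence[float]) -> float:
    """Cliff's delta, P(x > y) - P(x < y), computed as 2U/(mn) - 1."""
    return 2.0 * mann_whitney_u(x, y) / (len(x) * len(y)) - 1.0


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical metric deltas for "Enhanced" vs "Not Enhanced" commits.
    enhanced = [3.0, 4.0, 5.0]
    not_enhanced = [1.0, 2.0, 3.0]
    print("U =", mann_whitney_u(enhanced, not_enhanced))
    print("delta =", cliffs_delta(enhanced, not_enhanced))
    # Hypothetical raw p-values, one per metric comparison.
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.20]))
```

A δ near +1 means the metric delta is almost always larger in the Enhanced group, a δ near 0 means the two groups overlap heavily.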

Implications for AI Economics

  • Productivity vs. quality trade-offs

    • Upside: Agentic refactorings can deliver measurable quality improvements (notably usability and reliability) and appear to be accepted frequently by maintainers, implying potential labor productivity gains and faster maintenance throughput.
    • Downside: Agents frequently introduce low-severity lint issues and occasionally security-related signals; these regressions create downstream costs (review time, remediation, increased CI churn) that partially offset productivity gains.
  • Adoption and diffusion economics

    • High merge rate (73.5%) despite regressions suggests low friction to adoption in practice: maintainers may value speed and automation over conservative gating, or may lack efficient ways to triage agent-led changes. This can accelerate diffusion of agentic tools across projects but also amplify systemic risks if poor-quality changes propagate.
  • Risk externalities and governance

    • Even if most regressions are stylistic/low severity, the presence of new Bandit findings in ~4.7% of files suggests non-zero security externalities. Firms should internalize the hidden costs by investing in CI gating, standardized static-analysis policies, and audit logs for agent changes.
    • Liability/insurance: as agentic contributions become a routine part of production workflows, organizations and insurers will need models to price the risk of AI-produced code changes and to set underwriting conditions (e.g., mandatory automated checks, provenance requirements).
  • Labor-market effects and task reallocation

    • The observed pattern—agents often performing mechanical refactors and minor cleanups but less often improving modularity/architectural aspects—suggests partial task automation: routine, small-scale maintenance is automatable while higher-level design/architecture work remains human-intensive.
    • Economic impact is likely to be reallocation rather than wholesale displacement: reviewers and maintainers may shift toward oversight, triage, and higher-value engineering tasks; complementary investments in monitoring and verification skills will be economically valuable.
  • Cost-benefit and ROI considerations

    • Organizations should evaluate the ROI of deploying agentic refactoring by accounting for:
      • Time saved per merged PR (throughput),
      • Cost of additional lint/security regressions (CI time, developer rework),
      • Quality improvements that reduce downstream maintenance costs (e.g., usability/reliability gains),
      • Implementation costs for gating and auditability tooling.
    • Empirical measurement frameworks (like the paper’s PyQu + Pylint/Bandit pipeline) enable quantifying these components for economic decision-making.
  • Research and policy priorities

    • Need for cost-effectiveness studies estimating net productivity gains after remediation costs.
    • Standardized benchmarks and metrics for agentic-code risk to support market mechanisms (e.g., SLAs, liability clauses).
    • Regulatory and procurement implications for safety-critical systems: stronger mandatory verification and provenance guarantees for agent-generated code.
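
One way to picture the CI gating mentioned above is a check that compares static-analysis findings before and after an agent's change and blocks the merge only when a blocking tool reports something new. The sketch below is a minimal illustration under assumed conventions, not a policy from the paper: the `Finding` shape, the `gate` function, and the choice to block on Bandit but not Pylint are all hypothetical. Findings are compared by per-rule counts rather than line numbers, since refactoring shifts lines.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# A finding is (tool, rule_id), for example ("pylint", "C0301") for a long line.
Finding = Tuple[str, str]


def introduced(before: Iterable[Finding], after: Iterable[Finding]) -> Counter:
    """Findings whose per-rule count grew between the two snapshots."""
    # Counter subtraction keeps only positive differences, so removals are ignored.
    return Counter(after) - Counter(before)


def gate(
    before: Iterable[Finding],
    after: Iterable[Finding],
    blocking_tools: Tuple[str, ...] = ("bandit",),
) -> Tuple[bool, List[Finding]]:
    """Return (passed, blockers); the gate fails if a blocking tool has new findings."""
    new = introduced(before, after)
    blockers = sorted(f for f in new if f[0] in blocking_tools)
    return (not blockers, blockers)


if __name__ == "__main__":
    # Hypothetical snapshots of one file before and after an agent edit.
    before = [("pylint", "C0301"), ("pylint", "W0611")]
    after = [("pylint", "C0301"), ("pylint", "C0301"), ("bandit", "B602")]
    passed, blockers = gate(before, after)
    print("passed:", passed, "blockers:", blockers)
```

Treating new convention-level Pylint messages as warnings and new security findings as blockers is one possible policy; a project could equally block on both.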

Caveats

  • Results are limited to Python refactoring PRs in popular GitHub repos and rely on PyQu (trained on ML projects) plus Pylint/Bandit; generalization to other languages, non-refactor tasks, or private/enterprise repositories is uncertain.
  • Many introduced Pylint findings are stylistic and may be low-cost to fix; Bandit findings vary in severity and require case-by-case assessment.

Overall, the paper provides actionable empirical inputs for economic assessments of deploying coding agents: they can boost maintenance throughput and deliver measurable quality gains, but they also generate non-trivial noise (lint and security signals) that imposes verification and governance costs, which organizations must budget for.
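
The cost and benefit components listed under the ROI considerations can be combined into a simple net-hours estimate. The sketch below is purely illustrative: the paper reports none of these per-PR costs, so every field name and every number in the usage example is a hypothetical placeholder to be replaced with an organization's own measurements.

```python
from dataclasses import dataclass


@dataclass
class RefactoringRoiInputs:
    """Hypothetical inputs; none of these values come from the paper."""
    merged_prs: int                     # agent PRs that were merged
    hours_saved_per_merged_pr: float    # throughput gain per merged PR
    regressions: int                    # new lint/security findings needing rework
    rework_hours_per_regression: float  # CI time and developer rework per finding
    downstream_hours_avoided: float     # maintenance avoided through quality gains
    gating_setup_hours: float           # cost of gating and auditability tooling


def net_hours(inputs: RefactoringRoiInputs) -> float:
    """Benefit minus cost, in engineer-hours."""
    benefit = (
        inputs.merged_prs * inputs.hours_saved_per_merged_pr
        + inputs.downstream_hours_avoided
    )
    cost = (
        inputs.regressions * inputs.rework_hours_per_regression
        + inputs.gating_setup_hours
    )
    return benefit - cost


if __name__ == "__main__":
    # Placeholder numbers chosen only to exercise the function.
    example = RefactoringRoiInputs(
        merged_prs=10,
        hours_saved_per_merged_pr=2.0,
        regressions=4,
        rework_hours_per_regression=0.5,
        downstream_hours_avoided=3.0,
        gating_setup_hours=8.0,
    )
    print(net_hours(example))
```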

Assessment

Paper Type: descriptive

Evidence Strength: medium. Uses real-world GitHub PRs and multiple automated quality/security tools to quantify outcomes, providing concrete empirical evidence on agent-authored refactorings; however, it is observational with no causal identification, relies on heuristic/static analyzers and an existing dataset (AIDev) that may be biased, limiting claims about broader effects.

Methods Rigor: medium. Combines a state-of-the-art ML-based quality assessor (PyQu) with established static analyzers (Pylint, Bandit) and a taxonomy derived from actual diffs, showing careful multi-tool triangulation; but validity depends on the accuracy and coverage of those tools, sample selection and labeling details are not provided in the abstract, and there is no experimental or longitudinal design to control confounders.

Sample: Agent-authored Python refactoring pull requests from the AIDev dataset (agentic refactoring PRs merged or proposed on GitHub); analyses operate at the file-change and PR level using PyQu for five quality attributes and Pylint/Bandit for static lint and security findings (exact PR/sample counts not provided in the summary).

Themes: human_ai_collab, productivity

Generalizability

  • Dataset-limited: results come from the AIDev dataset and may not represent all agent workflows or models.
  • Language-limited: only Python refactorings analyzed, excluding other languages and cross-language projects.
  • Refactoring-focus: excludes feature/logic changes and non-refactoring agent contributions.
  • Tool-limited: dependent on PyQu, Pylint and Bandit accuracy and coverage; dynamic issues and runtime behavior not captured.
  • Open-source/GitHub bias: private repositories and enterprise workflows may differ.
  • Temporal/model drift: findings may not generalize across different AI agent versions, prompts, or over time.

Claims (9)

  • Agentic commits improve a quality attribute in 22.5% of the studied changes.
    • Output Quality · Direction: positive · Confidence: high
    • Outcome: improvement in any measured code quality attribute (per change)
    • Details: 22.5% · 0.18
  • Usability is the quality attribute that improves most frequently, improving in 36.5% of the studied changes.
    • Output Quality · Direction: positive · Confidence: high
    • Outcome: usability (one of PyQu's quality attributes)
    • Details: 36.5% · 0.18
  • 24.17% of modified files introduce new Pylint issues, predominantly convention-level violations such as long lines.
    • Output Quality · Direction: negative · Confidence: high
    • Outcome: number/proportion of modified files with new Pylint issues (lint violations, mainly convention level)
    • Details: 24.17% · 0.18
  • 4.7% of modified files introduce new Bandit findings (security issues).
    • Output Quality · Direction: negative · Confidence: high
    • Outcome: presence of new Bandit security findings in modified files
    • Details: 4.7% · 0.18
  • From the observed diffs, we derive a taxonomy of 24 recurring change operations.
    • Other · Direction: null_result · Confidence: high
    • Outcome: count and categorization of recurring change operations present in diffs
    • Details: 24 operations (taxonomy size) · 0.18
  • 73.5% of the analyzed PRs are merged (developer acceptance is high).
    • Adoption Rate · Direction: positive · Confidence: high
    • Outcome: PR merge rate (acceptance)
    • Details: 73.5% · 0.18
  • Some merged PRs introduce new lint or security findings while simultaneously removing existing issues (i.e., merges sometimes involve both addition and removal of issues).
    • Output Quality · Direction: mixed · Confidence: high
    • Outcome: co-occurrence of introduced and removed lint/security findings in merged PRs
    • Details: 0.18
  • The study uses PyQu to quantify changes across five quality attributes for Python code.
    • Output Quality · Direction: null_result · Confidence: high
    • Outcome: five PyQu quality attributes (measured by the tool)
    • Details: 0.3
  • Given the mixed outcomes (some improvements, some new lint/security issues), stronger tool-in-the-loop quality and security gating is motivated for AI-driven development workflows.
    • Governance And Regulation · Direction: positive · Confidence: high
    • Outcome: policy/process recommendation (quality/security gating)
    • Details: 0.03

Notes