
Autonomous coding agents are increasingly active in open-source projects, but their patches are more fragile — agent-authored pull requests are rising in frequency yet are revised or removed at higher rates than human contributions, raising maintainability concerns.

Investigating Autonomous Agent Contributions in the Wild: Activity Patterns and Code Change over Time
Razvan Mihai Popescu, David Gros, Andrei Botocan, Rahul Pandita, Prem Devanbu, Maliheh Izadi · April 01, 2026
arXiv · correlational · medium evidence · 7/10 relevance
Using a novel dataset of ~110k GitHub pull requests, the paper finds rising activity from autonomous coding agents but shows agent-generated code experiences higher churn and lower survival over time compared with human-authored code.

The rise of large language models for code has reshaped software development. Autonomous coding agents, able to create branches, open pull requests, and perform code reviews, now actively contribute to real-world projects. Their growing role offers a unique and timely opportunity to investigate AI-driven contributions and their effects on code quality, team dynamics, and software maintainability. In this work, we construct a novel dataset of approximately 110,000 open-source pull requests, including associated commits, comments, reviews, issues, and file changes, collectively representing millions of lines of source code. We compare five popular coding agents, including OpenAI Codex, Claude Code, GitHub Copilot, Google Jules, and Devin, examining how their usage differs in various development aspects such as merge frequency, edited file types, and developer interaction signals, including comments and reviews. Furthermore, we emphasize that code authoring and review are only a small part of the larger software engineering process, as the resulting code must also be maintained and updated over time. Hence, we offer several longitudinal estimates of survival and churn rates for agent-generated versus human-authored code. Ultimately, our findings indicate an increasing agent activity in open-source projects, although their contributions are associated with more churn over time compared to human-authored code.

Summary

Main Finding

Agentic coding systems (five agents: OpenAI Codex, Claude Code, GitHub Copilot, Google Jules, Devin) are increasingly active in GitHub pull-request workflows, especially in lower-starred repositories. Using a new curated dataset of ~110,000 PRs (June–Aug 2025), the authors find that although agents accelerate some development activity, agent-authored code exhibits higher subsequent churn and lower long-term survival than human-authored code — i.e., agent contributions tend to be revised or replaced more often over time.

Key Points

  • Scope and dataset: novel, large-scale dataset of ≈110k PRs (with associated commits, comments, reviews, issues, file changes) sampled from agent-labeled PR activity over a three-month window (June–Aug 2025). The dataset spans millions of lines of code and five representative coding agents.
  • Agents studied: OpenAI Codex, Claude Code, GitHub Copilot, Google Jules, Devin. Detection used agent-specific signals (branch name prefixes for some agents; bot author fields or watermark strings for others).
  • Activity patterns:
    • Agent PR activity is growing and concentrated disproportionately in low-star (less popular) repositories.
    • Agents differ in usage patterns (merge frequency, merge latency, types of files edited, commit density, review and comment patterns).
  • Evolution over time:
    • Agent-generated code shows higher churn and lower survival rates compared to human-authored code, implying more follow-up edits and maintenance effort after initial merge.
  • Contribution characteristics analyzed: merge rates and latency, change complexity, file composition, commit/review density, repository characteristics.
  • Dataset released for follow-up research (link provided in paper).
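
The survival and churn comparison above can be illustrated with a minimal Python sketch. The summary does not state the paper's exact metric definitions, so the ones used here are assumptions: survival is the share of added lines still present unchanged at a later snapshot, and churn is its complement. The names `Contribution`, `survival_rate`, and `churn_rate` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Contribution:
    """One merged PR, reduced to the two counts this sketch needs."""
    author_type: str      # "agent" or "human"
    lines_added: int      # lines introduced by the PR at merge time
    lines_surviving: int  # of those lines, how many are unchanged at a later snapshot


def survival_rate(contribs: Iterable[Contribution], author_type: str) -> float:
    """Pooled share of added lines that survive (assumed definition)."""
    selected = [c for c in contribs if c.author_type == author_type]
    added = sum(c.lines_added for c in selected)
    if added == 0:
        raise ValueError(f"no added lines for author type {author_type!r}")
    surviving = sum(c.lines_surviving for c in selected)
    return surviving / added


def churn_rate(contribs: Iterable[Contribution], author_type: str) -> float:
    """Complement of survival: share of added lines later edited or deleted."""
    return 1.0 - survival_rate(list(contribs), author_type)
```

Pooling line counts across PRs weights large PRs more heavily than small ones; averaging per-PR rates is an equally plausible choice and could give a different picture.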

Data & Methods

  • Data source and period: GitHub GraphQL API; extraction window June–August 2025 (chosen to include agents’ availability dates and capture contemporaneous activity).
  • Agent identification: agent-specific signals
    • Branch prefix searches (e.g., head:codex/, head:copilot/) for agents that open PRs from branches with a recognizable name prefix.
    • Bot author fields for agents registered as GitHub apps (e.g., author:devin-ai-integration[bot], author:google-labs-jules[bot]).
    • Authorship/watermark strings (e.g., “Co-Authored-By: Claude” or “Generated with Claude Code”).
  • Sampling strategy:
    • Full traversal of the three-month window with stratified sampling frequency per agent (denser for high-volume agents, sparser for low-volume) and a per-day upper limit to avoid domination by very active agents.
    • For lower-activity agents (Jules, Devin) all PRs in the period were included.
  • Collected metadata per PR: up to the first 100 commits, comments, issues, reviews, and modified files (consistent with GitHub page sizes). Duplicates and pagination artifacts were removed.
  • Analyses performed:
    • Cross-sectional comparisons of agent vs. human PRs: merge rates, merge latency, change size/complexity, file type distributions, commit/review/comment density, repository-star distribution.
    • Longitudinal analyses: survival estimates and churn metrics for code contributed by agents vs humans (measuring follow-up edits / deletions over time).
  • Limitations acknowledged by authors:
    • Three-month snapshot — may miss longer-term trends and agent evolution outside this window.
    • Agent detection relies on "tell-tale" signals (branch prefixes, bot authorship, watermarks); some agent contributions could be missed or misattributed.
    • Sampling designed for representativeness across agents but cannot recover the full population (Codex had ~1.1M PRs in the period; authors used a subsample).
    • Observational (correlational) design — cannot attribute causality for higher churn to agent use alone (confounders like repo quality, task type, developer oversight may matter).
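
The agent-identification step described above can be sketched in Python as follows. The signal strings are the examples quoted in the summary; the lookup tables, the `PullRequest` type, the `detect_agent` function, and the order in which signals are checked are illustrative assumptions rather than the authors' pipeline.

```python
from dataclasses import dataclass
from typing import Optional

# Example signals quoted in the summary; the full lists used by the authors
# are not given there, so these tables are deliberately incomplete.
BRANCH_PREFIXES = {
    "codex/": "OpenAI Codex",
    "copilot/": "GitHub Copilot",
}
BOT_AUTHORS = {
    "devin-ai-integration[bot]": "Devin",
    "google-labs-jules[bot]": "Google Jules",
}
CLAUDE_WATERMARKS = ("Co-Authored-By: Claude", "Generated with Claude Code")


@dataclass
class PullRequest:
    """Hypothetical minimal view of a PR for labeling purposes."""
    head_branch: str
    author_login: str
    text: str  # PR body plus commit messages, concatenated


def detect_agent(pr: PullRequest) -> Optional[str]:
    """Return the agent name suggested by the PR's signals, or None."""
    # 1. Branch-name prefix.
    for prefix, agent in BRANCH_PREFIXES.items():
        if pr.head_branch.startswith(prefix):
            return agent
    # 2. Bot author field.
    agent = BOT_AUTHORS.get(pr.author_login)
    if agent is not None:
        return agent
    # 3. Watermark string, matched case-insensitively (an assumption).
    lowered = pr.text.lower()
    if any(mark.lower() in lowered for mark in CLAUDE_WATERMARKS):
        return "Claude Code"
    return None
```

A `None` result means only that no listed signal matched, not that the PR is human-authored; this is the missed-or-misattributed risk noted in the limitations.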

Implications for AI Economics

  • Productivity vs maintenance trade-off: short-term productivity or throughput gains from autonomous agents (faster PR creation, possible faster merges) may be offset by higher downstream maintenance costs (increased churn, lower code survival). Economic analyses of AI-driven developer productivity must quantify maintenance and revision costs, not just immediate output rates.
  • Labor-market effects: increased use of autonomous agents may shift demand within software teams away from routine code-authoring toward review, integration, and maintenance tasks. This favors workers who can supervise, audit, and maintain AI-generated code; it may reduce demand for entry-level coding tasks but increase demand for verification/auditing roles.
  • Firm incentives and platform design: repository owners and firms need to consider incentives and policies (e.g., limiting autonomous merges, stricter review requirements) to internalize maintenance costs. Platforms (GitHub, cloud IDEs) and vendors may monetize quality-control, verification, or monitoring services for agentic output.
  • Quality externalities and risk: concentration of agent activity in lower-quality or low-star repos suggests potential negative externalities (propagation of technical debt across the open-source ecosystem). This raises insurance, liability, and software supply-chain risk considerations—relevant for firms relying on open-source dependencies.
  • Market opportunities: higher churn in agent-authored code creates demand for complementary services (automated testing, continuous monitoring, AI-code auditors, maintenance-as-a-service offerings), altering the structure of the software tool/service market.
  • Measurement for macro productivity: macroeconomic measures of AI-driven productivity gains in software should incorporate life-cycle metrics (including maintenance churn and survival), otherwise estimates will be biased upward. Longitudinal accounting will be necessary to assess net welfare gains.
  • Research & policy directions:
    • Need for causal studies and randomized deployments to estimate net welfare and labor impacts.
    • Longer-term monitoring to see whether churn reduces as agent quality and integration improve.
    • Standards and labeling for agent-produced code to improve traceability and enable better economic measurement and governance.

Suggested next steps for researchers/policymakers interested in AI economics:
  • Incorporate maintenance/churn costs into productivity and labor demand models.
  • Run firm-level or repo-level experiments that randomize agent usage to identify causal effects on output quality and maintenance burden.
  • Evaluate the economic value of verification/auditing tools and related business models.
  • Monitor inequality effects across developer skill levels and geographic markets as agent use scales.


Assessment

Paper Type: correlational

Evidence Strength: medium. The paper uses a large, novel observational dataset (~110k pull requests) and provides descriptive and longitudinal comparisons between agent- and human-authored contributions, which supports reliable associations; however, it lacks a credible causal identification strategy (no randomized assignment, natural experiment, or strong quasi-experimental design) and thus cannot rule out selection, confounding, or measurement biases that could explain differences in churn and survival.

Methods Rigor: medium. Data collection appears comprehensive and the authors analyze multiple signals (commits, comments, reviews, file diffs, survival/churn over time), suggesting careful empirical work; but the rigor is limited by (a) potential error in labeling agent vs human contributions, (b) lack of adjustments for confounders that determine where agents are used (project type, contributor experience, PR size/complexity), and (c) no robustness checks or quasi-experimental leverage reported to support causal claims.

Sample: Approximately 110,000 open-source GitHub pull requests with associated commits, comments, reviews, issues, and file changes (millions of lines of code), covering contributions labeled as from five coding agents (OpenAI Codex, Claude Code, GitHub Copilot, Google Jules, Devin) alongside human-authored PRs, drawn from a June–August 2025 window; exact repo inclusion criteria are not specified in the summary.

Themes: human_ai_collab, productivity, adoption

Generalizability:
  • Restricted to public/open-source GitHub repositories; may not generalize to private, enterprise, or proprietary codebases.
  • Likely biased toward projects and developers willing to use agents; selection on unobservables (task difficulty, contributor skill) may confound comparisons.
  • Findings tied to specific agents and versions evaluated; results may change as agents improve.
  • Potential language/tech-stack skew depending on sampled repositories.
  • Time-bound to an early-adoption period; future agent behavior and project practices may differ.

Claims (7)

  • Claim: We construct a novel dataset of approximately 110,000 open-source pull requests, including associated commits, comments, reviews, issues, and file changes, collectively representing millions of lines of source code. [Other]
    • Direction: null_result
    • Confidence: high
    • Outcome: number of pull requests and total lines of source code in dataset
    • Details: n=110000; 0.5
  • Claim: We compare five popular coding agents, including OpenAI Codex, Claude Code, GitHub Copilot, Google Jules, and Devin, examining how their usage differs in various development aspects such as merge frequency, edited file types, and developer interaction signals, including comments and reviews. [Adoption Rate]
    • Direction: null_result
    • Confidence: high
    • Outcome: merge frequency, edited file types, developer interaction signals (comments, reviews)
    • Details: n=110000; 0.3
  • Claim: We offer several longitudinal estimates of survival and churn rates for agent-generated versus human-authored code. [Output Quality]
    • Direction: null_result
    • Confidence: high
    • Outcome: survival rates and churn rates of code contributions
    • Details: n=110000; 0.3
  • Claim: Our findings indicate an increasing agent activity in open-source projects. [Adoption Rate]
    • Direction: positive
    • Confidence: high
    • Outcome: agent activity / contributions in open-source projects over time
    • Details: n=110000; 0.3
  • Claim: Agent contributions are associated with more churn over time compared to human-authored code. [Output Quality]
    • Direction: negative
    • Confidence: high
    • Outcome: code churn rate over time (agent-generated vs human-authored)
    • Details: n=110000; 0.3
  • Claim: Autonomous coding agents, able to create branches, open pull requests, and perform code reviews, now actively contribute to real-world projects. [Adoption Rate]
    • Direction: positive
    • Confidence: medium
    • Outcome: presence of agent-originated development activities (branches, PRs, reviews)
    • Details: 0.18
  • Claim: Code authoring and review are only a small part of the larger software engineering process; the resulting code must also be maintained and updated over time. [Other]
    • Direction: null_result
    • Confidence: high
    • Outcome: relative share of authoring/review versus maintenance in software development (conceptual)
    • Details: 0.05

Notes