The Commonplace

LLM coding assistants speed up developers and cut routine work, but their effect on code quality and teamwork remains unresolved; most studies are short-term and exploratory, leaving long-run and team-level impacts unclear.

The Impact of LLM-Assistants on Software Developer Productivity: A Systematic Review and Mapping Study
Amr Mohamed, Maram Assi, Mariam Guizani · April 27, 2026 · ACM Transactions on Software Engineering and Methodology
openalex · review_meta · medium evidence · 7/10 relevance
A systematic review of 39 studies finds that LLM assistants commonly accelerate development and automate repetitive tasks, but evidence on code quality and team collaboration is mixed, and most research is exploratory and short-term.

Large language model assistants (LLM-assistants) present new opportunities to transform software development. Developers are increasingly adopting these tools across tasks, including coding, testing, debugging, documentation, and design. Yet, despite growing interest, there is no synthesis of how LLM-assistants affect software developer productivity. In this paper, we present a systematic review and mapping of 39 peer-reviewed studies published between January 2014 and December 2024 that examine this impact. Our analysis reveals that the majority of studies report considerable benefits from LLM-assistants, though a notable subset identifies critical risks. Commonly reported gains include accelerated development, minimized code search, and the automation of trivial and repetitive tasks. However, studies also highlight concerns around cognitive offloading and reduced team collaboration. Our study reveals that whether LLM-based assistants improve or degrade code quality remains unresolved, as existing studies report contradictory outcomes contingent on context and evaluation criteria. While the majority of studies (90%) adopt a multi-dimensional perspective by examining at least two SPACE dimensions, reflecting increased awareness of the complexity of developer productivity, only 15% extend beyond three dimensions, indicating substantial room for more integrated evaluations. Satisfaction, Performance, and Efficiency are the most frequently investigated dimensions, whereas Communication and Activity remain underexplored. Most studies are exploratory (59%) and methodologically diverse, but lack longitudinal and team-based evaluations. This review surfaces key research gaps and provides recommendations for future research and practice. All artifacts associated with this study are publicly available at https://zenodo.org/records/18489222.

Summary

Main Finding

This systematic review of 39 peer‑reviewed studies (2014–Dec 2024) finds that LLM‑based coding assistants generally produce measurable short‑term productivity benefits for software developers (faster development, less code search, automation of repetitive tasks, reduced task initiation overhead), but also introduce meaningful risks (cognitive offloading, reduced team collaboration, flow disruption). The effect on code quality remains unresolved—existing studies report contradictory outcomes depending on context, tasks, and evaluation criteria. Most primary studies are exploratory, methodologically diverse, and short‑term; team‑level and longitudinal evidence is scarce.

Reference: Amr Mohamed, Maram Assi, Mariam Guizani — “The Impact of LLM‑Assistants on Software Developer Productivity: A Systematic Review and Mapping Study” (39 studies synthesized; replication artifacts: https://zenodo.org/records/18489222).

Key Points

  • Evidence base
    • 39 primary studies (published 2014–Dec 2024) selected from 9,756 records across 6 databases, with 228 full texts screened before final inclusion.
    • Authors used Kitchenham & Charters SLR protocol, PRISMA flow, and validated search with control papers.
  • Overall reported effects
    • Common benefits: accelerated task completion, reduced code search, automation of trivial/repetitive tasks, lower task initiation overhead, better support for code‑adjacent work (documentation, tests).
    • Common risks: over‑reliance/cognitive offloading, decreased team communication or collaboration, flow disruption, possible propagation of subtle bugs or insecure patterns.
    • Code quality: mixed and context‑dependent — some studies find improvements, others degradation or no effect.
  • Productivity conceptualization and measurement
    • Studies mapped to the SPACE framework: Satisfaction & well‑being, Performance, Activity, Communication & collaboration, Efficiency & flow.
    • 90% of studies treat productivity multi‑dimensionally (≥2 SPACE dimensions); only ~15% examine >3 dimensions.
    • Most frequently studied: Satisfaction, Performance, Efficiency/Flow. Understudied: Communication and Activity.
  • Methodology and gaps
    • 59% of studies are exploratory; methods are heterogeneous (lab experiments, controlled tasks, surveys, observational analyses).
    • Major gaps: few longitudinal studies, few team‑level or organization‑level studies, limited real‑world deployment studies, inconsistent metrics for productivity and quality.
  • Artefacts and transparency
    • Authors released replication package and selection decisions publicly (zenodo link).
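The multi-dimensionality figures above (90% of studies covering ≥2 SPACE dimensions, ~15% covering more than 3) can be reproduced from a per-study mapping with a short tally. The sketch below is illustrative: the study-to-dimension mapping is a hypothetical placeholder, not the review's actual coding sheet; only the aggregation logic is shown.

```python
# Illustrative tally of SPACE-dimension coverage across reviewed studies.
# The per-study dimension sets are hypothetical placeholders, not the
# review's data; only the aggregation logic is the point.
from collections import Counter

# Hypothetical mapping: study id -> SPACE dimensions it measures
studies = {
    "S1": {"Satisfaction", "Performance"},
    "S2": {"Performance", "Efficiency", "Satisfaction"},
    "S3": {"Efficiency"},
}

def coverage_stats(studies):
    """Share of studies measuring >=2 and >3 SPACE dimensions, plus per-dimension counts."""
    n = len(studies)
    multi = sum(1 for dims in studies.values() if len(dims) >= 2) / n
    broad = sum(1 for dims in studies.values() if len(dims) > 3) / n
    per_dim = Counter(d for dims in studies.values() for d in dims)
    return multi, broad, per_dim

multi, broad, per_dim = coverage_stats(studies)
print(f">=2 dimensions: {multi:.0%}, >3 dimensions: {broad:.0%}")
print(per_dim.most_common())
```

Applied to the review's 39-study coding sheet, the same tally would yield the reported 90% and ~15% shares.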

Data & Methods

  • Search & selection
    • Databases: ACM DL, IEEE Xplore, ScienceDirect, Web of Science, Scopus, SpringerLink.
    • Initial hits: 9,756; after deduplication: 8,953 records screened; full texts screened: 228; final included: 39.
    • Query strategy used iterative refinement, proximity operators where supported, and validation against 17 control papers.
  • Inclusion/exclusion highlights
    • Included: peer‑reviewed English papers (2014+) that investigate the effect of AI/LLMs on developer productivity.
    • Excluded: secondary studies, short papers (<4 pages), non‑peer reviewed grey literature, inaccessible texts, out‑of‑scope/out‑of‑focus works.
  • Analysis frameworks
    • Mapped each study onto the SPACE productivity framework; discussion augmented with McLuhan’s Tetrad to interpret socio‑technical implications.
  • Characteristics of primary studies
    • Study designs: lab experiments, controlled tasks, surveys, observational analyses of IDE/tool usage, case studies.
    • Focus: individual developer interactions with LLM‑assistants predominate; few team studies.
    • Temporal scope: largely short‑term snapshots; almost no long‑horizon follow‑ups.
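The selection funnel reported above (9,756 database hits down to 39 included studies) can be laid out with per-stage retention rates; stage names below are paraphrased from the text, and the percentages are simple ratios of consecutive counts.

```python
# The review's selection funnel as reported (9,756 hits -> 39 included),
# with each stage's retention relative to the previous stage.
funnel = [
    ("database hits", 9756),
    ("records screened after deduplication", 8953),
    ("full texts screened", 228),
    ("studies included", 39),
]

prev = None
for stage, count in funnel:
    pct = f" ({count / prev:.1%} of previous stage)" if prev else ""
    print(f"{stage}: {count:,}{pct}")
    prev = count
```

The steep drop at full-text screening (228 of 8,953 records) is typical of broad query strategies validated against control papers, as described above.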

Implications for AI Economics

  • Labor demand and task composition
    • Short‑term productivity gains suggest reduced time per task and potential reallocation of developer effort toward higher‑value tasks (design, architecture, coordination).
    • Economists should model task‑level substitution: LLMs substitute for routine coding/search tasks while complementing higher‑skill activities—implying shifts in relative demand for skills.
  • Wage and skill‑premium effects
    • If LLMs automate routine tasks, demand for mid‑level routine coding may fall while demand for senior/architectural, verification, and coordination skills rises—potentially increasing skill premia and polarization within software labor markets.
    • Heterogeneous effects by task, experience, and sector: junior developers may gain or lose depending on adoption, oversight requirements, and supervision structures.
  • Productivity vs. quality tradeoffs and externalities
    • Mixed code‑quality findings imply ambiguous impact on product reliability and consumer welfare. Economists should account for possible negative externalities (security/bug propagation) that raise downstream costs.
    • Firms may face tradeoffs between short‑term throughput gains and longer‑term maintenance costs; general equilibrium impacts depend on how quality is verified and regulated.
  • Organizational complementarities and team effects
    • Reduced communication/collaboration signals changed complementarities between tools and human coordination. Team‑level complementarities could amplify or dampen productivity gains; hence, firm‑level models need to include coordination frictions and knowledge spillovers.
  • Measurement and empirical research needs
    • Multi‑dimensional productivity: researchers should move beyond single proxies (LOC, task time) and integrate SPACE dimensions into empirical work (satisfaction, activity, communication, efficiency, performance).
    • Urgent need for longitudinal, team‑level, and field experiments (or administrative/firm panel data) to estimate durable effects on employment, wages, promotion, churn, and firm performance.
    • Use of administrative IDE/tool logs, matched employer‑employee data, and randomized encouragement/rollouts would improve causal identification.
  • Policy and market design
    • Regulation, standards, and certification for LLM‑generated code may be needed if negative externalities are nontrivial (security, liability).
    • Training and reskilling policies should emphasize supervisory, verification, and collaborative skills that complement LLMs.
  • Practical research recommendations for economists
    • Build task‑based models that explicitly separate routine vs non‑routine software tasks and model complementarities with human capital.
    • Estimate heterogeneous treatment effects by experience level, team structure, and sector; test interactions between tool adoption and managerial practices.
    • Quantify welfare effects including product quality, maintenance costs, and consumer risk; include dynamic adjustment (retraining, task creation).
    • Leverage the reviewed replication package and SPACE mapping as a taxonomy for constructing multi‑dimensional outcome variables.
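The heterogeneous-treatment-effects recommendation above can be sketched as a simple interacted regression: a productivity outcome on tool adoption, an experience indicator, and their interaction. Everything here is a hedged illustration on simulated data; the coefficients, sample size, and variable names are assumptions, not estimates from any study in the review.

```python
# Minimal sketch (simulated data) of heterogeneous-effect estimation:
# outcome ~ adoption + seniority + adoption x seniority.
# All data and "true" coefficients are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n = 2000
adopt = rng.integers(0, 2, n)   # LLM-assistant adoption indicator (0/1)
senior = rng.integers(0, 2, n)  # experience-level indicator (0/1)

# Simulated truth: adoption helps (+0.5), more so for seniors (+0.3 interaction)
y = 1.0 + 0.5 * adopt + 0.2 * senior + 0.3 * adopt * senior + rng.normal(0, 1, n)

# OLS via least squares: design matrix with constant and interaction term
X = np.column_stack([np.ones(n), adopt, senior, adopt * senior])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(dict(zip(["const", "adopt", "senior", "adopt x senior"], beta.round(2))))
```

In practice, identification would come from randomized rollouts or encouragement designs as noted above, with standard errors clustered at the team or firm level rather than this naive OLS.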

Summary takeaway: LLM‑assistants appear to reallocate developer effort and raise short‑term productivity along several dimensions, but the long‑run labor, organizational, and product‑quality consequences are uncertain. Economists should prioritize task‑level, team‑level, and longitudinal empirical strategies and incorporate multi‑dimensional productivity metrics to estimate durable impacts and guide policy.

Assessment

Paper Type: review_meta
Evidence Strength: medium — The paper systematically synthesizes 39 peer-reviewed studies and identifies consistent patterns (speedups, automation of routine tasks), but the underlying evidence is heterogeneous, often exploratory, mostly short-term, and reports contradictory findings on code quality and collaboration — limiting confident causal conclusions.
Methods Rigor: medium — The study appears to follow a systematic mapping approach, covers a decade of peer-reviewed work, and publishes artifacts, but it relies on heterogeneous primary studies without a formal meta-analysis or standardized effect-size aggregation; quality assessment of included studies and exclusion of grey literature are potential limitations.
Sample: Systematic review and mapping of 39 peer-reviewed empirical studies published January 2014–December 2024 examining LLM-based assistants in software development tasks (coding, testing, debugging, documentation, design); most primary studies are exploratory, short-term, individual-level lab or field studies that measure outcomes across multiple SPACE dimensions (Satisfaction, Performance, and Efficiency most common; Communication and Activity underexplored).
Themes: productivity, human_ai_collab
Generalizability:
  • Restricted to peer-reviewed literature (excludes industry/internal evaluations and grey literature)
  • Primary studies are predominantly short-term and exploratory, limiting inference about long-run effects
  • Few team-based or longitudinal studies, so findings may not extend to multi-developer or organizational contexts
  • Heterogeneous LLMs, prompts, tasks, and metrics reduce comparability and external validity
  • Likely biased toward English-language and WEIRD research settings
  • Results focus on developer-level productivity and may not generalize to firm-level productivity or labor-market outcomes

Claims (14)

  • This paper is a systematic review and mapping of 39 peer-reviewed studies published between January 2014 and December 2024 that examine the impact of LLM-assistants on software developer productivity.
    Outcome: Other · Direction: null_result · Confidence: high · Measure: scope of literature reviewed (count of studies) · n=39 · 0.4
  • The majority of reviewed studies report considerable benefits from LLM-assistants.
    Outcome: Developer Productivity · Direction: positive · Confidence: high · Measure: overall reported impact on developer productivity · n=39 · 0.24
  • A notable subset of studies identifies critical risks associated with LLM-assistants.
    Outcome: Other · Direction: negative · Confidence: high · Measure: reported risks and negative impacts · n=39 · 0.24
  • Commonly reported gains from LLM-assistants include accelerated development (faster task completion).
    Outcome: Task Completion Time · Direction: positive · Confidence: high · Measure: task completion time / development speed · n=39 · 0.24
  • Commonly reported gains include minimized code search due to LLM assistance.
    Outcome: Developer Productivity · Direction: positive · Confidence: high · Measure: time/effort spent searching for code or information · n=39 · 0.24
  • Commonly reported gains include the automation of trivial and repetitive tasks.
    Outcome: Developer Productivity · Direction: positive · Confidence: high · Measure: automation of low-complexity tasks / developer time freed · n=39 · 0.24
  • Studies highlight concerns around cognitive offloading and reduced team collaboration when using LLM-assistants.
    Outcome: Team Performance · Direction: negative · Confidence: high · Measure: cognitive processes and team collaboration · n=39 · 0.24
  • Whether LLM-based assistants improve or degrade code quality remains unresolved: existing studies report contradictory outcomes contingent on context and evaluation criteria.
    Outcome: Output Quality · Direction: mixed · Confidence: high · Measure: code quality (e.g., correctness, maintainability, defects) · n=39 · 0.24
  • 90% of the reviewed studies adopt a multi-dimensional perspective by examining at least two SPACE dimensions.
    Outcome: Other · Direction: null_result · Confidence: high · Measure: proportion of studies examining >=2 SPACE dimensions · n=39 · 90% · 0.4
  • Only 15% of the reviewed studies extend beyond three SPACE dimensions.
    Outcome: Other · Direction: null_result · Confidence: high · Measure: proportion of studies examining >3 SPACE dimensions · n=39 · 15% · 0.4
  • Satisfaction, Performance, and Efficiency are the most frequently investigated SPACE dimensions, whereas Communication and Activity remain underexplored.
    Outcome: Other · Direction: null_result · Confidence: high · Measure: frequency of SPACE dimensions studied · n=39 · 0.4
  • Most studies are exploratory (59%) and methodologically diverse, but there is a lack of longitudinal and team-based evaluations.
    Outcome: Other · Direction: negative · Confidence: high · Measure: study design types and presence/absence of longitudinal or team-based evaluations · n=39 · 59% · 0.4
  • This review identifies key research gaps and provides recommendations for future research and practice.
    Outcome: Other · Direction: null_result · Confidence: high · Measure: research gaps and recommendations (qualitative synthesis) · n=39 · 0.04
  • All artifacts associated with this study are publicly available at https://zenodo.org/records/18489222.
    Outcome: Other · Direction: null_result · Confidence: high · Measure: availability of study artifacts · 0.4

Notes