A meta-analysis of the effect of generative AI on productivity and learning in programming

Generative artificial intelligence (GenAI) is increasingly used for programming, yet it remains unclear when and where GenAI tools lead to productivity gains. Evidence on the effects of GenAI on the long-term development of programming skills is similarly mixed. Here, we present a meta-analysis of $n = 23$ studies reporting $k = 27$ effect sizes to quantify the effect of GenAI-powered coding assistants on productivity and learning. We systematically searched (i) ACM, (ii) arXiv, (iii) Scopus, and (iv) Web of Science for studies published between 2019 and 2025. Studies were required to compare GenAI-assisted with unassisted programming using quantitative measures of (1) productivity (i.e., task completion time, commits, and lines of code) and (2) learning (i.e., exam performance). We assessed the risk of bias using RoB2 and ROBINS-I and compared standardized effect sizes using Hedges' $g$. We find a statistically significant, but moderate positive effect of GenAI assistance on developer productivity ($g = 0.33$, $95\%$ CI: $[0.09, 0.58]$), yet with substantial heterogeneity across settings. Notably, productivity gains tend to be larger in controlled experimental settings, while effects are smaller in open-source and enterprise contexts. In contrast, we find no statistically significant effect of GenAI assistance on learning outcomes ($g = 0.14$, $95\%$ CI: $[-0.18, 0.47]$). Overall, these results highlight that GenAI coding assistants can increase developer productivity, although these gains depend strongly on context. In educational settings, however, the use of GenAI does not consistently translate into improved learning or skill development, which highlights the need for careful integration of GenAI into computer science education.

Summary

Main Finding

Generative-AI coding assistants produce a small-to-moderate, statistically significant increase in short-term developer productivity (Hedges’ g = 0.33, 95% CI [0.09, 0.58], p = 0.008), but effects are highly heterogeneous across contexts (I² = 99%). In contrast, there is no reliable pooled effect on learning outcomes for students (Hedges’ g = 0.14, 95% CI [−0.18, 0.47], p = 0.389; I² = 86%). Productivity gains are large in controlled lab experiments (g ≈ 0.73) but negligible in open‑source and enterprise settings.
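As a rough consistency check on the figures above, the standard error and two-sided p-value implied by a reported 95% confidence interval can be back-calculated. The sketch below assumes a symmetric, normal-based (Wald) interval, which the summary does not confirm; the function name `p_from_ci` is hypothetical.

```python
import math


def p_from_ci(estimate, ci_low, ci_high, z_crit=1.959964):
    """Recover the SE, z statistic, and two-sided p-value implied by a 95% CI.

    Assumption: the interval is symmetric and based on the normal
    distribution (estimate +/- z_crit * SE).
    """
    se = (ci_high - ci_low) / (2 * z_crit)
    z = estimate / se
    # Two-sided tail probability of the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2))
    return se, z, p


# Pooled estimates and intervals as reported in the summary.
prod_se, prod_z, prod_p = p_from_ci(0.33, 0.09, 0.58)
learn_se, learn_z, learn_p = p_from_ci(0.14, -0.18, 0.47)

print(f"productivity: SE={prod_se:.3f}, z={prod_z:.2f}, p={prod_p:.3f}")
print(f"learning:     SE={learn_se:.3f}, z={learn_z:.2f}, p={learn_p:.3f}")
```

Under this assumption the productivity interval implies a p-value of about 0.008, matching the reported value. The learning interval implies a p-value of roughly 0.40, close to the reported 0.389; the small gap is plausibly due to rounding of the published estimate and interval, or to a different reference distribution.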

Key Points

Scope: Pre-registered meta-analysis covering studies published 2019–2025. Aggregate: n = 23 studies, k = 27 effect sizes (productivity: 14 studies, 16 effect sizes; learning: 10 studies, 11 effect sizes).
Productivity result: pooled g = 0.33 (moderate positive); substantial between-study variance (τ² = 0.22; I² = 99%).
Learning result: pooled g = 0.14 (not statistically different from zero); τ² = 0.25; I² = 86%.
Heterogeneity driver: study setting is the only statistically significant moderator for productivity (Q_M(2) = 10.51, p = 0.005). Laboratory studies show large positive effects; enterprise and open-source studies show no significant effect.
Other moderators (GenAI interface, programming language, participant experience level, randomization, experimental design) were not significant predictors of effect-size variation.
Time trend: cumulative meta-analysis shows a stable, modest productivity gain over time (despite rapid model improvements).
Robustness checks:
- Leave-one-out: pooled g ranged 0.24–0.38; two studies (P002, P004) had disproportionate influence but did not reverse the overall finding.
- Risk-of-bias (RoB2 / ROBINS‑I): not a significant moderator, though lower-risk studies had smaller point estimates (g ≈ 0.08) than higher-risk ones (g ≈ 0.50).
- Publication-bias diagnostics: Egger’s test flagged funnel asymmetry (z = 4.52, p < 0.001); trim-and-fill imputed no missing studies, suggesting asymmetry may reflect heterogeneity rather than suppressed nulls.
Measurement: Productivity proxied by task completion time, commit counts, and lines of code; learning measured by exam performance. Primary tools studied included GitHub Copilot and other LLM-based assistants; most productivity studies used Python or multi-language tasks.
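To make the pooled estimate, τ², I², and the leave-one-out check concrete, here is a minimal sketch of random-effects pooling. It uses the closed-form DerSimonian-Laird estimator of τ² for brevity, not the REML estimator the paper reports, so the numbers would differ somewhat. The effect sizes and variances are hypothetical illustration values, not data from the included studies.

```python
import math


def random_effects(g, v):
    """DerSimonian-Laird random-effects pooling (a stand-in for REML).

    g: list of effect sizes; v: list of their sampling variances.
    Returns (pooled estimate, standard error, tau^2, I^2 in percent).
    """
    k = len(g)
    w = [1.0 / vi for vi in v]                      # inverse-variance weights
    sw = sum(w)
    fixed = sum(wi * gi for wi, gi in zip(w, g)) / sw
    q = sum(wi * (gi - fixed) ** 2 for wi, gi in zip(w, g))  # Cochran's Q
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c)              # between-study variance
    i2 = max(0.0, (q - (k - 1)) / q) * 100 if q > 0 else 0.0
    w_re = [1.0 / (vi + tau2) for vi in v]          # random-effects weights
    pooled = sum(wi * gi for wi, gi in zip(w_re, g)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    return pooled, se, tau2, i2


def leave_one_out(g, v):
    """Re-estimate the pooled effect with each study removed in turn."""
    return [
        random_effects(g[:i] + g[i + 1:], v[:i] + v[i + 1:])[0]
        for i in range(len(g))
    ]


# Hypothetical effect sizes (Hedges' g) and sampling variances.
effects = [0.90, 0.75, 0.10, 0.05, 0.30, -0.05]
variances = [0.04, 0.05, 0.01, 0.02, 0.03, 0.01]

pooled, se, tau2, i2 = random_effects(effects, variances)
loo = leave_one_out(effects, variances)
print(f"pooled g={pooled:.2f}, SE={se:.2f}, tau^2={tau2:.2f}, I^2={i2:.0f}%")
print("leave-one-out range:", f"{min(loo):.2f} to {max(loo):.2f}")
```

A narrow leave-one-out range, like the 0.24–0.38 reported above, indicates that no single study drives the pooled result.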

Data & Methods

Literature search: ACM, arXiv, Scopus, Web of Science; inclusion required quantitative comparison of GenAI-assisted vs. unassisted programming on productivity or learning outcomes.
Sample sizes: productivity studies included m ≈ 3,535 participants and r ≈ 6,355 repositories; learning studies included m ≈ 1,069 participants.
Effect-size metric: standardized mean differences computed as Hedges’ g.
Meta-analytic approach: random-effects models (REML estimator) and mixed-effects meta-regressions for moderator analysis.
Risk-of-bias assessment: RoB2 for RCTs and ROBINS‑I for non-randomized studies.
Sensitivity and bias checks: leave-one-out influence analysis, Cook’s d and studentized residuals for influential studies, Egger’s regression for funnel asymmetry, and trim-and-fill method.
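For reference, the effect-size metric named above can be computed from group summary statistics as follows. This is a sketch using the common textbook formulas for Hedges' g and its approximate sampling variance; the paper's exact computation (for example, how time-based outcomes were sign-coded) is not described in this summary, and the input numbers are hypothetical.

```python
import math


def hedges_g(m1, sd1, n1, m2, sd2, n2):
    """Hedges' g for two independent groups, with its approximate variance.

    Group 1 is the GenAI-assisted group, group 2 the unassisted group.
    Assumption: higher values mean better outcomes; for task completion
    time the sign would need to be flipped so that faster is positive.
    """
    df = n1 + n2 - 2
    sd_pooled = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df)
    d = (m1 - m2) / sd_pooled            # Cohen's d
    j = 1 - 3 / (4 * df - 1)             # small-sample correction factor
    g = j * d
    var_d = (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))
    return g, j ** 2 * var_d


# Hypothetical exam scores: assisted vs. unassisted group.
g, var_g = hedges_g(m1=75.0, sd1=10.0, n1=20, m2=70.0, sd2=10.0, n2=20)
print(f"g={g:.3f}, variance={var_g:.3f}")
```

The resulting pairs of g and variance are the inputs a random-effects model would pool across studies.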

Implications for AI Economics

Productivity and short-run labor effects
- Firms may realize modest short-term productivity gains from GenAI assistants, but the gains are context-dependent and appear concentrated in controlled/task-limited work. Expect heterogeneous returns to adoption across teams, codebases, and task types.
- Large measured effects in lab settings likely overstate real-world firm-level productivity impacts. Economic models and cost‑benefit analyses should treat lab-based estimates as upper bounds.
- Heterogeneity (I² ≈ 99%) implies uneven returns across workers, with potential for reallocation of tasks (toward review, integration, debugging) and changes in skill complementarities rather than uniform labor displacement.
Human capital and education
- No consistent evidence that GenAI improves core programming learning outcomes. Widespread classroom adoption could alter skill acquisition processes (cognitive offloading), with potential long-run effects on worker ability and credential signaling.
- Policymakers and educators should be cautious: credential validity, exam design, and curricula may need revision to ensure skill persistence and to measure true understanding rather than tool-assisted output.
Wages, task composition, and inequality
- If GenAI mainly augments productivity for some tasks/workers (e.g., those doing short, well-specified programming tasks), we may see differential wage gains and task reallocation rather than uniform labor-cost reductions.
- Differential adoption and effectiveness could exacerbate within-occupation inequality (those who integrate tools effectively capture more gains).
Firm strategy and investments
- To capture real-world productivity gains, firms likely need complementary investments: integration into workflows, quality assurance, guidelines to control technical debt, and training on effective human-AI collaboration.
- Potential hidden costs—higher maintenance, security vulnerabilities, and technical debt from AI-generated code—should be incorporated into ROI estimates.
Research and policy needs
- Economic models should incorporate contextual heterogeneity (task type, codebase familiarity, collaboration overhead) and longer-term outcomes (maintenance costs, skill formation).
- Priority empirical work: field experiments and longitudinal studies measuring durable productivity, code quality, maintenance costs, and worker learning trajectories.
- Education policy: trials of pedagogical interventions (e.g., structured AI use, forced unaided assessments, tool-aware curricula) to preserve foundational skills while leveraging AI benefits.

Short summary: GenAI coding assistants can raise short-term programmer productivity modestly, but effects depend crucially on context and do not reliably translate into improved learning. Economic assessments of AI in software development should account for strong heterogeneity, potential long‑run human-capital effects, and costs associated with code quality and maintenance.

Assessment

Paper Type: review_meta
Evidence Strength: medium. Synthesizes multiple studies with a systematic search and standard meta-analytic methods and finds a consistent positive productivity effect, but with substantial heterogeneity across settings, mixed study designs (RCTs and non-randomized), potential publication bias, variation in outcome measures (task time, commits, LOC), and limited long-term evidence on skill development.
Methods Rigor: high. Comprehensive search across four major databases (ACM, arXiv, Scopus, Web of Science), pre-specified inclusion criteria for quantitative comparisons, use of Hedges' g for standardization, and formal risk-of-bias assessment with RoB2 and ROBINS-I; methods are appropriate and transparent, though interpretation is constrained by heterogeneity and primary-study quality.
Sample: Meta-analysis of 23 studies (27 effect sizes) published 2019–2025, covering controlled lab experiments, classroom/educational evaluations, and observational analyses of open-source and enterprise settings; outcomes include productivity measures (task completion time, commits, lines of code) and learning measures (exam performance); primary studies include randomized and non-randomized designs with varying sample sizes and populations (students and professional developers).
Themes: productivity, skills_training
Identification: Meta-analytic aggregation of 23 primary studies (27 effect sizes) comparing GenAI-assisted versus unassisted programming; standardized effect sizes (Hedges' g) pooled, with risk of bias assessed using RoB2 and ROBINS-I and subgroup analyses (e.g., controlled experiments vs. open-source/enterprise) to probe heterogeneity; no new experimental assignment, so causal claims rely on the internal validity of the included primary studies.
Generalizability:
- Heterogeneous study settings (lab experiments vs. field/enterprise vs. open-source) limit a single pooled estimate's external validity.
- Rapid evolution of GenAI models and tools (2019–2025) means effects may not generalize to current or future model capabilities.
- Variation in developer skill levels (students vs. professionals) and tasks reduces applicability to specific workforce segments.
- Outcome measures (LOC, commits) are imperfect proxies for meaningful productivity and can differ in interpretation across contexts.
- Possible geographic or platform sampling biases in primary studies (e.g., over-representation of student samples or Western institutions).

Claims (9)

Claim	Category	Direction	Confidence	Outcome	Details
We conducted a meta-analysis of n = 23 studies reporting k = 27 effect sizes on GenAI-powered coding assistants.	Other	null_result	high	meta-analytic sample (number of studies/effect sizes)	n = 23 studies, k = 27 effect sizes	0.4
GenAI assistance produces a statistically significant, moderate positive effect on developer productivity (Hedges' g = 0.33, 95% CI [0.09, 0.58]).	Developer Productivity	positive	high	developer productivity (task completion time, commits, lines of code aggregated as productivity measures)	g = 0.33, 95% CI: [0.09, 0.58]	0.24
There is substantial heterogeneity in the productivity effects across settings.	Developer Productivity	mixed	high	variation in productivity effect sizes across study contexts		0.24
Productivity gains tend to be larger in controlled experimental settings and smaller in open-source and enterprise contexts.	Developer Productivity	mixed	medium	developer productivity by study context (experimental vs open-source vs enterprise)		0.14
GenAI assistance has no statistically significant effect on learning outcomes (Hedges' g = 0.14, 95% CI [-0.18, 0.47]).	Skill Acquisition	null_result	high	learning (exam performance)	g = 0.14, 95% CI: [-0.18, 0.47]	0.24
Studies were required to compare GenAI-assisted with unassisted programming using quantitative measures of productivity (task completion time, commits, lines of code) and learning (exam performance).	Other	null_result	high	study inclusion criteria (type of comparison and outcome measures)		0.4
Risk of bias in included studies was assessed using RoB2 and ROBINS-I.	Other	null_result	high	risk of bias assessment methods		0.4
Overall, GenAI coding assistants can increase developer productivity, although these gains depend strongly on context.	Developer Productivity	positive	high	developer productivity	g = 0.33 (pooled estimate)	0.24
In educational settings, the use of GenAI does not consistently translate into improved learning or skill development, highlighting the need for careful integration of GenAI into computer science education.	Skill Acquisition	null_result	high	learning and skill development in educational settings	g = 0.14, 95% CI: [-0.18, 0.47]	0.04

GenAI coding assistants deliver modest productivity gains—about a third of a standard deviation—mostly in controlled experiments, but they do not reliably boost learning or skill acquisition in educational settings.