
In a 22-person lab experiment, GitHub Copilot raised immediate coding performance and reduced workload compared with human pair partners, but human partners produced more positive emotional experiences and possibly better one-week retention.

Fast and Forgettable: A Controlled Study of Novices' Performance, Learning, Workload, and Emotion in AI-Assisted and Human Pair Programming Paradigms

Nicholas Gardella, James Prather, Juho Leinonen, Paul Denny, Raymond Pettit, Sara L. Riggs · April 20, 2026

arXiv · RCT · medium evidence · 7/10 relevance

In a controlled lab study of 22 student programmers, GitHub Copilot produced higher short-term coding performance and lower workload than human pair partners. Human partners, however, elicited more positive and arousing emotional responses, and the AI condition showed a suggestive but nonsignificant larger retention decline after one week.

Code-generating Artificial Intelligence has gained popularity within both professional and educational programming settings over the past several years. While research and pedagogy are beginning to cope with this change, computing students are left to bear the unforeseen consequences of AI amidst a dearth of empirical evidence about its effects. Though pair programming between students is well studied and known to be beneficial to self-efficacy and academic achievement, it remains underutilized and further threatened by the proposition that AI can replace a human programming partner. In this paper, we present a controlled pair programming study with 22 participants who wrote Python code under time pressure in teams of two and individually with GitHub Copilot for 20 minutes each. They were incentivized by bonus compensation to balance performance with understanding and were retested individually on the programming tasks after a retention interval of one week. Subjective measures of workload and emotion as well as objective measures of performance and learning (retest performance) were collected. Results showed that participants performed significantly better with GitHub Copilot than their human teammate, and several dimensions of their workload were significantly reduced. However, the emotional effect of the human teammate was significantly more positive and arousing as compared to working with Copilot. Furthermore, there was a nonsignificant absolute retest performance reduction in the AI condition and a larger retest performance decrement in the AI condition. We recommend that educators strongly consider revisiting pair programming as an educational tool in addition to embracing modern AI.

Summary

Main Finding

In a controlled within-subjects study of 22 novice/intermediate Python programmers, participants solved timed programming tasks significantly better and with lower workload when working individually with GitHub Copilot than when paired with a human teammate. However, human partners produced more positive and arousing emotional experiences, and AI-assisted work showed a (nonsignificant) drop in one-week retention and a larger retest performance decrement than human-paired work. The authors conclude that AI gives short-term performance and workload benefits but may weaken the affective, motivational, and possibly longer-term learning advantages of human-human pair programming.

Key Points

Study design: within-subjects comparison (each participant experienced both paradigms) of human-human pair programming vs. human-AI (GitHub Copilot) programming; two 20-minute sessions on matched sets of short Python tasks, with a one-week retention/retest session.
Primary outcomes collected: objective performance during sessions, learning retention (retest performance), subjective workload measures, and self-reported emotion (valence and arousal).
Core results:
- Performance: Participants completed tasks significantly better with GitHub Copilot than with a human teammate under time pressure.
- Workload: Several dimensions of workload were significantly reduced when using Copilot.
- Emotion: Working with a human teammate elicited significantly more positive and higher-arousal emotional responses than working with Copilot.
- Retention: The AI condition showed a nonsignificant absolute reduction in one-week retest performance and a larger retest performance decrement relative to the human condition.
Theoretical framing: Authors interpret results using Control-Value Theory (emotional valence/arousal and learning foci) and Cognitive Load Theory (intrinsic/germane/extraneous load). They hypothesize AI reduces extraneous/intrinsic load and immediate friction but may diminish beneficial conflict, social motivation, and metacognitive engagement that support deeper learning.
Recommendation: Educators should both embrace AI tools for their efficiency gains and re-evaluate (or re-emphasize) human pair programming as an educational practice to preserve affective and learning benefits.
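The retention result above involves two distinct quantities: the absolute retest score and the decrement from in-session score to retest score. A minimal sketch of how such a decrement could be computed per condition follows. It assumes the decrement is simply session score minus retest score, which this summary does not confirm; the function names and the example records are hypothetical placeholders, not data or code from the study.

```python
from statistics import mean


def retention_decrement(session_score, retest_score):
    """Drop from in-session score to one-week retest score (positive = forgetting)."""
    return session_score - retest_score


def mean_decrement_by_condition(records):
    """Average the decrement per condition.

    records: iterable of (condition, session_score, retest_score) tuples.
    """
    by_condition = {}
    for condition, session, retest in records:
        by_condition.setdefault(condition, []).append(
            retention_decrement(session, retest)
        )
    return {condition: mean(values) for condition, values in by_condition.items()}


# Hypothetical placeholder records, NOT results from the paper.
example = [
    ("ai", 4, 2),
    ("human", 3, 2),
    ("ai", 3, 2),
    ("human", 2, 2),
]
decrements = mean_decrement_by_condition(example)
```

In this made-up example both conditions end at the same retest score, yet the "ai" condition has the larger decrement because it started higher. That is why the two quantities are worth reporting separately.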

Data & Methods

Participants: 22 volunteers (mostly undergraduate CS students; mean age ≈ 20.8; 16 male / 6 female; mixed ethnicities), screened for novice-to-intermediate Python ability and asked to abstain from Python study during the one-week retention interval.
Technical setup: VS Code on Windows with two identical workstations. One IDE had GitHub Copilot (Copilot Chat using GPT-4.1) enabled; the other had standard Python IntelliSense. Browsing and use of other browser-based AI tools were disallowed and monitored.
Tasks: Eight short Python problems drawn from the HumanEval dataset, organized into two matched 4-task pools (≈5 minutes per task, 20 minutes per condition). Problem prompts were provided as PDF (so Copilot could not automatically ingest them); participants had to supply context to Copilot via chat if desired. Copilot was verified to be capable of producing passing solutions for these tasks in one shot, but participants were not told this.
Procedure and incentives: Each participant completed both conditions (human pair and Copilot) in counterbalanced order, writing code under time pressure and with a monetary/bonus incentive designed to balance speed/performance and understanding. After one week, participants were retested individually on the same tasks to measure retention. The IRB allowed limited deception for design integrity; participants were debriefed.
Measures: Objective performance (task completion / correctness on automated tests), one-week retest performance (learning/retention), subjective workload (multi-dimensional), and self-reported emotional valence and arousal (Control-Value Theory perspective). Exact instruments (e.g., NASA-TLX or specific emotion scales) are described in the full paper; analyses reported significance tests comparing conditions.
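As an illustration of what counterbalancing can look like in a two-condition design with two task pools, the sketch below rotates units through the four combinations of first condition and first pool. This is an assumption about one reasonable scheme, not the authors' documented procedure: the summary states only that condition order was counterbalanced. The names `counterbalanced_schedule`, `pool_A`, and `pool_B` are hypothetical, and the unit of assignment is taken to be a pair, since both teammates must share a session order.

```python
from itertools import cycle, product

CONDITIONS = ("human_pair", "copilot")
POOLS = ("pool_A", "pool_B")  # hypothetical labels for the two matched task pools


def counterbalanced_schedule(unit_ids):
    """Rotate units through the four (first condition, first pool) cells."""
    cells = list(product(CONDITIONS, POOLS))
    schedule = {}
    for unit, (first_condition, first_pool) in zip(unit_ids, cycle(cells)):
        # The second session uses the other condition and the other pool.
        second_condition = CONDITIONS[1 - CONDITIONS.index(first_condition)]
        second_pool = POOLS[1 - POOLS.index(first_pool)]
        schedule[unit] = [
            (first_condition, first_pool),
            (second_condition, second_pool),
        ]
    return schedule


# Eight hypothetical units, so each of the four cells is used exactly twice.
schedule = counterbalanced_schedule([f"pair_{i}" for i in range(1, 9)])
```

With a number of units that is not a multiple of four, the cells cannot be filled evenly, so some residual imbalance in order or pool assignment would remain.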

Implications for AI Economics

Short-run productivity vs. long-run human capital formation:
- AI tools like Copilot can raise immediate output and lower effort per task (productivity gains), which firms and educational institutions will find attractive.
- However, the evidence here suggests possible reductions in knowledge retention and metacognitive engagement when AI replaces human collaboration—implying potential long-term depreciation of novices' skills. Economic models of AI adoption should therefore account for dynamic trade-offs between short-run productivity and long-run human capital accumulation.
Labor demand and skill composition:
- If AI reduces the routine problem-solving practice novices receive, future entry-level workers may be less prepared for tasks that require independent debugging, specification, or conceptual understanding. Demand may therefore shift toward workers who can supervise, audit, and augment AI, increasing premiums on higher-level metacognitive and collaborative skills.
Complementarity and substitution:
- Findings underscore complementarity rather than pure substitution: combining AI assistance with human-led pedagogical practices (e.g., pair programming, mentorship) may capture both productivity and learning/affective benefits. Policy and firm-level decisions should favor hybrid arrangements that preserve social learning externalities.
Welfare and workplace well-being:
- Human partners produced more positive and stimulating emotional experiences. For organizational outcomes tied to morale, retention, and creative collaboration, replacing interpersonal work with AI-only interactions could yield hidden costs (reduced engagement, team cohesion) that offset efficiency gains.
Incentive design and assessment:
- Educational incentives and workplace evaluation need adjustment to avoid encouraging overreliance on AI (which can inflate short-term performance metrics). Assessments should measure retained competence and ability to perform without AI to discourage degradation of core skills.
Policy and investment priorities:
- Investment in training curricula should explicitly teach skill maintenance under AI augmentation (e.g., debugging AI output, prompting for learning rather than solutions).
- Public policy and educational funding should recognize potential negative externalities of wholesale AI substitution on human capital and support programs that integrate AI while preserving interpersonal learning.
Research and measurement priorities for economists:
- Quantify long-run retention and productivity impacts from AI-assisted learning at scale (field experiments, longitudinal studies).
- Estimate the externality value of social/affective learning that human peers provide (how much future productivity is lost when novices skip peer interaction).
- Model optimal firm and educational policies balancing immediate efficiency and future human capital depreciation.

Practical recommendations for stakeholders:
- Educators: Adopt AI tools to improve throughput and relieve tedious syntax burdens, but require structured human interaction (pairing, peer review, reflective prompts, and assessments without AI) to protect learning and motivation.
- Firms: Use AI to boost short-run coding speed, but preserve mentorship, pair-programming practices, and training rotations that maintain novice skill acquisition and team cohesion.
- Economists and policymakers: Incorporate dynamic human capital effects (retention, motivation, social capital) into cost–benefit analyses of AI deployment in education and early-career workforces; prioritize longitudinal evidence.


Assessment

Paper Type: rct
Evidence Strength: medium. The experimental design and within-subject comparisons provide credible causal evidence for immediate effects on performance and workload, but the small sample (N=22), short exposure (20 minutes per condition), potential order/learning effects, and imprecise retention results limit confidence and external validity.
Methods Rigor: medium. Study uses objective and subjective measures, incentives to balance speed and understanding, and a one-week retention test, but is constrained by a small sample, likely non-independence from team assignments, limited information about randomization/counterbalancing and statistical power, and short single-AI (Copilot) treatment.
Sample: 22 participants (computer-science students) who wrote Python code; worked in teams of two for pair-programming and individually with GitHub Copilot for 20 minutes each, incentivized via bonus pay, and retested individually on the same tasks after a one-week retention interval; demographic details and prior experience levels not fully specified in summary.
Themes: productivity, human_ai_collab, skills_training
Identification: Within-subjects controlled lab experiment comparing the same participants' performance across two conditions (pair-programming with a human teammate vs. solo programming with GitHub Copilot) on time-limited Python tasks, using objective performance and subjective workload/emotion measures; causal claims rely on within-subject contrasts that hold constant individual ability and (presumably) counterbalance task/condition order to mitigate order effects.
Generalizability:
- Small lab sample of students: not representative of professional developers or the broader labor force.
- Short, time-limited tasks (20 minutes): may not reflect longer-term real-world project work.
- Single AI system (GitHub Copilot): results may not generalize to other code-generation models or future versions.
- Pair composition and prior familiarity between teammates unspecified: interpersonal dynamics may differ in other settings.
- Retention evidence limited to a one-week interval and was not statistically significant: long-term learning effects are unclear.
- Python-only tasks: may not generalize to other languages, domains, or non-programming tasks.

Claims (10)

Claim	Category	Direction	Confidence	Outcome	Details
We conducted a controlled pair programming study with 22 participants who wrote Python code under time pressure in teams of two and individually with GitHub Copilot for 20 minutes each.	Other	null_result	high	study design / experimental conditions (teams of two vs individual with Copilot; 20 min per condition)	n=22 1.0
Participants were incentivized by bonus compensation to balance performance with understanding.	Other	null_result	high	incentive structure (bonus compensation)	n=22 1.0
Participants were retested individually on the programming tasks after a retention interval of one week.	Other	null_result	high	retest performance after one-week retention interval	n=22 1.0
Participants performed significantly better with GitHub Copilot than with their human teammate.	Developer Productivity	positive	high	programming performance on timed Python tasks	n=22 0.6
Several dimensions of participants' workload were significantly reduced when using GitHub Copilot.	Worker Satisfaction	positive	high	subjective workload (multiple dimensions)	n=22 0.6
The emotional effect of the human teammate was significantly more positive and arousing compared to working with Copilot.	Worker Satisfaction	positive	high	emotional valence and arousal during task	n=22 0.6
There was a nonsignificant absolute retest performance reduction in the AI condition and a larger retest performance decrement in the AI condition (i.e., retention decreased more after using Copilot).	Skill Acquisition	negative	high	retest performance (learning retention) after one week	n=22 0.3
Given the results, educators should revisit pair programming as an educational tool in addition to embracing modern AI.	Training Effectiveness	mixed	high	educational practice recommendation (pair programming vs AI-assisted instruction)	n=22 0.1
Pair programming between students is well studied and known to be beneficial to self-efficacy and academic achievement.	Training Effectiveness	positive	medium	self-efficacy and academic achievement associated with pair programming	0.36
Code-generating Artificial Intelligence has gained popularity within both professional and educational programming settings over the past several years.	Adoption Rate	positive	high	adoption/popularity of code-generating AI	0.3