In a 22-person lab experiment, GitHub Copilot raised immediate coding performance and reduced workload compared with human pair partners, but human partners produced more positive emotional experiences and possibly better one-week retention.
Code-generating Artificial Intelligence has gained popularity within both professional and educational programming settings over the past several years. While research and pedagogy are beginning to cope with this change, computing students are left to bear the unforeseen consequences of AI amidst a dearth of empirical evidence about its effects. Though pair programming between students is well studied and known to be beneficial to self-efficacy and academic achievement, it remains underutilized and further threatened by the proposition that AI can replace a human programming partner. In this paper, we present a controlled pair programming study with 22 participants who wrote Python code under time pressure in teams of two and individually with GitHub Copilot for 20 minutes each. They were incentivized by bonus compensation to balance performance with understanding and were retested individually on the programming tasks after a retention interval of one week. Subjective measures of workload and emotion as well as objective measures of performance and learning (retest performance) were collected. Results showed that participants performed significantly better with GitHub Copilot than their human teammate, and several dimensions of their workload were significantly reduced. However, the emotional effect of the human teammate was significantly more positive and arousing as compared to working with Copilot. Furthermore, there was a nonsignificant absolute retest performance reduction in the AI condition and a larger retest performance decrement in the AI condition. We recommend that educators strongly consider revisiting pair programming as an educational tool in addition to embracing modern AI.
Summary
Main Finding
In a controlled within-subjects study of 22 novice-to-intermediate Python programmers, participants solved timed programming tasks significantly better and with lower workload when working with GitHub Copilot than when paired with a human teammate. However, human partners produced more positive and arousing emotional experiences, and AI-assisted work showed a (nonsignificant) drop in one-week retention and a larger retest performance decrement than human-paired work. The authors conclude that AI gives short-term performance and workload benefits but may weaken the affective, motivational, and possibly longer-term learning advantages of human-human pair programming.
Key Points
- Study design: within-subjects comparison (each participant experienced both paradigms) of human-human pair programming vs. human-AI (GitHub Copilot) programming; two 20-minute sessions on matched sets of short Python tasks, with a one-week retention/retest session.
- Primary outcomes collected: objective performance during sessions, learning retention (retest performance), subjective workload measures, and self-reported emotion (valence and arousal).
- Core results:
  - Performance: Participants completed tasks significantly better with GitHub Copilot than with a human teammate under time pressure.
  - Workload: Several dimensions of workload were significantly reduced when using Copilot.
  - Emotion: Working with a human teammate elicited significantly more positive and higher-arousal emotional responses than working with Copilot.
  - Retention: The AI condition showed a nonsignificant absolute reduction in one-week retest performance and a larger retest performance decrement relative to the human condition.
- Theoretical framing: The authors interpret the results using Control-Value Theory (emotional valence/arousal and learning foci) and Cognitive Load Theory (intrinsic/germane/extraneous load). They hypothesize that AI reduces extraneous and intrinsic load and immediate friction, but may diminish the beneficial conflict, social motivation, and metacognitive engagement that support deeper learning.
- Recommendation: Educators should both embrace AI tools for their efficiency gains and re-evaluate (or re-emphasize) human pair programming as an educational practice to preserve affective and learning benefits.
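The retention result above separates two quantities that are easy to conflate: the absolute retest score and the decrement between session and retest. The following is a minimal Python sketch of that distinction, assuming the decrement is defined as session score minus retest score per participant (this summary does not give the paper's exact formula). The function name `retention_summary` and every score below are hypothetical illustrations, not data from the study.

```python
from statistics import mean


def retention_summary(session_scores, retest_scores):
    """Summarize one condition: mean session score, mean retest score, mean decrement.

    Assumption: decrement = session score - retest score, per participant.
    """
    if len(session_scores) != len(retest_scores):
        raise ValueError("each participant needs a session and a retest score")
    decrements = [s - r for s, r in zip(session_scores, retest_scores)]
    return {
        "session": mean(session_scores),
        "retest": mean(retest_scores),
        "decrement": mean(decrements),
    }


# Hypothetical scores (tasks passed out of 4) for four made-up participants.
copilot = retention_summary([4, 4, 3, 4], [2, 3, 2, 2])
human = retention_summary([2, 3, 2, 3], [2, 3, 2, 2])
print(copilot)
print(human)
```

In this made-up example both conditions end at the same mean retest score, yet the Copilot condition has the larger decrement because it started from a higher session score. That is one way a "larger decrement" can coexist with a small or nonsignificant absolute retest difference.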
Data & Methods
- Participants: 22 volunteers (mostly undergraduate CS students; mean age ≈ 20.8; 16 male / 6 female; mixed ethnicities), screened for novice-to-intermediate Python ability and asked to abstain from Python study during the one-week retention interval.
- Technical setup: VS Code on Windows with two identical workstations. One IDE had GitHub Copilot (Copilot Chat using GPT-4.1) enabled; the other had standard Python IntelliSense. Browsing and use of other browser-based AI tools were disallowed and monitored.
- Tasks: Eight short Python problems drawn from the HumanEval dataset, organized into two matched 4-task pools (≈5 minutes per task, 20 minutes per condition). Problem prompts were provided as PDF (so Copilot could not automatically ingest them); participants had to supply context to Copilot via chat if desired. Copilot was verified to be capable of producing passing solutions for these tasks in one shot, but participants were not told this.
- Procedure and incentives: Each participant completed both conditions (human pair and Copilot) in counterbalanced order, writing code under time pressure and with a monetary/bonus incentive designed to balance speed/performance and understanding. After one week, participants were retested individually on the same tasks to measure retention. The IRB allowed limited deception for design integrity; participants were debriefed.
- Measures: Objective performance (task completion / correctness on automated tests), one-week retest performance (learning/retention), subjective workload (multi-dimensional), and self-reported emotional valence and arousal (Control-Value Theory perspective). Exact instruments (e.g., NASA-TLX or specific emotion scales) are described in the full paper; analyses reported significance tests comparing conditions.
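Counterbalancing of the kind described above can be sketched as follows. This is a minimal illustration, assuming a simple rotation through the four combinations of first condition and first task pool; this summary does not state the paper's actual assignment scheme. The names `counterbalance`, `CONDITIONS`, and `TASK_POOLS` are hypothetical, and the sketch ignores the constraint that two teammates must share the same human-pair session.

```python
from itertools import cycle, product

# Hypothetical labels for the two conditions and the two matched task pools.
CONDITIONS = ("human_pair", "copilot")
TASK_POOLS = ("pool_A", "pool_B")


def counterbalance(participant_ids):
    """Rotate participants through the four order-by-pool combinations.

    Each participant gets both conditions and both pools exactly once;
    only the order and the condition-to-pool pairing vary.
    """
    cells = list(product(CONDITIONS, TASK_POOLS))
    schedule = {}
    for pid, (first_cond, first_pool) in zip(participant_ids, cycle(cells)):
        # The second session uses whichever condition and pool were not used first.
        second_cond = CONDITIONS[1 - CONDITIONS.index(first_cond)]
        second_pool = TASK_POOLS[1 - TASK_POOLS.index(first_pool)]
        schedule[pid] = [(first_cond, first_pool), (second_cond, second_pool)]
    return schedule


schedule = counterbalance(range(1, 23))  # 22 participants, as in the study
for pid in (1, 2, 3, 4):
    print(pid, schedule[pid])
```

With 22 participants the four cells cannot be filled equally (22 is not a multiple of 4), so a real design would need to decide how to handle the imbalance.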
Implications for AI Economics
- Short-run productivity vs. long-run human capital formation:
  - AI tools like Copilot can raise immediate output and lower effort per task (productivity gains), which firms and educational institutions will find attractive.
  - However, the evidence here suggests possible reductions in knowledge retention and metacognitive engagement when AI replaces human collaboration, implying potential long-term depreciation of novices' skills. Economic models of AI adoption should therefore account for dynamic trade-offs between short-run productivity and long-run human capital accumulation.
- Labor demand and skill composition:
  - If AI reduces the routine problem-solving practice novices receive, future entry-level workers may be less prepared for tasks that require independent debugging, specification, or conceptual understanding. Demand may therefore shift toward workers who can supervise, audit, and augment AI, increasing premiums on higher-level metacognitive and collaborative skills.
- Complementarity and substitution:
  - Findings underscore complementarity rather than pure substitution: combining AI assistance with human-led pedagogical practices (e.g., pair programming, mentorship) may capture both productivity and learning/affective benefits. Policy and firm-level decisions should favor hybrid arrangements that preserve social learning externalities.
- Welfare and workplace well-being:
  - Human partners produced more positive and stimulating emotional experiences. For organizational outcomes tied to morale, retention, and creative collaboration, replacing interpersonal work with AI-only interactions could yield hidden costs (reduced engagement, team cohesion) that offset efficiency gains.
- Incentive design and assessment:
  - Educational incentives and workplace evaluation need adjustment to avoid encouraging overreliance on AI (which can inflate short-term performance metrics). Assessments should measure retained competence and ability to perform without AI to discourage degradation of core skills.
- Policy and investment priorities:
  - Investment in training curricula should explicitly teach skill maintenance under AI augmentation (e.g., debugging AI output, prompting for learning rather than solutions).
  - Public policy and educational funding should recognize potential negative externalities of wholesale AI substitution on human capital and support programs that integrate AI while preserving interpersonal learning.
- Research and measurement priorities for economists:
  - Quantify long-run retention and productivity impacts from AI-assisted learning at scale (field experiments, longitudinal studies).
  - Estimate the externality value of social/affective learning that human peers provide (how much future productivity is lost when novices skip peer interaction).
  - Model optimal firm and educational policies balancing immediate efficiency and future human capital depreciation.
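The trade-off between immediate efficiency and later human capital can be made concrete with a toy two-period model. The sketch below is purely illustrative and is not taken from the paper: the linear productivity gain, the quadratic skill loss, and every parameter value are assumptions chosen for simplicity, and the names `total_value` and `best_ai_share` are hypothetical.

```python
def total_value(ai_share, gain=0.5, discount=0.9, skill_loss=0.6, skill_value=1.0):
    """Two-period payoff for a given AI reliance share in [0, 1].

    All functional forms and parameter values are hypothetical.
    """
    # Period 1: output rises linearly with AI reliance.
    output_now = 1.0 + gain * ai_share
    # Period 2: retained skill falls with the square of AI reliance.
    skill_later = 1.0 - skill_loss * ai_share ** 2
    return output_now + discount * skill_value * skill_later


def best_ai_share(steps=100, **params):
    """Grid-search the AI share that maximizes the two-period payoff."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda a: total_value(a, **params))


print(best_ai_share())                # interior optimum under the default assumptions
print(best_ai_share(skill_loss=0.1))  # weak skill loss pushes the optimum to full reliance
```

Under these assumptions the preferred level of AI reliance depends on how strongly reliance erodes later skill and on how much future skill is valued. Those are the quantities that the longitudinal evidence called for above would need to estimate.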
Practical recommendations for stakeholders
- Educators: Adopt AI tools to improve throughput and relieve tedious syntax burdens, but require structured human interaction (pairing, peer review, reflective prompts, and assessments without AI) to protect learning and motivation.
- Firms: Use AI to boost short-run coding speed, but preserve mentorship, pair-programming practices, and training rotations that maintain novice skill acquisition and team cohesion.
- Economists and policymakers: Incorporate dynamic human capital effects (retention, motivation, social capital) into cost–benefit analyses of AI deployment in education and early-career workforces; prioritize longitudinal evidence.
Assessment
Claims (10)
| Claim | Category | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|---|
| We conducted a controlled pair programming study with 22 participants who wrote Python code under time pressure in teams of two and individually with GitHub Copilot for 20 minutes each. | Other | null_result | high | study design / experimental conditions (teams of two vs. individual with Copilot; 20 min per condition) | n=22; 1.0 |
| Participants were incentivized by bonus compensation to balance performance with understanding. | Other | null_result | high | incentive structure (bonus compensation) | n=22; 1.0 |
| Participants were retested individually on the programming tasks after a retention interval of one week. | Other | null_result | high | retest performance after one-week retention interval | n=22; 1.0 |
| Participants performed significantly better with GitHub Copilot than with their human teammate. | Developer Productivity | positive | high | programming performance on timed Python tasks | n=22; 0.6 |
| Several dimensions of participants' workload were significantly reduced when using GitHub Copilot. | Worker Satisfaction | positive | high | subjective workload (multiple dimensions) | n=22; 0.6 |
| The emotional effect of the human teammate was significantly more positive and arousing compared to working with Copilot. | Worker Satisfaction | positive | high | emotional valence and arousal during task | n=22; 0.6 |
| There was a nonsignificant absolute retest performance reduction in the AI condition and a larger retest performance decrement in the AI condition (i.e., retention decreased more after using Copilot). | Skill Acquisition | negative | high | retest performance (learning retention) after one week | n=22; 0.3 |
| Given the results, educators should revisit pair programming as an educational tool in addition to embracing modern AI. | Training Effectiveness | mixed | high | educational practice recommendation (pair programming vs. AI-assisted instruction) | n=22; 0.1 |
| Pair programming between students is well studied and known to be beneficial to self-efficacy and academic achievement. | Training Effectiveness | positive | medium | self-efficacy and academic achievement associated with pair programming | 0.36 |
| Code-generating Artificial Intelligence has gained popularity within both professional and educational programming settings over the past several years. | Adoption Rate | positive | high | adoption/popularity of code-generating AI | 0.3 |