Generative AI helps break user stories into finer, more complete task lists but can’t yet replace human planners; developers prefer hybrid workflows combining GitLab Duo suggestions with manual review.

Splitting User Stories Into Tasks with AI -- A Foe or an Ally?

Luka Pavlič, Reinhard Bernsteiner, Stephan Schlögl, Christian Ploder · May 08, 2026

arXiv · RCT · medium evidence · relevance 7/10

In a controlled experiment, AI-assisted task splitting with GitLab Duo produced more granular and more complete task lists than traditional methods, but participants preferred a hybrid workflow because the AI still generated irrelevant or incorrect items requiring human oversight.

In agile software development, breaking down user stories into actionable tasks is a critical yet time-consuming process. This paper investigates the potential of Generative AI tools to assist in task splitting, aiming to enhance planning efficiency. We conducted a controlled experiment comparing traditional task-splitting methods with AI-assisted approaches using GitLab Duo. Our findings indicate that while current AI tools are not yet mature enough to replace developers, they can aid in generating more granular task lists and ensuring no important tasks are overlooked. Participants favored a hybrid approach, combining AI tools with conventional methods to maintain high accuracy in planning. This study highlights the potential benefits and limitations of integrating Generative AI into agile development processes, suggesting that AI tools can serve as valuable aids in task splitting, provided there is human oversight to filter out irrelevant tasks.

Summary

Main Finding

Generative AI (here: GitLab Duo) cannot yet replace developers for task-splitting in agile planning but can be a useful assistive tool. AI-generated breakdowns are more granular and surface tasks humans sometimes omit (e.g., tests, docs, refactors), but they also produce irrelevant or context-insensitive tasks. Participants strongly prefer a hybrid workflow where AI suggestions are curated by humans.

Key Points

Controlled experiment: AI-assisted (GitLab Duo) vs conventional task-splitting.
Sample: 42 higher-year CS students (39 in final analysis), teams of 3, 6 experimental teams, 7 control teams.
Workload: each team given 8 user stories (total 48 split by experimental, 56 by control).
Output differences:
- AI group produced 260 tasks (avg 5.4 tasks/user story).
- Control group produced 184 tasks (avg 3.2 tasks/user story).
- Teams working from AI-generated plans implemented on average 59% of the generated tasks; teams working from manual plans implemented all of theirs.
- AI lists commonly included testing, documentation, and refactoring tasks that manual lists sometimes missed, but also unrelated items.
Perceptions:
- 100% of participants preferred a hybrid AI+human approach for future task splitting.
- Only 10% judged AI as producing more relevant tasks than conventional methods; 55% disagreed, 35% were uncertain.
Additional note from the authors’ earlier related work: automated effort estimation by generative AI in this context showed poor accuracy (≈16%).

Data & Methods

Design: One-factor controlled experiment comparing two task-splitting methods (conventional vs GitLab Duo).
Participants: Voluntary, experienced students (91% had 1–3 years dev experience); 82% had prior exposure to AI dev tools.
Procedure: Three-session simulated sprint:
- Session 1: Setup, pre-test, and distribution to groups; teams created task lists from provided user stories.
- Session 2: Implementation of selected user stories; progress tracked in GitLab.
- Session 3: Acceptance testing and post-test questionnaire.
Measures collected:
- Number and content of generated tasks per user story/team.
- Implementation outcomes (which generated tasks were executed).
- Participant attitudes via pre/post questionnaires.
Key quantitative results: 260 vs 184 tasks; 5.4 vs 3.2 tasks per story; 59% implementation rate for AI-generated tasks.
Threats to validity noted by authors: small/short controlled setting; student sample (not full-time industry devs); single AI tool (GitLab Duo) and no domain fine-tuning; rapid evolution of generative AI could change results.
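
The headline numbers can be rechecked from the raw counts reported above. The following is a minimal sketch, not the authors' analysis script: the variable and function names are illustrative, and applying the 59% average to the pooled task count is an approximation, since the summary reports it as a per-team average.

```python
# Counts as reported in this summary (names here are illustrative).
TEAMS = {"ai": 6, "control": 7}
STORIES_PER_TEAM = 8
TASKS = {"ai": 260, "control": 184}
AI_IMPLEMENTATION_RATE = 0.59  # average share of AI-generated tasks implemented


def stories_split(group: str) -> int:
    """Total user stories split by all teams in a group."""
    return TEAMS[group] * STORIES_PER_TEAM


def tasks_per_story(group: str) -> float:
    """Average number of tasks produced per user story in a group."""
    return TASKS[group] / stories_split(group)


# Assumption: the per-team average rate is applied to the pooled total.
approx_implemented_ai_tasks = AI_IMPLEMENTATION_RATE * TASKS["ai"]

if __name__ == "__main__":
    for group in ("ai", "control"):
        print(group, stories_split(group), f"{tasks_per_story(group):.2f}")
    print(f"approx. implemented AI tasks: {approx_implemented_ai_tasks:.0f}")
```

The story totals (48 and 56) and the AI-group average (about 5.4) match the reported figures. The control-group average computed this way is about 3.29, slightly above the reported 3.2; the summary does not say whether that figure was truncated or averaged per team.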

Implications for AI Economics

Complementarity, not substitution (for now): AI increases task granularity and surfaces neglected activities, implying it complements developers by augmenting planning comprehensiveness. However, imperfect relevance means human oversight remains essential, so AI shifts (rather than eliminates) labor toward verification, curation, and higher-level coordination.
Productivity and taskization effects: More granular task lists could (a) reduce missed work and rework (raising effective throughput/quality) and (b) increase apparent administrative overhead (more issues to track), potentially changing how labor is allocated across roles (more time spent triaging/closing micro-tasks). Net productivity gains depend on the balance between avoided rework and added task-management friction.
Skill-biased demand: Adoption favors workers skilled at prompt engineering, AI oversight, prioritization, and integrating AI outputs—skills that may command a wage premium. Routine estimation or low-level decomposition tasks may decline in value relative to supervisory and integrative skills.
Platform- and tool-specific lock-in and market power: Integration of AI assistants into development platforms (e.g., GitLab Duo) can strengthen platform lock-in. Firms may face switching costs as AI workflows and artifacts become embedded in process tooling; market power implications deserve attention when platforms bundle increasingly capable AI assistants.
Measurement and valuation challenges: Firms should not equate more tasks with greater work value. Economic assessments need to measure time-to-delivery, defect rates, rework, and managerial overhead to establish whether AI-assisted splitting delivers cost savings or just more tracked tasks.
Transition dynamics and policy considerations: Short-term labor impacts are likely modest because human oversight is required. Over time, as LLMs improve and can be fine-tuned for domain/context, some lower-skill planning tasks could be automated—affecting entry-level roles and internship training. Policies and firm strategies should emphasize reskilling toward oversight, tooling integration, and quality assurance.
Research priorities for applied AI economics:
- Field experiments in industry settings measuring time saved, defect reduction, and rework avoidance.
- Cost–benefit analyses accounting for increased task counts and curation costs.
- Studies on wage effects for roles that supervise or integrate AI outputs.
- Investigation of platform competition and lock-in as AI assistants proliferate.
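
The trade-off between avoided rework and added task-management friction can be made concrete with a rough accounting sketch. Everything here is an assumption for illustration: the `SprintEstimate` fields, the linear cost model, and the per-task minute values are hypothetical and do not come from the paper. Only the task count (260) and implementation share (59%) are taken from the study.

```python
from dataclasses import dataclass


@dataclass
class SprintEstimate:
    generated_tasks: int              # tasks proposed by the AI tool
    implemented_share: float          # fraction kept after human curation
    curation_minutes_per_task: float  # hypothetical: review time per proposed task
    tracking_minutes_per_task: float  # hypothetical: overhead per kept task
    avoided_rework_hours: float       # hypothetical: rework avoided by better coverage


def net_hours_saved(e: SprintEstimate) -> float:
    """Avoided rework minus curation and tracking overhead, in hours."""
    curation_hours = e.generated_tasks * e.curation_minutes_per_task / 60
    kept_tasks = e.generated_tasks * e.implemented_share
    tracking_hours = kept_tasks * e.tracking_minutes_per_task / 60
    return e.avoided_rework_hours - curation_hours - tracking_hours


# Placeholder inputs: only the first two values come from the study.
example = SprintEstimate(
    generated_tasks=260,
    implemented_share=0.59,
    curation_minutes_per_task=2.0,
    tracking_minutes_per_task=3.0,
    avoided_rework_hours=20.0,
)

if __name__ == "__main__":
    print(f"net hours saved (illustrative): {net_hours_saved(example):.2f}")
```

A positive result means the assumed rework savings outweigh the oversight costs; a field study would need to replace the placeholder values with measured ones.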

Summary takeaway: Generative AI is an effective augmenting technology in agile task decomposition—improving coverage and granularity but introducing noise—so its economic impact will primarily be through complementarity (reshaping tasks and skills), changes in productivity conditional on oversight costs, and platform-dependent adoption dynamics.

Assessment

Paper Type: rct
Evidence Strength: medium. The experimental design gives reasonably strong internal validity for short-term effects on task-splitting outcomes (granularity, omissions), but strength is limited by the small sample (39 students in 13 teams in the final analysis), the single-tool treatment (GitLab Duo), lab-style tasks rather than field deployment, and the absence of long-run productivity, wage, or firm-level outcomes.
Methods Rigor: medium. A controlled experiment is a rigorous choice, but the paper relies on a single AI tool, and this summary gives no details on randomization, blinding, or measurement protocols; ecological validity and external robustness checks are limited.
Sample: 42 higher-year CS students (39 in the final analysis), working in teams of 3, split provided agile user stories in a controlled experiment. The 6 experimental teams used GitLab Duo to generate task lists and the 7 control teams used conventional methods. Participants were students, not full-time industry developers.
Themes: productivity, human_ai_collab, org_design
Identification: Controlled between-subjects experiment with participants assigned to AI-assisted (GitLab Duo) vs traditional task-splitting conditions and outcomes compared across groups to infer causal effects of the AI intervention.
Generalizability:
- Results pertain to a single Generative AI tool (GitLab Duo) and may not generalize to other models or prompt designs.
- Lab/controlled task setting: effects may differ in real-world team workflows and longer planning horizons.
- Student sample (not full-time industry developers), limiting inference to broader developer populations.
- Limited task domain: findings may not hold across different types of projects, languages, or company processes.
- Short-term outcome measures (task granularity, omission) are not validated against downstream productivity or delivery outcomes.

Claims (8)

Claim	Category	Direction	Confidence	Outcome	Details
Breaking down user stories into actionable tasks is a critical yet time-consuming process in agile software development.	Task Completion Time	negative	high	time required to split user stories (descriptive claim about time consumption)	0.3
We conducted a controlled experiment comparing traditional task-splitting methods with AI-assisted approaches using GitLab Duo.	Other	null_result	high	method comparison (experimental design)	1.0
Current AI tools are not yet mature enough to replace developers.	Job Displacement	negative	high	suitability of AI to replace developers	0.6
AI-assisted approaches can generate more granular task lists than traditional methods.	Output Quality	positive	high	task list granularity	0.6
AI-assisted approaches can help ensure no important tasks are overlooked during task-splitting.	Error Rate	positive	high	task omission rate / completeness of task lists	0.6
Participants favored a hybrid approach, combining AI tools with conventional methods to maintain high accuracy in planning.	Output Quality	positive	high	participant preference for planning approach / planning accuracy	0.6
AI tools can serve as valuable aids in task splitting, provided there is human oversight to filter out irrelevant tasks.	Developer Productivity	positive	high	effectiveness of AI-assisted task-splitting under human oversight	0.6
Integrating Generative AI into agile development processes has potential benefits and limitations for planning efficiency.	Organizational Efficiency	mixed	high	planning efficiency (benefits and limitations)	0.6