Code-aware conversational assistants are reshaping developer workflows: rather than giving complete instructions up front, programmers iteratively refine AI outputs, offload diagnosis and validation tasks to models, and steer AI autonomy by embedding context and constraints into ongoing chats.

Programming by Chat: A Large-Scale Behavioral Analysis of 11,579 Real-World AI-Assisted IDE Sessions

Ningzhi Tang, Chaoran Chen, Zihan Fang, Gelei Xu, Maria Dhakal, Yiyu Shi, Collin McMillan, Yu Huang, Toby Jia-Jun Li · April 01, 2026

arXiv · descriptive · medium evidence · relevance 8/10

Analysis of 74,998 real-world IDE-chat messages shows conversational, codebase-aware AI assistants lead developers to progressively specify tasks, delegate diagnostic and validation work to the AI, and actively manage the collaboration via persistent artifacts and context constraints.

IDE-integrated AI coding assistants, which operate conversationally within developers' working codebases with access to project context and multi-file editing, are rapidly reshaping software development. However, empirical investigation of this shift remains limited: existing studies largely rely on small-scale, controlled settings or analyze general-purpose chatbots rather than codebase-aware IDE workflows. We present, to the best of our knowledge, the first large-scale study of real-world conversational programming in IDE-native settings, analyzing 74,998 developer messages from 11,579 chat sessions across 1,300 repositories and 899 developers using Cursor and GitHub Copilot. These chats were committed to public repositories as part of routine development, capturing in-the-wild behavior. Our findings reveal three shifts in how programming work is organized: conversational programming operates as progressive specification, with developers iteratively refining outputs rather than specifying complete tasks upfront; developers redistribute cognitive work to AI, delegating diagnosis, comprehension, and validation rather than engaging with code and outputs directly; and developers actively manage the collaboration, externalizing plans into persistent artifacts, and negotiating AI autonomy through context injection and behavioral constraints. These results provide foundational empirical insights into AI-assisted development and offer implications for the design of future programming environments.

Summary

Main Finding

IDE-integrated conversational AI assistants (e.g., Cursor, GitHub Copilot Chat) reshape software workflows from one-shot coding to an iterative, dialogue-driven process in which developers progressively specify requirements and shift substantive cognitive work—diagnosis, comprehension, and validation—onto AI. Developers nonetheless actively manage AI autonomy (externalizing plans, injecting context, constraining behavior). These shifts produce distinct session archetypes (short, task-focused exchanges and a long tail of extended co-development/debugging), with important implications for productivity, skill demand, quality assurance, and market structure in software development.

Key Points

Large, ecologically valid dataset: 74,998 developer messages from 11,579 IDE-chat sessions across 1,300 public repositories and 899 developers (Sept 2024–Mar 2026).
Behavioral taxonomy: developed through iterative abductive coding, with 7 main categories and 20 subcategories of developer intent (a message can carry multiple labels).
Three major behavioral shifts identified:
- Progressive specification: developers steer AI via successive refinements instead of fully specifying tasks upfront.
- Redistribution of cognitive work: developers report symptoms and ask the assistant to diagnose, comprehend, and validate code rather than doing these tasks themselves.
- Active collaboration management: developers request persistent planning artifacts, inject project context, set explicit action constraints, and open new sessions to refresh context while preserving continuity.
Session-level structure: most sessions are short and focused; a long tail supports extended iterative refinement. Clustering of sessions (4,864 with ≥4 messages) yields six archetypes ranging from failure-driven debugging to extended co-development.
Multilingual and multi-domain: messages in >20 natural languages; dominant code languages TypeScript, Python, JavaScript, HTML. Repositories skew toward single-contributor, early-stage projects.
Methods validation: LLM-based multi-label classifier (GPT-5 mini) used to label messages; validated on 400 manually adjudicated messages (macro F1 = 0.802).

Data & Methods

Data source: Chat histories auto-exported to public GitHub repos by SpecStory; collected via GitHub Code Search and Blob APIs.
Filtering: Excluded CLI-agent sessions (e.g., Claude Code) to focus on IDE-native conversational workflows. Deduplicated by content hash; removed messages with no classifiable behavioral intent.
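
The content-hash deduplication step can be illustrated with a minimal sketch. The summary does not say which hash function or normalization the authors used, so SHA-256 and whitespace normalization here are assumptions, and `content_hash` and `deduplicate` are hypothetical helper names.

```python
import hashlib


def content_hash(text: str) -> str:
    """Return a SHA-256 hex digest of whitespace-normalized text.

    Assumption: the source only says "content hash"; the hash function
    and the normalization are illustrative choices.
    """
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(bodies: list[str]) -> list[str]:
    """Keep the first occurrence of each distinct body, preserving order."""
    seen: set[str] = set()
    unique: list[str] = []
    for body in bodies:
        digest = content_hash(body)
        if digest not in seen:
            seen.add(digest)
            unique.append(body)
    return unique
```

With this normalization, two exports that differ only in spacing or line breaks collapse to one entry, which may or may not match the authors' definition of a duplicate.
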
Final analytic dataset: 74,998 user messages, 11,579 sessions, 1,300 repos, 899 distinct developers.
Taxonomy development: 4 rounds of abductive coding, team consensus, structural saturation achieved (no new subcategories in final round).
Annotation pipeline:
- Multi-label classification using GPT-5 mini with schema-constrained JSON outputs.
- Each classification included the current user message plus the immediately preceding user/AI exchange (long exchanges were truncated head-tail, keeping the beginning and the end).
- The classifier was required to give a justification for each classification, for robustness.
- Validation: stratified sample (20 messages per subcategory; 400 total) annotated by two researchers; adjudicated gold standard; classifier macro F1 = 0.802 (precision 0.774, recall 0.851).
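
To make the validation metrics concrete, here is a minimal sketch of macro-averaged precision, recall, and F1 for multi-label predictions. It assumes the reported figures average per-subcategory scores with equal weight, which the summary does not state explicitly. The function name `macro_scores` and the label names in the usage note are hypothetical.

```python
def macro_scores(
    gold: list[set[str]],
    pred: list[set[str]],
    labels: list[str],
) -> dict[str, float]:
    """Macro-average precision, recall, and F1 over labels.

    Each message has a set of gold labels and a set of predicted labels.
    Scores are computed per label, then averaged with equal weight.
    """
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if label in g and label in p)
        fp = sum(1 for g, p in zip(gold, pred) if label not in g and label in p)
        fn = sum(1 for g, p in zip(gold, pred) if label in g and label not in p)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }
```

For example, with labels `["debug", "explain"]`, gold sets `[{"debug"}, {"explain"}, {"debug", "explain"}]`, and predictions `[{"debug"}, {"debug"}, {"explain"}]`, the per-label F1 values are 0.5 and about 0.667. Under this convention, macro F1 is the mean of per-label F1 scores, so it need not equal the harmonic mean of macro precision and macro recall (about 0.811 for 0.774 and 0.851).
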
Session analysis: represented sessions as ordered sequences of intent labels; applied hierarchy-aware edit-distance sequence clustering to identify archetypes and analyzed transition dynamics.
Limitations noted by authors: sample bias toward developers who commit chat logs publicly (SpecStory users), repository skew to small/personal projects, exclusion of CLI-agent paradigms, only user messages analyzed (not assistant outputs), and potential truncation/context limits.
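
The session analysis above compares sessions as sequences of intent labels using a hierarchy-aware edit distance. The sketch below shows one way such a distance could work; it is not the authors' implementation. It assumes one label per message written as `category/subcategory`, a substitution cost of 0.5 within the same main category and 1.0 across categories, and unit insert/delete costs. All label names and cost values are hypothetical.

```python
def substitution_cost(a: str, b: str) -> float:
    """Cost of replacing label a with label b.

    Assumed costs: 0 for identical labels, 0.5 when only the subcategory
    differs, 1.0 when the main category differs.
    """
    if a == b:
        return 0.0
    if a.split("/")[0] == b.split("/")[0]:
        return 0.5
    return 1.0


def hierarchy_edit_distance(s: list[str], t: list[str]) -> float:
    """Weighted Levenshtein distance between two label sequences."""
    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        curr = [float(i)]
        for j, b in enumerate(t, start=1):
            curr.append(
                min(
                    prev[j] + 1.0,  # delete a from s
                    curr[j - 1] + 1.0,  # insert b into s
                    prev[j - 1] + substitution_cost(a, b),  # substitute
                )
            )
        prev = curr
    return prev[-1]
```

Two sessions that differ only in which debugging subcategory opens them would then sit closer together than sessions that open with entirely different categories. A pairwise distance matrix built this way could feed a standard clustering method; that step is omitted here.
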

Implications for AI Economics

Productivity & task composition
- Short run: assistants can raise developer throughput by automating routine implementation, debugging steps, and some validation tasks—yielding potential productivity gains per developer.
- Task decomposition shifts: higher share of incremental, interactive delegation (progressive specification) favors workflows where tasks can be broken into iterative prompts rather than large upfront specifications.
Labor demand and skill-biased changes
- Demand likely shifts away from routine, implementation-centric tasks toward oversight, verification, prompt engineering, system integration, and higher-level design/architecture.
- Reduced engagement with raw code (less reading/diagnosis) may slow human skill accumulation for junior developers, altering career progression and human capital formation.
- New complementary roles increase demand: testers, security auditors, AI-verification engineers, and QA teams focused on validating AI-generated code.
Quality, risk, and transaction costs
- Delegation of diagnosis and validation to AI could raise downstream defect risk if human oversight is weak (“vibe coding” externality). This raises expected costs of bug discovery, security incidents, and rework.
- Firms may increase investment in automated testing, formal verification, and auditing tools—creating markets for verification and compliance services.
- Liability and IP: AI-generated code provenance and licensing issues create legal and transaction-cost frictions; firms may demand audit logs, provenance tools, and indemnity clauses from AI tool providers.
Market structure & value capture
- IDE-native assistants that leverage repository context and tool access offer stronger product differentiation (higher task-specific utility), increasing platform lock-in and potential winner-takes-most dynamics.
- Value accrues to firms that provide tight IDE integration, high-quality context-aware models, and workflow tooling (e.g., persistent planning artifacts, context-injection features).
- Developers and firms may prefer paid, integrated assistants over general chatbots—shifting spending toward platform subscriptions, enterprise features, and ecosystem lock-in.
Measurement and macro implications
- Conventional productivity statistics may under- or overestimate output, because AI contributions are invisible in traditional metrics such as developer hours or lines of code; measuring true labor productivity requires new metrics (e.g., validated feature throughput, defect-adjusted output).
- Potential for labor displacement at the margin—but with heterogeneous effects: more junior/entry-level coding tasks are more automatable, while senior roles emphasizing oversight/architecture remain complementary.
Policy and organizational responses
- Firms should establish governance: mandatory review steps, provenance logging, role-based constraints on AI autonomy, and investment in verification/testing to internalize risk externalities.
- Training and workforce development should pivot to emphasize validation skills, AI oversight, prompt design, and system-level engineering.
Directions for economists/researchers
- Quantify causal effects on productivity, wages, and employment using longitudinal or quasi-experimental designs (e.g., firm adoption shocks, subscription rollouts).
- Model complementarities between AI tools and human skills to predict occupational reallocation and wage dynamics.
- Evaluate the social welfare trade-offs: short-term efficiency vs. long-term human capital erosion and externalities from potential increases in defective codebases.
- Study pricing and market competition among IDE-integrated assistants, focusing on lock-in, switching costs, and data/network effects from project-context access.

Short actionable takeaway: conversational, IDE-native AI assistants are changing what programmers do (more steering and oversight, less low-level coding), creating productivity upside but also new verification costs, shifting skill demand, and concentrating value with integrated platform providers—so firms and policymakers should invest in governance, measurement, and workforce re-skilling while economists measure labor-market and quality externalities.

Assessment

Paper Type: descriptive
Evidence Strength: medium. Uses a large, novel, real-world dataset of IDE-native conversational interactions (74,998 messages across 11,579 sessions, 1,300 repos, 899 developers), which provides credible descriptive evidence about how developers use code-aware assistants; however, it is observational with no counterfactual or causal identification, subject to selection and measurement biases, and does not measure downstream productivity or economic outcomes directly.
Methods Rigor: medium. The study appears to combine large-scale log analysis with qualitative interpretation of conversational patterns, a strong approach for generating, characterizing, and triangulating behavioral motifs; but the paper lacks causal designs, may rely on subjective coding/annotation without reported inter-rater reliability here, and does not link behaviors to objective performance metrics or rule out important selection confounders.
Sample: 74,998 developer messages from 11,579 chat sessions across 1,300 public repositories and 899 developers using Cursor and GitHub Copilot; chats were captured from IDE-integrated, codebase-aware conversational assistants and were committed to public repositories as part of routine development (Sept 2024–Mar 2026).
Themes: human_ai_collab, productivity, org_design, adoption
Generalizability:
- Public-repository bias: excludes private/enterprise codebases and internal workflows.
- Self-selection and early-adopter bias: users of Cursor and Copilot Chat may not represent typical developers.
- Tool-specific: findings may not generalize to other conversational assistants or non-IDE workflows.
- Committed-chat subset: only interactions saved to repos are observed, missing ephemeral but useful interactions.
- Language/project bias: may overrepresent particular programming languages, frameworks, or repo sizes.
- Unknown demographics/geography: limited information on developer experience, firm size, or region.

Claims (6)

Claim	Category	Direction	Confidence	Outcome	Details
We present, to the best of our knowledge, the first large-scale study of real-world conversational programming in IDE-native settings.	Other	mixed	medium	existence/novelty of a large-scale empirical study of IDE-native conversational programming	n=74998 0.02
We analyze 74,998 developer messages from 11,579 chat sessions across 1,300 repositories and 899 developers using Cursor and GitHub Copilot.	Other	null_result	high	number of developer messages / chat sessions / repositories / developers analyzed	n=74998 0.3
These chats were committed to public repositories as part of routine development, capturing in-the-wild behavior.	Other	null_result	high	degree to which collected chats represent in-the-wild developer behavior (public repository commits)	n=74998 0.3
Conversational programming operates as progressive specification, with developers iteratively refining outputs rather than specifying complete tasks upfront.	Task Allocation	mixed	high	mode of task specification (iterative refinement vs complete upfront specification)	n=74998 0.18
Developers redistribute cognitive work to AI, delegating diagnosis, comprehension, and validation rather than engaging with code and outputs directly.	Task Allocation	mixed	high	allocation of cognitive tasks (diagnosis, comprehension, validation) between developers and AI	n=74998 0.18
Developers actively manage the collaboration, externalizing plans into persistent artifacts, and negotiating AI autonomy through context injection and behavioral constraints.	Organizational Efficiency	mixed	high	practices for managing AI collaboration (externalization of plans, context injection, behavioral constraints)	n=74998 0.18