A study of 6,000 live coding-agent sessions finds that agents tend to write either almost all or none of the committed code (41% 'vibe coding' versus 23% human-only). Yet only 44% of agent-produced code survives into commits, agent contributions carry more security flaws, and users push back in 44% of turns.

SWE-chat: Coding Agent Interactions From Real Users in the Wild

Joachim Baumann, Vishakh Padmakumar, Xiang Li, John Yang, Diyi Yang, Sanmi Koyejo · April 22, 2026

arXiv · descriptive · medium evidence · 7/10 relevance

Using a new 6,000-session SWE-chat dataset of real-world coding-agent interactions, the authors show coding behavior is bimodal (41% agent-dominant, 23% human-only), only 44% of agent-generated code survives into commits, agent-authored code has higher measured vulnerability rates, and users correct or interrupt agents in 44% of turns.

AI coding agents are being adopted at scale, yet we lack empirical evidence on how people actually use them and how much of their output is useful in practice. We present SWE-chat, the first large-scale dataset of real coding agent sessions collected from open-source developers in the wild. The dataset currently contains 6,000 sessions, comprising more than 63,000 user prompts and 355,000 agent tool calls. SWE-chat is a living dataset; our collection pipeline automatically and continually discovers and processes sessions from public repositories. Leveraging SWE-chat, we provide an initial empirical characterization of real-world coding agent usage and failure modes. We find that coding patterns are bimodal: in 41% of sessions, agents author virtually all committed code ("vibe coding"), while in 23%, humans write all code themselves. Despite rapidly improving capabilities, coding agents remain inefficient in natural settings. Just 44% of all agent-produced code survives into user commits, and agent-written code introduces more security vulnerabilities than code authored by humans. Furthermore, users push back against agent outputs -- through corrections, failure reports, and interruptions -- in 44% of all turns. By capturing complete interaction traces with human vs. agent code authorship attribution, SWE-chat provides an empirical foundation for moving beyond curated benchmarks towards an evidence-based understanding of how AI agents perform in real developer workflows.

Summary

Main Finding

SWE-chat is the first large-scale, continuously growing dataset of real-world coding agent sessions (opt-in by developers) that links full agent interaction traces to git commits with line-level human vs. agent authorship. Using SWE-chat (≈6k sessions, 63k+ user prompts, 355k+ agent tool calls at the time of reporting), the authors show that agent usage is highly bimodal: many sessions are either “vibe coding” (agent authors ≳99% of committed code) or human-only, and agent outputs are frequently discarded or corrected. Overall, only ~44% of agent-produced code survives into commits; vibe coding is increasing but is less efficient, slower, and introduces substantially more security findings per committed line than human-authored or collaboratively authored code.

Key Points

Dataset scale and scope
- Currently ≈6,000 sessions from 200+ public GitHub repos, 63k+ user prompts, 355k+ agent tool calls, and 2.7M logged events overall; collection is continuous (living dataset) via the Entire.io CLI opt-in.
- Logged agents include Claude Code, OpenCode, Gemini CLI, Cursor, Factory AI Droid (≈85% Claude Code in practice).
- Each session links agent tool-call traces to git diffs with line-level authorship attribution.
High-level usage patterns (RQ1)
- Task diversity: the most common specific user intent is “understand existing code” (~19% of prompts); creating new code ~13%; git ops and debugging are common.
- Tool-use composition: ~33% of agent tool calls are bash (mostly git), ~48% are file read/edit/search operations—agents do much more than just generate patches.
- User personas: many users act as “expert nitpickers” (meticulous corrections) even when letting agents write code.
Coding modes and trend
- Three coding modes: human-only (22.7% of sessions), collaborative (36.5%), vibe coding (40.8%).
- Vibe coding share doubled over a ≈3-month window (from ~20% to >40%).
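
The three modes can be read as a partition of sessions by the agent's share of committed lines. The sketch below is illustrative only: the 0.99 cutoff follows the "≳99%" description of vibe coding, while the function names, the `(agent_lines, human_lines)` input shape, and the `no-commit` bucket are assumptions rather than the paper's implementation.

```python
from collections import Counter

# Assumed cutoff, taken from the "agent authors >=99% of committed code" description.
VIBE_THRESHOLD = 0.99


def coding_mode(agent_lines: int, human_lines: int) -> str:
    """Label one session by who authored its committed lines."""
    total = agent_lines + human_lines
    if total == 0:
        # Hypothetical bucket for sessions that committed nothing.
        return "no-commit"
    agent_share = agent_lines / total
    if agent_share == 0:
        return "human-only"
    if agent_share >= VIBE_THRESHOLD:
        return "vibe"
    return "collaborative"


def mode_shares(sessions):
    """Fraction of sessions per mode; sessions is a list of (agent_lines, human_lines)."""
    counts = Counter(coding_mode(a, h) for a, h in sessions)
    n = sum(counts.values())
    return {mode: counts[mode] / n for mode in counts}
```

Tracking `mode_shares` over successive time windows would give the kind of trend reported above (vibe coding share rising over roughly three months).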
Failure modes and user responses (RQ2)
- Session success: most sessions rate as largely successful (median success ≈82%), but failures occur (interruptions, irrelevant outputs, run-time/resource limits).
- User pushback: users push back (corrections, failure reports, rejections) in a large fraction of turns (≈39–44%, depending on the metric); interrupts occur in ~5% of turns.
- Agents rarely ask clarification questions—even for long-running autonomous actions.
Efficiency, cost, and safety
- Code survival: overall ~44.3% of agent-produced code survives into user commits; vibe coding has higher survival (~59%) but may reflect lower scrutiny.
- Cost/tokens/time: vibe coding is much less efficient—median ~204k tokens per 100 committed lines, ~3× the token/cost per committed line vs collaborative mode; median dollar cost per 100 committed lines: $0.13 (vibe) vs $0.07 (human-only) vs $0.05 (collaborative) as reported.
- Safety: vibe-coded commits introduce substantially more security findings—roughly 9× more vulnerabilities per committed line than pure human-authored code and ~5× more than co-authored code (measured by running Semgrep on pre/post commit snapshots).
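
The per-100-lines figures above are simple normalizations of session totals. A minimal sketch follows, assuming each session is reduced to an `(amount, committed_lines)` pair; the helper names are hypothetical, and the paper's exact aggregation may differ.

```python
from statistics import median


def per_100_lines(amount: float, committed_lines: int) -> float:
    """Normalize a session total (tokens, dollars, seconds) to 100 committed lines."""
    if committed_lines <= 0:
        raise ValueError("session has no committed lines to normalize by")
    return 100.0 * amount / committed_lines


def median_per_100_lines(sessions):
    """Median normalized amount; sessions with zero committed lines are skipped."""
    return median(per_100_lines(a, n) for a, n in sessions if n > 0)


# Ratio of the reported median dollar costs per 100 committed lines.
vibe_cost, collab_cost = 0.13, 0.05
ratio = vibe_cost / collab_cost  # about 2.6
```

The dollar ratio of about 2.6 is broadly in line with the "~3×" figure, which may refer to tokens rather than dollars.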
Annotation & validation
- Sessions and prompts are annotated (intent, session success, pushback, personas) using LLM-based judges validated against human labels; inter-annotator agreement is reported as moderate to high, with details in the appendices.
- Metrics defined include code survival rate, coding efficiency, token/cost/time per committed line.
Limitations
- Opt-in, early-adopter population; heavy skew toward one agent (Claude Code); public repos only — may not generalize across all developers or enterprise settings.
- LLM annotators introduce labeling noise despite validation.

Data & Methods

Data collection
- Entire.io CLI installed by opt-in developers records full agent session transcripts to a dedicated branch and links checkpoints to commits.
- Captured events: user prompts, agent text responses, tool calls (file read/edit, shell commands, searches), token usage, streamed progress events, and some reasoning traces.
Linking to code
- Automatic sync of agent logs with git diffs provides line-level attribution of who authored each line in commits (agent vs human).
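
The summary does not describe how the attribution is computed. Purely as an illustration of what line-level attribution means, the sketch below uses a naive exact-match heuristic. The function name, inputs, and matching rule are all assumptions and should not be read as the authors' method.

```python
from collections import Counter


def attribute_lines(committed_added, agent_written):
    """Label each added line in a commit as 'agent' or 'human'.

    Naive heuristic (an assumption, not the paper's procedure): a committed
    line counts as agent-authored if the agent wrote an identical line,
    ignoring surrounding whitespace, that has not already been matched.
    """
    remaining = Counter(line.strip() for line in agent_written)
    labels = []
    for line in committed_added:
        key = line.strip()
        if remaining[key] > 0:
            remaining[key] -= 1  # consume one agent-written copy
            labels.append("agent")
        else:
            labels.append("human")
    return labels
```

An exact-match rule like this would mislabel lines that a human lightly edited after the agent wrote them, which is one reason attribution is a potential source of measurement error.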
Annotation pipeline
- Annotation tasks include session success (0–100), user persona, prompt intent, and user pushback categories.
- LLM “judges” (selected and validated for best zero-shot performance) annotate the full dataset; human labels used for validation and inter-annotator agreement checks.
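
The summary does not say which agreement statistic is used. One common choice for categorical labels such as pushback categories is Cohen's kappa, sketched here as an assumption. It would not apply directly to the 0–100 success score, which is numeric.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each rater labeled independently at their own rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)
```

Here `labels_a` could hold human labels and `labels_b` the LLM judge's labels for the same prompts.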
Evaluation metrics
- Code survival rate: fraction of agent-produced code that survives into commits (with breakdown into agent self-overwrites, human overwrites, human deletions).
- Coding efficiency: fraction of agent effort that ends up in the commit.
- Costs: tokens and dollar-costs per 100 committed lines; time per 100 committed lines; agent runtime.
- Safety: run Semgrep on pre- and post-commit snapshots to count security findings introduced per committed line.
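
As a rough sketch of how the survival and safety metrics fit together, the code below tallies the fate of agent-produced lines and counts findings that appear only after a commit. The class, field names, and the `(rule_id, path, snippet)` representation of a finding are assumptions for illustration; the scanner invocation itself is left out.

```python
from dataclasses import dataclass


@dataclass
class AgentLineFates:
    """Counts of agent-produced lines by what eventually happened to them."""
    survived: int           # present in the user's commit
    agent_overwritten: int  # replaced later by the agent itself
    human_overwritten: int  # replaced by the human
    human_deleted: int      # removed by the human

    @property
    def produced(self) -> int:
        return (self.survived + self.agent_overwritten
                + self.human_overwritten + self.human_deleted)

    def survival_rate(self) -> float:
        """Fraction of agent-produced lines that reach a commit."""
        return self.survived / self.produced if self.produced else 0.0


def findings_introduced(pre: set, post: set) -> set:
    """Findings present in the post-commit snapshot but not the pre-commit one."""
    return post - pre


def findings_per_line(pre: set, post: set, committed_lines: int) -> float:
    """Newly introduced findings, normalized by committed lines."""
    if committed_lines <= 0:
        raise ValueError("commit has no lines to normalize by")
    return len(findings_introduced(pre, post)) / committed_lines
```

Comparing `findings_per_line` across coding modes is what yields multipliers such as the reported vibe-versus-human gap.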

Implications for AI Economics

Productivity vs. quality trade-off
- Although agents can author large fractions of code (vibe coding), survival and safety metrics show quality-adjusted productivity is lower: much agent effort is discarded or introduces vulnerabilities. Economic analyses of agent adoption must account for quality-adjusted output, not raw lines generated.
Cost accounting must include downstream remediation
- Token/runtime costs are small per committed line but increased vulnerability rates imply added downstream costs (security reviews, bug fixes, incident response). Firms should model expected remediation costs and liability when estimating ROI from agent deployment.
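
One simple way to frame this accounting is to add an expected remediation term to the generation cost. The function and its parameters below are hypothetical, and any values passed in would need to come from a firm's own data rather than from the paper.

```python
def expected_cost_per_line(generation_cost: float,
                           findings_per_line: float,
                           remediation_cost_per_finding: float) -> float:
    """Generation cost plus expected downstream remediation cost, per committed line.

    All three inputs are placeholders to be estimated by the adopting firm.
    """
    if min(generation_cost, findings_per_line, remediation_cost_per_finding) < 0:
        raise ValueError("costs and rates must be non-negative")
    return generation_cost + findings_per_line * remediation_cost_per_finding
```

Under this framing, a mode with cheap generation but a much higher findings rate can still be the more expensive one once remediation is priced in.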
Complementarity, not full substitution (for now)
- The prevalence of “expert nitpicker” behavior and frequent user pushback indicates that developers remain essential for steering, vetting, and correcting agent output. Labor models should emphasize augmentation (task reallocation, skill shifts) rather than immediate large-scale labor displacement.
Incentives for interaction design and procurement
- Agent interfaces that prompt clarifying questions, surface provenance, and reduce unnecessary autonomous work could improve economic efficiency (reduce token waste and vulnerability risk). Procurement decisions should favor agents and integrations that optimize interaction efficiency and safety.
Market and policy implications
- Increasing adoption of vibe coding suggests demand for tools that certify or audit agent outputs (security scanners, provenance tracking, insurance products). Regulators and insurers may need metrics like survival rate and vulnerability-introduction rates to assess systemic risk and to price cyber-insurance.
Benchmarking and R&D priorities
- Benchmarks focused narrowly on patch generation understate operational value and costs. Economics-driven R&D should fund work that improves agent comprehension, clarification behavior, and collaborative workflows (these reduce wasted effort and security costs).
Data-driven forecasting
- Living datasets like SWE-chat enable dynamic measurement of agent adoption, per-unit cost, and quality trends over time—critical inputs for modeling labor market impacts, firm-level adoption curves, and macro-level productivity estimates.


Assessment

Paper Type: descriptive

Evidence Strength: medium. Large-scale, real-world dataset (6,000 sessions, 63k prompts, 355k tool calls) gives strong descriptive evidence about how coding agents are used and the quality of their outputs, but the study is observational with no causal identification and subject to selection and measurement biases (e.g., which sessions are discoverable, authorship attribution, vulnerability labeling), so conclusions about broader impacts or causal effects are limited.

Methods Rigor: medium. The authors implement an automated, ongoing collection pipeline and provide comprehensive interaction traces with authorship attribution, enabling systematic descriptive analysis at scale; however, the paper likely relies on heuristics for attributing code to agents vs humans and for detecting vulnerabilities, lacks randomized or quasi-experimental variation, and may not fully validate labeling accuracy or address survivorship/selection biases.

Sample: SWE-chat, an automatically discovered, continuously updated corpus of ~6,000 real-world coding-agent sessions from public/open-source repositories, containing >63,000 user prompts and ~355,000 agent tool calls; sessions are from developers 'in the wild' (public commits) and include interaction traces plus commit-level authorship annotation.

Themes: productivity, human_ai_collab, adoption, skills_training

Generalizability
- Limited to public/open-source repositories; excludes enterprise, private, and non-committed workflows.
- Selection bias toward developers who expose agent interactions in public commits and toward agents/tools that integrate with public platforms.
- Potential language, tech-stack, and project-size skew (e.g., certain languages or repos more likely to use agents).
- Time-bound snapshot; agent capabilities and usage patterns evolve rapidly.
- Possible measurement error in attributing authorship and in vulnerability detection limits external validity.

Claims (9)

Claim	Category	Direction	Confidence	Outcome	Details
We present SWE-chat, the first large-scale dataset of real coding agent sessions collected from open-source developers in the wild.	Other	positive	high	existence and scale of the SWE-chat dataset (novel dataset release)	n=6000 0.18
The dataset currently contains 6,000 sessions, comprising more than 63,000 user prompts and 355,000 agent tool calls.	Other	positive	high	dataset size (sessions, prompts, agent tool calls)	n=6000 more than 63,000 user prompts; 355,000 agent tool calls 0.3
SWE-chat is a living dataset; our collection pipeline automatically and continually discovers and processes sessions from public repositories.	Other	positive	high	dataset collection process (automated, continual discovery from public repositories)	n=6000 0.18
Coding patterns are bimodal: in 41% of sessions, agents author virtually all committed code ("vibe coding"), while in 23%, humans write all code themselves.	Adoption Rate	mixed	high	distribution of code authorship across sessions (agent-dominant vs human-only sessions)	n=6000 41% of sessions (agent-dominant); 23% of sessions (human-only) 0.18
Despite rapidly improving capabilities, coding agents remain inefficient in natural settings.	Developer Productivity	negative	high	overall agent efficiency in natural developer workflows (qualitative synthesis)	n=6000 0.18
Just 44% of all agent-produced code survives into user commits.	Output Quality	negative	high	survival/usefulness of agent-produced code (proportion incorporated into commits)	n=6000 Just 44% of all agent-produced code 0.18
Agent-written code introduces more security vulnerabilities than code authored by humans.	Output Quality	negative	high	security vulnerabilities introduced by agent-written code versus human-written code	n=6000 0.18
Users push back against agent outputs -- through corrections, failure reports, and interruptions -- in 44% of all turns.	Worker Satisfaction	negative	high	rate of user pushback per interaction turn	n=63000 44% of all turns 0.18
By capturing complete interaction traces with human vs. agent code authorship attribution, SWE-chat provides an empirical foundation for moving beyond curated benchmarks towards an evidence-based understanding of how AI agents perform in real developer workflows.	Research Productivity	positive	medium	utility of SWE-chat for empirical research and benchmark improvement	n=6000 0.02