A large study of 6,000 live coding-agent sessions finds that agents tend to write either almost all or none of the committed code (41% 'vibe coding' versus 23% human-only). Only 44% of agent-produced code survives into commits, agent contributions carry more security flaws, and users push back in 44% of turns.
AI coding agents are being adopted at scale, yet we lack empirical evidence on how people actually use them and how much of their output is useful in practice. We present SWE-chat, the first large-scale dataset of real coding agent sessions collected from open-source developers in the wild. The dataset currently contains 6,000 sessions, comprising more than 63,000 user prompts and 355,000 agent tool calls. SWE-chat is a living dataset; our collection pipeline automatically and continually discovers and processes sessions from public repositories. Leveraging SWE-chat, we provide an initial empirical characterization of real-world coding agent usage and failure modes. We find that coding patterns are bimodal: in 41% of sessions, agents author virtually all committed code ("vibe coding"), while in 23%, humans write all code themselves. Despite rapidly improving capabilities, coding agents remain inefficient in natural settings. Just 44% of all agent-produced code survives into user commits, and agent-written code introduces more security vulnerabilities than code authored by humans. Furthermore, users push back against agent outputs -- through corrections, failure reports, and interruptions -- in 44% of all turns. By capturing complete interaction traces with human vs. agent code authorship attribution, SWE-chat provides an empirical foundation for moving beyond curated benchmarks towards an evidence-based understanding of how AI agents perform in real developer workflows.
Summary
Main Finding
SWE-chat is the first large-scale, continuously growing dataset of real-world coding agent sessions, contributed by developers who opt in, that links full agent interaction traces to git commits with line-level human vs. agent authorship. Using SWE-chat (≈6k sessions, 63k+ user prompts, 355k+ agent tool calls at the time of reporting), the authors show that agent usage is highly bimodal: many sessions are either “vibe coding” (agent authors ≳99% of committed code) or human-only, and agent outputs are frequently discarded or corrected. Overall, only ~44% of agent-produced code survives into commits; vibe coding is increasing but is less efficient, slower, and introduces substantially more security findings per committed line than human-authored or collaboratively authored code.
Key Points
Dataset scale and scope
- Currently ≈6,000 sessions from 200+ public GitHub repos, 63k+ user prompts, 355k+ agent tool calls, and 2.7M logged events overall; collection is continuous (a living dataset) through the opt-in Entire.io CLI.
- Logged agents include Claude Code, OpenCode, Gemini CLI, Cursor, and Factory AI Droid (Claude Code accounts for ≈85% in practice).
- Each session links agent tool-call traces to git diffs with line-level authorship attribution.
High-level usage patterns (RQ1)
- Task diversity: the most common specific user intent is “understand existing code” (~19% of prompts); creating new code ~13%; git ops and debugging are common.
- Tool-use composition: ~33% of agent tool calls are bash (mostly git), ~48% are file read/edit/search operations—agents do much more than just generate patches.
- User personas: many users act as “expert nitpickers” (meticulous corrections) even when letting agents write code.
Coding modes and trend
- Three coding modes: human-only (22.7% of sessions), collaborative (36.5%), vibe coding (40.8%).
- Vibe coding share doubled over a ≈3-month window (from ~20% to >40%).
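The three modes above can be read as thresholds on the agent's share of committed lines. The following is a minimal sketch under that assumption: the 0.99 cutoff is taken from the "≳99%" description in this summary, and the function names, the `no-commit` label, and the input format are hypothetical rather than the authors' implementation.

```python
from collections import Counter

# Assumed cutoff, based on the "agent authors ≳99% of committed code" description.
VIBE_THRESHOLD = 0.99


def coding_mode(agent_lines: int, human_lines: int) -> str:
    """Label a session by the agent's share of committed lines."""
    total = agent_lines + human_lines
    if total == 0:
        # Hypothetical label for sessions that committed nothing.
        return "no-commit"
    share = agent_lines / total
    if share == 0:
        return "human-only"
    if share >= VIBE_THRESHOLD:
        return "vibe"
    return "collaborative"


def mode_shares(sessions):
    """Return the fraction of sessions in each mode, skipping sessions without commits.

    sessions: iterable of (agent_lines, human_lines) pairs.
    """
    labels = [coding_mode(agent, human) for agent, human in sessions]
    labels = [label for label in labels if label != "no-commit"]
    counts = Counter(labels)
    return {mode: n / len(labels) for mode, n in counts.items()}


if __name__ == "__main__":
    # Made-up sessions for illustration only.
    example = [(0, 10), (100, 0), (5, 5), (200, 1)]
    print(mode_shares(example))
```

Tracking `mode_shares` over successive time windows is one way to reproduce a trend such as the reported rise in vibe coding share.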
Failure modes and user responses (RQ2)
- Session success: most sessions are rated largely successful (median success score ≈82 out of 100), but failures occur (interruptions, irrelevant outputs, run-time/resource limits).
- User pushback: users push back (corrections, failure reports, rejections) in ≈39–44% of turns, depending on the metric; interrupts occur in ~5% of turns.
- Agents rarely ask clarification questions—even for long-running autonomous actions.
Efficiency, cost, and safety
- Code survival: overall, ~44.3% of agent-produced code survives into user commits; vibe coding has a higher survival rate (~59%), though this may reflect lower scrutiny rather than better code.
- Cost/tokens/time: vibe coding is much less efficient, with a median of ~204k tokens per 100 committed lines and ~3× the token/cost per committed line of collaborative mode. Reported median dollar cost per 100 committed lines: $0.13 (vibe) vs $0.07 (human-only) vs $0.05 (collaborative).
- Safety: vibe-coded commits introduce substantially more security findings—roughly 9× more vulnerabilities per committed line than pure human-authored code and ~5× more than co-authored code (measured by running Semgrep on pre/post commit snapshots).
Annotation & validation
- Sessions and prompts are annotated (intent, session success, pushback, personas) using LLM-based judges validated against human labels; inter-annotator agreement is reported as moderate to high, with details in the appendices.
- Metrics defined include code survival rate, coding efficiency, token/cost/time per committed line.
Limitations
- Opt-in, early-adopter population; heavy skew toward one agent (Claude Code); public repos only — may not generalize across all developers or enterprise settings.
- LLM annotators introduce labeling noise despite validation.
Data & Methods
Data collection
- Entire.io CLI installed by opt-in developers records full agent session transcripts to a dedicated branch and links checkpoints to commits.
- Captured events: user prompts, agent text responses, tool calls (file read/edit, shell commands, searches), token usage, streamed progress events, and some reasoning traces.
Linking to code
- Automatic sync of agent logs with git diffs provides line-level attribution of who authored each line in commits (agent vs human).
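This summary does not describe how the sync works internally. As a rough illustration of what line-level attribution involves, the sketch below matches the lines added by a commit against the lines the agent wrote during the session. The matching rule (exact match after stripping whitespace, each agent line usable once), the function name, and the input format are all assumptions made for illustration, not the paper's method.

```python
from collections import Counter


def attribute_lines(agent_written, committed_added):
    """Tag each line added by a commit as 'agent' or 'human'.

    agent_written: lines the agent wrote via edit tool calls in the session.
    committed_added: lines added by the commit diff, in order.

    Assumption: a committed line counts as agent-authored if an identical
    line (ignoring surrounding whitespace) is still unclaimed in the agent's
    output. Blank lines are attributed to the human for simplicity.
    """
    remaining = Counter(line.strip() for line in agent_written if line.strip())
    attribution = []
    for line in committed_added:
        key = line.strip()
        if key and remaining[key] > 0:
            remaining[key] -= 1  # consume one agent-written copy
            attribution.append((line, "agent"))
        else:
            attribution.append((line, "human"))
    return attribution


if __name__ == "__main__":
    agent = ["def f():", "    return 1"]
    commit = ["def f():", "    return 2", "    return 1"]
    for text, author in attribute_lines(agent, commit):
        print(f"{author:5} | {text}")
```

Using a multiset rather than a plain set keeps repeated lines (for example, several identical `return` statements) from all being credited to the agent when it wrote only one of them.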
Annotation pipeline
- Annotation tasks include session success (0–100), user persona, prompt intent, and user pushback categories.
- LLM “judges” (selected and validated for best zero-shot performance) annotate the full dataset; human labels are used for validation and inter-annotator agreement checks.
Evaluation metrics
- Code survival rate: fraction of agent-produced code that survives into commits (with breakdown into agent self-overwrites, human overwrites, human deletions).
- Coding efficiency: fraction of agent effort that ends up in the commit.
- Costs: tokens and dollar-costs per 100 committed lines; time per 100 committed lines; agent runtime.
- Safety: run Semgrep on pre- and post-commit snapshots to count security findings introduced per committed line.
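The metrics above reduce to simple ratios. The sketch below shows one plausible way to compute three of them; the function names, the input shapes, and the idea of comparing scanner findings as sets of fingerprints are assumptions for illustration, and the numbers in the usage example are made up rather than taken from the paper.

```python
def survival_rate(agent_lines_produced: int, agent_lines_committed: int) -> float:
    """Fraction of agent-produced lines that end up in commits."""
    if agent_lines_produced == 0:
        return 0.0
    return agent_lines_committed / agent_lines_produced


def per_100_committed_lines(quantity: float, committed_lines: int) -> float:
    """Normalize a quantity (tokens, dollars, seconds) per 100 committed lines."""
    if committed_lines <= 0:
        raise ValueError("committed_lines must be positive")
    return 100 * quantity / committed_lines


def findings_introduced_per_line(findings_before, findings_after, committed_lines: int) -> float:
    """Security findings present after a commit but not before, per committed line.

    findings_before / findings_after: iterables of hashable fingerprints
    (a hypothetical representation, e.g. one string per scanner finding).
    """
    if committed_lines <= 0:
        raise ValueError("committed_lines must be positive")
    introduced = set(findings_after) - set(findings_before)
    return len(introduced) / committed_lines


if __name__ == "__main__":
    # Illustrative numbers only.
    print(survival_rate(1000, 443))
    print(per_100_committed_lines(50_000, 250))
    print(findings_introduced_per_line({"a"}, {"a", "b", "c"}, 200))
```

Comparing findings as sets rather than subtracting counts avoids hiding a newly introduced finding behind an unrelated one that the same commit happened to fix.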
Implications for AI Economics
Productivity vs. quality trade-off
- Although agents can author large fractions of code (vibe coding), survival and safety metrics show quality-adjusted productivity is lower: much agent effort is discarded or introduces vulnerabilities. Economic analyses of agent adoption must account for quality-adjusted output, not raw lines generated.
Cost accounting must include downstream remediation
- Token/runtime costs are small per committed line but increased vulnerability rates imply added downstream costs (security reviews, bug fixes, incident response). Firms should model expected remediation costs and liability when estimating ROI from agent deployment.
Complementarity, not full substitution (for now)
- The prevalence of “expert nitpicker” behavior and frequent user pushback indicates that developers remain essential for steering, vetting, and correcting agent output. Labor models should emphasize augmentation (task reallocation, skill shifts) rather than immediate large-scale labor displacement.
Incentives for interaction design and procurement
- Agent interfaces that prompt clarifying questions, surface provenance, and reduce unnecessary autonomous work could improve economic efficiency (reduce token waste and vulnerability risk). Procurement decisions should favor agents and integrations that optimize interaction efficiency and safety.
Market and policy implications
- Increasing adoption of vibe coding suggests demand for tools that certify or audit agent outputs (security scanners, provenance tracking, insurance products). Regulators and insurers may need metrics like survival rate and vulnerability-introduction rates to assess systemic risk and to price cyber-insurance.
Benchmarking and R&D priorities
- Benchmarks focused narrowly on patch generation understate operational value and costs. Economics-driven R&D should fund work that improves agent comprehension, clarification behavior, and collaborative workflows (these reduce wasted effort and security costs).
Data-driven forecasting
- Living datasets like SWE-chat enable dynamic measurement of agent adoption, per-unit cost, and quality trends over time—critical inputs for modeling labor market impacts, firm-level adoption curves, and macro-level productivity estimates.
Assessment
Claims (9)
| Claim | Category | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|---|
| We present SWE-chat, the first large-scale dataset of real coding agent sessions collected from open-source developers in the wild. | Other | positive | high | existence and scale of the SWE-chat dataset (novel dataset release) | n=6000; 0.18 |
| The dataset currently contains 6,000 sessions, comprising more than 63,000 user prompts and 355,000 agent tool calls. | Other | positive | high | dataset size (sessions, prompts, agent tool calls) | n=6000; more than 63,000 user prompts; 355,000 agent tool calls; 0.3 |
| SWE-chat is a living dataset; our collection pipeline automatically and continually discovers and processes sessions from public repositories. | Other | positive | high | dataset collection process (automated, continual discovery from public repositories) | n=6000; 0.18 |
| Coding patterns are bimodal: in 41% of sessions, agents author virtually all committed code ("vibe coding"), while in 23%, humans write all code themselves. | Adoption Rate | mixed | high | distribution of code authorship across sessions (agent-dominant vs human-only sessions) | n=6000; 41% of sessions (agent-dominant); 23% of sessions (human-only); 0.18 |
| Despite rapidly improving capabilities, coding agents remain inefficient in natural settings. | Developer Productivity | negative | high | overall agent efficiency in natural developer workflows (qualitative synthesis) | n=6000; 0.18 |
| Just 44% of all agent-produced code survives into user commits. | Output Quality | negative | high | survival/usefulness of agent-produced code (proportion incorporated into commits) | n=6000; Just 44% of all agent-produced code; 0.18 |
| Agent-written code introduces more security vulnerabilities than code authored by humans. | Output Quality | negative | high | security vulnerabilities introduced by agent-written code versus human-written code | n=6000; 0.18 |
| Users push back against agent outputs -- through corrections, failure reports, and interruptions -- in 44% of all turns. | Worker Satisfaction | negative | high | rate of user pushback per interaction turn | n=63000; 44% of all turns; 0.18 |
| By capturing complete interaction traces with human vs. agent code authorship attribution, SWE-chat provides an empirical foundation for moving beyond curated benchmarks towards an evidence-based understanding of how AI agents perform in real developer workflows. | Research Productivity | positive | medium | utility of SWE-chat for empirical research and benchmark improvement | n=6000; 0.02 |