Next‑generation software‑engineering agents need triadic training data — synchronized human–human context, human–AI interactions, and long‑horizon multi‑engineer work — not bigger GitHub scrapes or solo-agent traces; the authors recommend expert stimulated‑recall recordings and simulated cross‑functional teams plus mechanical verification, corpus characterization, probe experiments, and pre‑registered blind evaluation as minimum quality standards.
Frontier software engineering agents have saturated short-horizon benchmarks while regressing on the work that constitutes senior engineering: long-horizon, multi-engineer, ambiguous-specification deliverables. This paper takes a position on what training data is needed to close the gap. The substrate for the next generation of SWE agents is neither larger GitHub scrapes nor more solo-agent trajectories nor -- sufficient by itself -- open human-AI dialogue logs. It is triadic data: synchronized capture of the human-human conversations where engineering context is formed, the human-AI sessions where that context is partially consumed, and the multi-week cross-functional work that surrounds both. We argue that the canonical instantiation of triadic data is two complementary products: long-horizon expert trajectories captured under stimulated-recall protocols, and simulated cross-functional companies -- instrumented teams of senior engineers, product managers, designers, and data scientists working through ambiguous deliverables on shared infrastructure. We further specify a four-tier evidence framework through which any such corpus -- triadic or otherwise -- must justify its quality to a fine-tuning researcher: mechanical verification, statistical corpus characterization, probe experiments, and pre-registered blind evaluation. We argue that this data is capturable in 12-18 months with methods already mature in adjacent fields, that it is the empirical key to four open questions in agent training, and that the field's near-term research agenda should include it explicitly.
Summary
Main Finding
Frontier software-engineering (SWE) agents fail on long-horizon, cross-functional engineering work because current training data misses the human–human conversations that create engineering context. The paper argues the high-leverage remedy is "triadic data"—synchronized captures of (human–human deliberation, human–AI sessions, and the multi-week work that surrounds them)—collected primarily via two products (long-horizon expert trajectories with stimulated recall; simulated cross‑functional companies) and validated through a four‑tier evidence framework. The author estimates such a corpus is capturable in 12–18 months and that this data is the critical input to close the short-vs-long horizon capability gap.
Key Points
- The short‑horizon problem is largely solved by frontier models; long‑horizon tasks (hours to weeks) remain where models fail. Representative numbers: near‑100% success for tasks under ~4 minutes vs <10% for tasks >4 hours (METR); SWE‑EVO shows best models at ~25% on long tasks vs ~73% on standard short benchmarks.
- Existing data sources are insufficient:
- GitHub scrapes capture artifacts (code) not deliberation and show diminishing marginal returns and leakage/quality issues.
- Solo-agent self‑trajectories risk amplifying existing failure modes (cold-start, bias).
- Dyadic human–AI logs are necessary but omit the upstream human–human deliberation that forms context.
- Triadic frame: the unit of training data should be the human–human–AI triad. Three configurations:
- A: Pair-with-AI (real-time triad; meta-conversation between humans about using the AI).
- B: Human–human-then-AI (deliberation first, then AI-assisted implementation; captures context transfer/loss).
- C: Human–human-around-AI (longitudinal, multi-channel, cross‑functional work across days/weeks).
- Two complementary data products:
- Long‑horizon expert trajectories using instrumented capture plus stimulated‑recall walkthroughs to produce timestamped IDE events, screen/video, recall transcripts, final diffs, and step annotations (focused on multi-hour sessions; cost ~ $1,000 per 4‑hour trajectory).
- Simulated cross‑functional companies: fictional companies staffed by real senior contributors working 1–3 week deliverables with full channel instrumentation (Slack, calls, design tools, Git/GitHub) to capture negotiation, evolving requirements, and linkages across artifacts.
- Triadic data properties: synchrony (time-aligned multimodal capture), consent & capture-time sanitization (automated NER/redaction + human verification; simulated companies reduce IP issues), and ambiguity tolerance (preserve inter‑rater disagreement rather than majority-voting away ambiguity).
- Four‑tier evidence framework for corpus credibility:
- T1 Mechanical verification: compile, parse AST, tests, leakage/license/PII scans, conversational coherence.
- T2 Statistical characterization: distributions by language/framework/difficulty, contamination audits, embedding/clustering diversity metrics.
- T3 Probe experiments: small‑model fine‑tuning, learning curves, ablations to show signal.
- T4 Pre‑registered blind evaluation: holdout projects with pre-registered tasks and blind downstream assessment to measure true uplift.
- The paper does not reject dyadic data collection, but positions triadic capture as the missing high‑value substrate for long‑horizon SWE training.
Data & Methods
- Capture modalities: sub‑second aligned audio per participant, screen/IDE telemetry, terminal output, structured action streams, video, transcripts, recall audio/text synchronized to events.
- Stimulated‑recall protocol: passive instrumented session followed within ~30 minutes by a participant watching segments and verbalizing reasoning (reduces distortion vs concurrent think‑aloud while producing richer traces).
- Simulated companies: recruit senior practitioners in their actual roles; run multi‑week projects under consent; instrument all channels; fictionalize product/customer to mitigate IP risk; retain links across channels (Slack → design doc → ticket → PR → commit).
- Sanitization approach: on‑capture redaction overlays for hostnames/functions/customer names, automated ASR diarization and silencing of customer PII, human-in‑the‑loop audits on calibrated samples; preserve disagreement in annotations.
- Cost estimates and economic constraints: per-trajectory capture is costly (order‑of‑magnitude $1k for a 4‑hour recorded trajectory); simulated projects incur multi‑person multi‑week costs. The proposal argues pilot-scale collection is tractable and evaluation gating (T3/T4) should decide scaling.
- Evaluation & verification: combine mechanical checks (compilation, tests), corpus-level statistics (difficulty vs baseline model), probing via fine‑tuning small models to detect learning signal, and pre‑registered blind tests on held‑out long‑horizon tasks.
Implications for AI Economics
- Marginal returns and investment prioritization:
- Diminishing returns on enlarging artifact corpora (GitHub scrapes) suggest higher marginal value from collecting triadic data. Labs and funders should reallocate a portion of data budgets toward expensive, high‑signal captures (expert trajectories + simulated companies).
- Pilot-scale investment (12–18 months) with pre-registered blind evaluation provides a clear go/no-go decision framework to avoid sunk‑cost escalation.
- Value capture and market structure:
- Triadic corpora are costly to produce and have high coordination/consent frictions, so they could become proprietary competitive advantages for labs that invest early—raising barriers to entry or creating licensing markets for "contextual engineering" corpora.
- Simulated companies create recurring demand for senior practitioner time; the labor market could see new revenue streams for experts participating in data capture (and potential frictions if firms restrict employee participation).
- Training economics and model design:
- Triadic data specifically targets long‑horizon state handling and cross‑functional coordination—areas where fine‑tuning or post‑training signals may be more effective than further base‑model scale. This affects the optimal allocation between compute (model scale) and data collection.
- The four‑tier evidence requirement aligns incentives toward transparent productization of datasets (with mechanical + statistical + probe + blind evidence), which should reduce wasted downstream fine‑tuning costs and enable better cost/benefit estimation for labs.
- Externalities, governance, and privacy:
- Capture-time sanitization and simulated-company designs can mitigate IP/PII externalities, but regulators and firms will likely demand stricter provenance, consent records, and auditing—raising production costs.
- Public-interest value: making parts of triadic corpora (or at least the evaluation gates) open could accelerate broad capability gains and distribute economic benefits; conversely, privatized corpora may concentrate advantages.
- Labor substitution and augmentation:
- Better long‑horizon agents trained on triadic data could substitute for some senior engineering time on routine cross‑functional tasks but will likely first act as higher‑leverage assistants that change skill composition (increasing demand for people who can work with/oversee agents and for roles that supply the triadic signal).
- Policy and business recommendations (practical implications):
- Fund and run pilot collections (expert trajectories + a small set of simulated companies) with pre‑registered blind evaluations to measure ROI before scaling.
- Encourage standards for the four‑tier evidence framework to reduce asymmetric information between dataset sellers and model fine‑tuners.
- Explore licensing models that balance commercial incentives and public access (e.g., vetted research releases or tiered licenses).
- Anticipate and manage labor-market frictions by compensating participating senior engineers and clarifying IP/consent rules.
Overall, the paper reframes the SWE data problem from "more code or more dyadic logs" to "capture the conversations that produce engineering context." For AI economics, that implies a near-term, capital‑intensive reallocation toward expensive but high‑value triadic data collection, with important implications for returns to data investment, competitive advantage, labor markets, and governance.
Assessment
Claims (7)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| Frontier software engineering agents have saturated short-horizon benchmarks while regressing on the work that constitutes senior engineering: long-horizon, multi-engineer, ambiguous-specification deliverables. Developer Productivity | negative | high | performance on short-horizon benchmarks versus performance on long-horizon, multi-engineer, ambiguous-specification engineering deliverables |
0.06
|
| The substrate for the next generation of software-engineering (SWE) agents is neither larger GitHub scrapes nor more solo-agent trajectories nor -- sufficient by itself -- open human-AI dialogue logs; it is triadic data: synchronized capture of the human-human conversations where engineering context is formed, the human-AI sessions where that context is partially consumed, and the multi-week cross-functional work that surrounds both. Training Effectiveness | positive | high | effectiveness of training data substrates for improving agent performance on long-horizon, cross-functional engineering tasks |
0.01
|
| The canonical instantiation of triadic data is two complementary products: long-horizon expert trajectories captured under stimulated-recall protocols, and simulated cross-functional companies -- instrumented teams of senior engineers, product managers, designers, and data scientists working through ambiguous deliverables on shared infrastructure. Adoption Rate | positive | high | availability and suitability of dataset modalities (stimulated-recall expert trajectories and simulated instrumented teams) for training next-generation SWE agents |
0.01
|
| Any such corpus -- triadic or otherwise -- must justify its quality to a fine-tuning researcher through a four-tier evidence framework: mechanical verification, statistical corpus characterization, probe experiments, and pre-registered blind evaluation. Governance And Regulation | positive | high | quality and trustworthiness of fine-tuning corpora as judged by the four-tier framework |
0.03
|
| This triadic data is capturable in 12-18 months with methods already mature in adjacent fields. Training Effectiveness | positive | high | time required to collect a triadic dataset using existing methods |
12-18 months
0.01
|
| This data is the empirical key to four open questions in agent training. Research Productivity | positive | high | resolvability of four open questions in agent training using triadic data |
0.01
|
| The field's near-term research agenda should explicitly include collecting and using triadic data. Adoption Rate | positive | high | inclusion of triadic data collection/use in near-term research agendas in the SWE agent research community |
0.01
|