A benchmark built from real production coding sessions finds current foundation models solve roughly 53–72% of fail-to-pass tasks, and those that execute tests and static analysis perform substantially better, suggesting that exposing codebase-specific verification tools materially improves externally trained agents' effectiveness.
Benchmarks that reflect production workloads are better for evaluating AI coding agents in industrial settings, yet existing benchmarks differ from real usage in programming language distribution, prompt style, and codebase structure. This paper presents a methodology for curating production-derived benchmarks, illustrated through ProdCodeBench, a benchmark built from real sessions with a production AI coding assistant. We detail our data collection and curation practices, including LLM-based task classification, test relevance validation, and multi-run stability checks, which address challenges in constructing reliable evaluation signals from monorepo environments. Each curated sample consists of a verbatim prompt, a committed code change, and fail-to-pass tests spanning seven programming languages. Our systematic analysis of four foundation models yields solve rates from 53.2% to 72.2%, revealing that models making greater use of work validation tools, such as executing tests and invoking static analysis, achieve higher solve rates. This suggests that iterative verification helps achieve effective agent behavior and that exposing codebase-specific verification mechanisms may significantly improve the performance of externally trained agents operating in unfamiliar environments. We share our methodology and lessons learned to enable other organizations to construct similar production-derived benchmarks.
Summary
Main Finding
ProdCodeBench is a reproducible methodology (and a concrete benchmark instantiation) for creating execution-based, production-derived benchmarks for AI coding agents. When evaluated on real single-turn developer→agent sessions from a large industrial monorepo, foundation models achieved solve rates of roughly 53%–72% (GPT 5.1 Codex ≈53.2%, Claude Haiku ≈66.0%, Claude Sonnet ≈71.4%, Claude Opus ≈72.2%). Crucially, models that made more use of validation tools (running tests, static checks, iterative verification) and were evaluated in an IDE-like harness performed substantially better — suggesting that exposing verification mechanisms and richer tooling materially improves agent effectiveness in proprietary, out-of-distribution codebases.
Key Points
- Motivation: public benchmarks often differ from production usage in language mix, prompt style, and monorepo structure. Benchmarks derived from real sessions preserve verbatim prompts, language distribution, and realistic diffs.
- Task representation: each sample = verbatim prompt, committed solution diff (hidden at eval time by backing out the change), and a set of executable tests (Fail-to-Pass (F2P) and Pass-to-Pass (P2P)).
- Curation & filtering pipeline:
  - Source: single-turn developer-agent sessions instrumented in the IDE; track AI-provenance (which accepted suggestions were kept).
  - Prompt filters: remove references to solution diffs, templates, and non-testable prompts using an LLM classifier.
  - Test discovery: probabilistic retrieval + an LLM-based Test Relevance Agent to ensure selected tests actually exercise the diff.
  - Validation: execute tests pre- and post-diff multiple times; retain stable F2P tests as the primary automated signal; exclude flaky/inconsistent tests.
- Rolling benchmark design: samples are refreshed periodically because monorepo “time travel” (reproducing past environments) is often infeasible.
- Dataset composition: spans seven programming languages; ~75% of tasks have at least one F2P test; remaining tasks use P2P tests for regression assurance.
- Empirical findings:
  - Solve rates vary by model and harness; stronger, IDE-like harnesses (navigation, local diagnostics, validation tools) increase solve rates and shrink model gaps.
  - Tool usage: higher rates of “run test” and “validate changes” correlate with higher solve rates.
  - Context files (developer-authored docs about conventions/builds) help, especially with weak harnesses and company-specific conventions.
- Validation: LLM-powered curation was spot-checked by human annotators (>80% initial agreement), and no-op validation (run with no edits) produced 0% solves, supporting evaluation integrity.
- Limitations: dataset is not publicly released; monorepo constraints force rolling updates; only single-turn sessions were used (multi-turn omitted for reproducibility).
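The validation step in the curation pipeline above can be illustrated with a minimal Python sketch. It assumes each test's repeated runs are recorded as lists of pass/fail booleans; the names `Label` and `classify_test`, and the extra `OTHER` bucket, are hypothetical and not taken from the paper.

```python
from enum import Enum


class Label(Enum):
    F2P = "fail-to-pass"
    P2P = "pass-to-pass"
    FLAKY = "flaky"
    OTHER = "other"  # stable fail-to-fail or pass-to-fail (hypothetical bucket)


def classify_test(pre_runs: list[bool], post_runs: list[bool]) -> Label:
    """Label one test from repeated outcomes before and after the solution diff.

    pre_runs:  pass/fail results with the diff backed out.
    post_runs: pass/fail results with the diff applied.
    """
    if not pre_runs or not post_runs:
        raise ValueError("need at least one run on each side of the diff")

    # A test is treated as stable only if every run on a side agrees.
    pre_stable = len(set(pre_runs)) == 1
    post_stable = len(set(post_runs)) == 1
    if not (pre_stable and post_stable):
        return Label.FLAKY

    pre_passed, post_passed = pre_runs[0], post_runs[0]
    if not pre_passed and post_passed:
        return Label.F2P
    if pre_passed and post_passed:
        return Label.P2P
    return Label.OTHER


if __name__ == "__main__":
    print(classify_test([False, False, False], [True, True, True]))
    print(classify_test([False, True, False], [True, True, True]))
```

Under these assumptions, only tests labeled `F2P` would serve as the primary signal, `P2P` tests would act as regression checks, and `FLAKY` tests would be excluded.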
Data & Methods
- Data source: instrumented IDE logs of single-turn developer prompts and the subsequent committed diffs (only diffs that landed on main and were derived from a single conversation).
- AI provenance: IDE records which accepted AI suggestions became part of the landed diff to ensure linkage between prompt and committed change.
- Backout mechanism: to avoid leakage, the solution diff is backed out of the repo and the agent is evaluated on the backed-out state so it cannot read the actual committed solution.
- Test discovery and relevance:
  - Initial probabilistic retrieval of candidate tests.
  - Test Relevance Agent (LLM) uses diff content + code search to decide whether a test is likely impacted by the diff.
  - Tests executed on pre-change and post-change versions multiple times to classify as F2P, P2P, or flaky; only stable F2P (and selected P2P regression tests) are retained.
- Filters: prompt quality (no references to solution diffs, testable-only), diff compatibility (exclude diffs that can’t be cleanly backed out), test stability (multi-run), and template removal.
- Harnesses: two harnesses compared — Agent-Basic (limited tools, simpler search) vs Agent-IDE (richer toolset: navigation, diagnostics, formatting, knowledge search). Also evaluated Context Files usage.
- Model evaluations: three runs per model to measure variance; solve rates computed on the F2P subset; tool-usage metrics collected (read, search, run-test, validate, etc.).
- Manual validation: human annotation of sampled tasks to estimate classifier accuracy and task-type labels.
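As a rough illustration of the multi-run solve-rate computation described above, the sketch below assumes a task counts as solved only when all of its F2P tests pass, and that each run is stored as a mapping from task ID to a solved flag. Both assumptions, the function names, and the toy data are hypothetical; the exact scoring rule is not stated in this summary.

```python
from statistics import mean, pstdev


def task_solved(f2p_outcomes: list[bool]) -> bool:
    """Assumed criterion: solved only if the task has F2P tests and all pass."""
    return bool(f2p_outcomes) and all(f2p_outcomes)


def solve_rate(run_results: dict[str, bool]) -> float:
    """Fraction of tasks solved in a single run (task id -> solved flag)."""
    if not run_results:
        raise ValueError("run contains no tasks")
    return sum(run_results.values()) / len(run_results)


def summarize(runs: list[dict[str, bool]]) -> tuple[float, float]:
    """Mean and population standard deviation of the solve rate across runs."""
    rates = [solve_rate(run) for run in runs]
    return mean(rates), pstdev(rates)


# Toy data for three runs over four tasks; not results from the paper.
runs = [
    {"t1": True, "t2": False, "t3": True, "t4": True},
    {"t1": True, "t2": False, "t3": False, "t4": True},
    {"t1": True, "t2": True, "t3": True, "t4": False},
]

if __name__ == "__main__":
    avg, spread = summarize(runs)
    print(f"mean solve rate: {avg:.3f}, std dev across runs: {spread:.3f}")
```

Under the same assumption, a no-op run leaves every F2P test failing by construction, so `task_solved` returns `False` for every task, which matches the 0% no-op solve rate reported above.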
Implications for AI Economics
- Evaluation investment vs. decision speed: A production-derived rolling benchmark like ProdCodeBench reduces dependence on slow, costly A/B experiments and shadow deployments by providing faster, reproducible signals tailored to an organization’s codebase. Organizations should weigh the (one-time and maintenance) engineering cost of building such a benchmark against the recurring cost and risk of live experiments.
- Value of tool-enabled agents: The economic value of a coding agent is not only its base model quality but also its ability to use environment-specific tooling (test execution, static analyzers, repo navigation). Investments in exposing verification APIs and richer IDE integrations can yield outsized performance gains, which can improve developer productivity and reduce downstream bug/CI costs.
- Procurement and pricing: Buyers should consider model/tooling bundles. Models that demonstrably use verification tools and work well with IDE harnesses may command higher price premiums but deliver better on-codebase ROI. Vendors that enable or include runtime verification capabilities may be more valuable to enterprise customers with proprietary monorepos.
- Cost/benefit trade-offs in infrastructure: Running executable tests and stability checks for evaluations (and for agents in production) requires compute and engineering effort. However, the results indicate iterative verification substantially improves success rates, implying that the marginal cost of enabling test-run feedback loops may be justified by higher task completion and fewer manual fixes.
- Strategic implication for model developers: Because proprietary monorepos are out-of-distribution for public model pretraining, model builders and integrators should prioritize tool-use abilities, local diagnostics, and RL/finetuning with validation-in-the-loop. Providing robust tooling hooks (and well-documented context files) increases agent effectiveness and supports monetizable higher-tier offerings.
- Organizational capability & barriers to entry: Producing a high-quality, production-derived benchmark requires access to developer telemetry, test infrastructure, and the ability to safely back out diffs — resources typically available to large orgs. This creates a competitive barrier: firms that can build such benchmarks and tune agent integrations gain an operational advantage in selecting and deploying agents fitted to their stack.
- Policy & risk management: Execution-based evaluation reduces reliance on subjective LLM judges and improves reproducibility. From a risk-management perspective, that lowers uncertainty in costly deployment decisions and can be factored into project economics and expected value calculations.
Limitations & caveats (economic lens)
- The dataset and results are not public, so external researchers or buyers cannot directly benchmark vendor claims against these exact tasks; they must either trust reported metrics or invest in their own ProdCodeBench-style pipelines.
- Monorepo rolling design implies ongoing maintenance cost; firms should factor recurring refresh and validation costs into the economic model.
Assessment
Claims (10)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| Benchmarks that reflect production workloads are better for evaluating AI coding agents in industrial settings. [Organizational Efficiency] | positive | high | quality of evaluation for AI coding agents (suitability of benchmark) | 0.03 |
| Existing benchmarks differ from real usage in programming language distribution, prompt style and codebase structure. [Other] | negative | high | representativeness of benchmarks relative to real usage | 0.18 |
| We present ProdCodeBench, a benchmark built from real sessions with a production AI coding assistant. [Other] | positive | high | existence and provenance of benchmark (production-derived dataset) | 0.18 |
| Each curated sample consists of a verbatim prompt, a committed code change and fail-to-pass tests spanning seven programming languages. [Other] | positive | high | dataset composition (prompt, code change, tests) and language coverage (7 languages) | seven programming languages; 0.3 |
| Systematic analysis of four foundation models yields solve rates from 53.2% to 72.2%. [Developer Productivity] | positive | high | solve rate (task success rate) | 53.2% to 72.2%; 0.3 |
| Models making greater use of work validation tools, such as executing tests and invoking static analysis, achieve higher solve rates. [Developer Productivity] | positive | high | solve rate (task success) as a function of verification tool usage | 0.18 |
| Iterative verification helps achieve effective agent behavior. [Developer Productivity] | positive | medium | agent effectiveness (behavior leading to task success) | 0.11 |
| Exposing codebase-specific verification mechanisms may significantly improve the performance of externally trained agents operating in unfamiliar environments. [Organizational Efficiency] | positive | medium | performance of externally trained agents in unfamiliar codebases | 0.02 |
| We detail data collection and curation practices including LLM-based task classification, test relevance validation, and multi-run stability checks to address challenges in constructing reliable evaluation signals from monorepo environments. [Other] | positive | high | reliability of evaluation signals derived from monorepo environments | 0.3 |
| We share our methodology and lessons learned to enable other organizations to construct similar production-derived benchmarks. [Adoption Rate] | positive | high | ability of other organizations to construct similar benchmarks | 0.09 |