A benchmark built from real production coding sessions finds that current foundation models solve roughly 53–72% of fail-to-pass tasks, and that models which run tests and static analysis more often achieve higher solve rates, suggesting that exposing codebase-specific verification tools materially improves the effectiveness of externally trained agents.

ProdCodeBench: A Production-Derived Benchmark for Evaluating AI Coding Agents

Smriti Jha, Matteo Paltenghi, Chandra Maddila, Vijayaraghavan Murali, Shubham Ugare, Satish Chandra · April 02, 2026

arXiv · descriptive · medium evidence · relevance 7/10

A production-derived benchmark (ProdCodeBench) shows foundation models solve 53.2–72.2% of real fail-to-pass coding tasks, and models that actively run tests and static analysis achieve higher solve rates, implying iterative verification improves agent effectiveness in unfamiliar codebases.

Benchmarks that reflect production workloads are better for evaluating AI coding agents in industrial settings, yet existing benchmarks differ from real usage in programming language distribution, prompt style and codebase structure. This paper presents a methodology for curating production-derived benchmarks, illustrated through ProdCodeBench - a benchmark built from real sessions with a production AI coding assistant. We detail our data collection and curation practices including LLM-based task classification, test relevance validation, and multi-run stability checks which address challenges in constructing reliable evaluation signals from monorepo environments. Each curated sample consists of a verbatim prompt, a committed code change and fail-to-pass tests spanning seven programming languages. Our systematic analysis of four foundation models yields solve rates from 53.2% to 72.2% revealing that models making greater use of work validation tools, such as executing tests and invoking static analysis, achieve higher solve rates. This suggests that iterative verification helps achieve effective agent behavior and that exposing codebase-specific verification mechanisms may significantly improve the performance of externally trained agents operating in unfamiliar environments. We share our methodology and lessons learned to enable other organizations to construct similar production-derived benchmarks.

Summary

Main Finding

ProdCodeBench is a reproducible methodology (and a concrete benchmark instantiation) for creating execution-based, production-derived benchmarks for AI coding agents. When evaluated on real single-turn developer→agent sessions from a large industrial monorepo, foundation models achieved solve rates of roughly 53%–72% (GPT 5.1 Codex ≈53.2%, Claude Haiku ≈66.0%, Claude Sonnet ≈71.4%, Claude Opus ≈72.2%). Crucially, models that made more use of validation tools (running tests, static checks, iterative verification) and were evaluated in an IDE-like harness performed substantially better — suggesting that exposing verification mechanisms and richer tooling materially improves agent effectiveness in proprietary, out-of-distribution codebases.

Key Points

Motivation: public benchmarks often differ from production usage in language mix, prompt style, and monorepo structure. Benchmarks derived from real sessions preserve verbatim prompts, language distribution, and realistic diffs.
Task representation: each sample consists of a verbatim prompt, the committed solution diff (hidden at evaluation time by backing out the change), and a set of executable Fail-to-Pass (F2P) and Pass-to-Pass (P2P) tests.
Curation & filtering pipeline:
- Source: single-turn developer-agent sessions instrumented in the IDE; track AI-provenance (which accepted suggestions were kept).
- Prompt filters: remove references to solution diffs, templates, and non-testable prompts using an LLM classifier.
- Test discovery: probabilistic retrieval + an LLM-based Test Relevance Agent to ensure selected tests actually exercise the diff.
- Validation: execute tests pre- and post-diff multiple times; retain stable F2P tests as the primary automated signal; exclude flaky/inconsistent tests.
- Rolling benchmark design: samples are refreshed periodically because monorepo “time travel” (reproducing past environments) is often infeasible.
Dataset composition: spans seven programming languages; ~75% of tasks have at least one F2P test; remaining tasks use P2P tests for regression assurance.
Empirical findings:
- Solve rates vary by model and harness; stronger, IDE-like harnesses (navigation, local diagnostics, validation tools) increase solve rates and shrink model gaps.
- Tool usage: higher rates of “run test” and “validate changes” correlate with higher solve rates.
- Context files (developer-authored docs about conventions/builds) help, especially with weak harnesses and company-specific conventions.
Validation: LLM-powered curation was spot-checked by human annotators (>80% initial agreement), and no-op validation (run with no edits) produced 0% solves, supporting evaluation integrity.
Limitations: dataset is not publicly released; monorepo constraints force rolling updates; only single-turn sessions were used (multi-turn omitted for reproducibility).
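The multi-run validation step in the curation pipeline above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Label` enum, the `classify_test` function, and the rule that every repeated run must agree before a test counts as stable are all assumptions. The source states only that tests are executed pre- and post-diff multiple times and that flaky or inconsistent tests are excluded.

```python
from enum import Enum
from typing import Sequence


class Label(Enum):
    F2P = "fail-to-pass"
    P2P = "pass-to-pass"
    FLAKY = "flaky"
    OTHER = "other"  # stable fail-to-fail or pass-to-fail


def classify_test(pre_runs: Sequence[bool], post_runs: Sequence[bool]) -> Label:
    """Label one test from repeated runs (True means the test passed).

    pre_runs:  outcomes on the backed-out (pre-diff) repository state
    post_runs: outcomes with the committed diff applied
    """
    if not pre_runs or not post_runs:
        raise ValueError("need at least one run on each side")

    # Assumed stability rule: all runs on a given side must agree.
    pre_stable = len(set(pre_runs)) == 1
    post_stable = len(set(post_runs)) == 1
    if not (pre_stable and post_stable):
        return Label.FLAKY

    pre, post = pre_runs[0], post_runs[0]
    if not pre and post:
        return Label.F2P
    if pre and post:
        return Label.P2P
    return Label.OTHER


# Hypothetical outcomes from three runs on each side.
print(classify_test([False, False, False], [True, True, True]).value)
print(classify_test([True, False, True], [True, True, True]).value)
```

Requiring unanimous agreement is the strictest possible rule; a real pipeline might tolerate a small failure rate or use more runs, which the source does not specify.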

Data & Methods

Data source: instrumented IDE logs of single-turn developer prompts and the resulting committed diffs (only diffs that landed on main and derived from a single conversation).
AI provenance: IDE records which accepted AI suggestions became part of the landed diff to ensure linkage between prompt and committed change.
Backout mechanism: to avoid leakage, the solution diff is backed out of the repo and the agent is evaluated on the backed-out state so it cannot read the actual committed solution.
Test discovery and relevance:
- Initial probabilistic retrieval of candidate tests.
- Test Relevance Agent (LLM) uses diff content + code search to decide whether a test is likely impacted by the diff.
- Tests executed on pre-change and post-change versions multiple times to classify as F2P, P2P, or flaky — only stable F2P (and selected P2P regression tests) are retained.
Filters: prompt quality (no references to solution diffs, testable-only), diff compatibility (exclude diffs that can’t be cleanly backed out), test stability (multi-run), and template removal.
Harnesses: two harnesses compared — Agent-Basic (limited tools, simpler search) vs Agent-IDE (richer toolset: navigation, diagnostics, formatting, knowledge search). Also evaluated Context Files usage.
Model evaluations: three runs per model to measure variance; solve rates computed on the F2P subset; tool-usage metrics collected (read, search, run-test, validate, etc.).
Manual validation: human annotation of sampled tasks to estimate classifier accuracy and task-type labels.
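The solve-rate computation described above (several runs per model, scored on the F2P subset) can be sketched as below. The function names, the task IDs, and the choice to count tasks missing from a run as unsolved are assumptions made for illustration; the source does not describe how results are aggregated beyond reporting three runs per model.

```python
from statistics import mean, pstdev
from typing import Dict, List, Set


def solve_rate(run: Dict[str, bool], f2p_tasks: Set[str]) -> float:
    """Fraction of F2P tasks solved in one run.

    `run` maps task ID -> True if all of the task's F2P tests passed
    after the agent's edit. Tasks absent from `run` count as unsolved.
    """
    if not f2p_tasks:
        raise ValueError("F2P subset is empty")
    solved = sum(1 for task in f2p_tasks if run.get(task, False))
    return solved / len(f2p_tasks)


def summarize(runs: List[Dict[str, bool]], f2p_tasks: Set[str]) -> Dict[str, float]:
    """Mean and spread of the solve rate across repeated runs."""
    if not runs:
        raise ValueError("need at least one run")
    rates = [solve_rate(run, f2p_tasks) for run in runs]
    return {
        "mean": mean(rates),
        "std": pstdev(rates),  # population standard deviation across runs
        "min": min(rates),
        "max": max(rates),
    }


# Hypothetical task IDs and outcomes for three runs of one model.
tasks = {"t1", "t2", "t3", "t4"}
runs = [
    {"t1": True, "t2": True, "t3": False, "t4": False},
    {"t1": True, "t2": True, "t3": True, "t4": False},
    {"t1": True, "t2": False, "t3": True, "t4": True},
]
print(summarize(runs, tasks))

# A no-op run (the agent makes no edits) should solve nothing,
# since every F2P test fails on the backed-out state.
assert solve_rate({}, tasks) == 0.0
```

The final assertion mirrors the no-op validation reported in the summary: with no edits applied, the solve rate on F2P tasks should be 0%.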

Implications for AI Economics

Evaluation investment vs. decision speed: A production-derived rolling benchmark like ProdCodeBench reduces dependence on slow, costly A/B experiments and shadow deployments by providing faster, reproducible signals tailored to an organization’s codebase. Organizations should weigh the (one-time and maintenance) engineering cost of building such a benchmark against the recurring cost and risk of live experiments.
Value of tool-enabled agents: The economic value of a coding agent is not only its base model quality but also its ability to use environment-specific tooling (test execution, static analyzers, repo navigation). Investments in exposing verification APIs and richer IDE integrations can yield outsized performance gains, which can improve developer productivity and reduce downstream bug/CI costs.
Procurement and pricing: Buyers should consider model/tooling bundles. Models that demonstrably use verification tools and work well with IDE harnesses may command price premiums but deliver better on-codebase ROI. Vendors that enable or include runtime verification capabilities may be more valuable to enterprise customers with proprietary monorepos.
Cost/benefit trade-offs in infrastructure: Running executable tests and stability checks for evaluations (and for agents in production) requires compute and engineering effort. However, the results associate greater use of iterative verification with higher solve rates, which suggests that the marginal cost of enabling test-run feedback loops may be justified by higher task completion and fewer manual fixes. Because the evidence is correlational, this should be read as a hypothesis to test locally rather than a guaranteed return.
Strategic implication for model developers: Because proprietary monorepos are out-of-distribution for public model pretraining, model builders and integrators should prioritize tool-use abilities, local diagnostics, and RL/finetuning with validation-in-the-loop. Providing robust tooling hooks (and well-documented context files) increases agent effectiveness and supports monetizable higher-tier offerings.
Organizational capability & barriers to entry: Producing a high-quality, production-derived benchmark requires access to developer telemetry, test infrastructure, and the ability to safely back out diffs — resources typically available to large orgs. This creates a competitive barrier: firms that can build such benchmarks and tune agent integrations gain an operational advantage in selecting and deploying agents fitted to their stack.
Policy & risk management: Execution-based evaluation reduces reliance on subjective LLM judges and improves reproducibility. From a risk-management perspective, that lowers uncertainty in costly deployment decisions and can be factored into project economics and expected value calculations.

Limitations & caveats (economic lens)

- The dataset and results are not public, so external researchers or buyers cannot directly benchmark vendor claims against these exact tasks; they must either trust reported metrics or invest in their own ProdCodeBench-style pipelines.
- Monorepo rolling design implies ongoing maintenance cost; firms should factor recurring refresh and validation costs into the economic model.


Assessment

Paper Type: descriptive
Evidence Strength: medium. Uses production-derived interaction data, careful curation (LLM-based classification, test relevance validation, multi-run stability checks) and reports clear performance metrics across four foundation models, providing credible descriptive evidence about model behavior; however, findings are observational and correlational (no causal identification), based on a specific production assistant and sample selection rules, and may be confounded by model architecture/training differences and selection biases.
Methods Rigor: medium. The paper applies systematic curation steps (verbatim prompts, committed code changes, fail-to-pass tests), automated and human-in-the-loop validation, and multi-run stability checks, which are good practice for benchmark construction; nevertheless, important details that affect reproducibility and bias assessment (sampling frame, labeling error rates for LLM-based classification, number of samples, exact selection criteria, and model configuration differences) are not fully specified in the summary, limiting assessment of rigor.
Sample: ProdCodeBench, a benchmark curated from real sessions with a production AI coding assistant, where each sample contains the original user prompt, the committed code change, and fail-to-pass tests; samples span seven programming languages and originate from monorepo production environments; evaluated against four foundation models with reported solve rates between 53.2% and 72.2%; curation relied on LLM-based task classification, test relevance validation, and multi-run stability checks.
Themes: productivity, human_ai_collab
Generalizability:
- Derived from a single organization’s production assistant and monorepo workflows, so results may not generalize to open-source or other enterprise codebases.
- Selection bias toward sessions that produced commits and fail-to-pass tests excludes many real-world interactions (e.g., exploratory prompts, discarded suggestions).
- Coverage is limited to seven programming languages; performance may differ in other languages or stacks.
- Evaluated on four foundation models only; model-specific training choices and tool integrations could drive observed differences.
- Benchmarks reflect the structure and tooling of the original codebases (tests, static analysis hooks), which other repositories may lack.
- Temporal and geographic variation in developer practices is not represented; workflow and prompt styles may vary across teams and eras.

Claims (10)

Claim	Category	Direction	Confidence	Outcome	Details
Benchmarks that reflect production workloads are better for evaluating AI coding agents in industrial settings.	Organizational Efficiency	positive	high	quality of evaluation for AI coding agents (suitability of benchmark)	0.03
Existing benchmarks differ from real usage in programming language distribution, prompt style and codebase structure.	Other	negative	high	representativeness of benchmarks relative to real usage	0.18
We present ProdCodeBench, a benchmark built from real sessions with a production AI coding assistant.	Other	positive	high	existence and provenance of benchmark (production-derived dataset)	0.18
Each curated sample consists of a verbatim prompt, a committed code change and fail-to-pass tests spanning seven programming languages.	Other	positive	high	dataset composition (prompt, code change, tests) and language coverage (7 languages)	seven programming languages 0.3
Systematic analysis of four foundation models yields solve rates from 53.2% to 72.2%.	Developer Productivity	positive	high	solve rate (task success rate)	53.2% to 72.2% 0.3
Models making greater use of work validation tools, such as executing tests and invoking static analysis, achieve higher solve rates.	Developer Productivity	positive	high	solve rate (task success) as a function of verification tool usage	0.18
Iterative verification helps achieve effective agent behavior.	Developer Productivity	positive	medium	agent effectiveness (behavior leading to task success)	0.11
Exposing codebase-specific verification mechanisms may significantly improve the performance of externally trained agents operating in unfamiliar environments.	Organizational Efficiency	positive	medium	performance of externally trained agents in unfamiliar codebases	0.02
We detail data collection and curation practices including LLM-based task classification, test relevance validation, and multi-run stability checks to address challenges in constructing reliable evaluation signals from monorepo environments.	Other	positive	high	reliability of evaluation signals derived from monorepo environments	0.3
We share our methodology and lessons learned to enable other organizations to construct similar production-derived benchmarks.	Adoption Rate	positive	high	ability of other organizations to construct similar benchmarks	0.09