Re-Evaluating EVMBench: Are AI Agents Ready for Smart Contract Security?

EVMbench, released by OpenAI, Paradigm, and OtterSec, is the first large-scale benchmark for AI agents on smart contract security. Its results -- agents detect up to 45.6% of vulnerabilities and exploit 72.2% of a curated subset -- have fueled expectations that fully automated AI auditing is within reach. We identify two limitations: its narrow evaluation scope (14 agent configurations, most models tested on only their vendor scaffold) and its reliance on audit-contest data published before every model's release that models may have seen during training. To address these, we expand to 26 configurations across four model families and three scaffolds, and introduce a contamination-free dataset of 22 real-world security incidents postdating every model's release date. Our evaluation yields three findings: (1) agents' detection results are not stable, with rankings shifting across configurations, tasks, and datasets; (2) on real-world incidents, no agent succeeds at end-to-end exploitation across all 110 agent-incident pairs despite detecting up to 65% of vulnerabilities, contradicting EVMbench's conclusion that discovery is the primary bottleneck; and (3) scaffolding materially affects results, with an open-source scaffold outperforming vendor alternatives by up to 5 percentage points, yet EVMbench does not control for this. These findings challenge the narrative that fully automated AI auditing is imminent. Agents reliably catch well-known patterns and respond strongly to human-provided context, but cannot replace human judgment. For developers, agent scans serve as a pre-deployment check. For audit firms, agents are most effective within a human-in-the-loop workflow where AI handles breadth and human auditors contribute protocol-specific knowledge and adversarial reasoning. Code and data: https://github.com/blocksecteam/ReEVMBench/.

Summary

Main Finding

The expectation, fueled by EVMbench's results, that fully automated AI auditing is imminent is overstated. Once evaluation breadth is expanded and training-data contamination is removed, agent performance is unstable across configurations and tasks, agents fail to complete end-to-end exploits on real-world incidents, and results are sensitive to scaffolding. AI agents are useful as breadth tools and pre-deployment checks but cannot replace human auditors; they are best used in human-in-the-loop workflows.

Key Points

Background: EVMbench (OpenAI, Paradigm, OtterSec) was the first large-scale benchmark for AI agents on smart-contract security. It reported agents detecting up to 45.6% of vulnerabilities and exploiting 72.2% of a curated subset.
Limitations of original EVMbench:
- Narrow evaluation: 14 agent configurations, most models evaluated only on their vendor-provided scaffold.
- Data contamination risk: benchmark relied on audit-contest data published before every model’s release, which models may have seen during training.
Study improvements:
- Expanded to 26 agent configurations spanning four model families and three scaffolds.
- Constructed a contamination-free dataset of 22 real-world security incidents that postdate every model’s release.
New evaluation findings:
- Instability: detection results are not stable; rankings shift across model configurations, tasks, and datasets, so results are not robust to evaluation choices.
- Real-world failures: on the 22 post-release incidents, none of the 110 agent–incident pairs produced a successful end-to-end exploit, even though agents detected up to 65% of vulnerabilities in some settings. This contradicts EVMbench's conclusion that discovery (finding the bugs) is the main bottleneck.
- Scaffold sensitivity: choice of scaffold materially affects outcomes. An open-source scaffold outperformed vendor scaffolds by up to 5 percentage points, and EVMbench did not control for scaffold effects.
Practical takeaway: agents reliably catch well-known patterns and respond strongly to human-provided context but lack the protocol-specific and adversarial reasoning needed to replace human judgment.
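
The scaffold comparison in the key points can be made concrete with a small sketch. It computes, for each model, how many percentage points a non-vendor scaffold gains or loses against the vendor scaffold on the same model. The function name `scaffold_gaps`, the model and scaffold labels, and every number below are hypothetical placeholders, not results from the study.

```python
def scaffold_gaps(results, baseline_scaffold):
    """Return score minus baseline score (percentage points) per (model, scaffold).

    `results` maps (model, scaffold) -> detection rate in percent.
    Models without a run on the baseline scaffold are skipped.
    """
    gaps = {}
    for (model, scaffold), score in results.items():
        if scaffold == baseline_scaffold:
            continue
        base = results.get((model, baseline_scaffold))
        if base is None:
            continue  # no baseline run for this model, so no gap to report
        gaps[(model, scaffold)] = round(score - base, 1)
    return gaps


# Hypothetical detection rates (%), for illustration only.
RESULTS = {
    ("model-a", "vendor"): 40.0,
    ("model-a", "open-source"): 45.0,
    ("model-b", "vendor"): 38.0,
    ("model-b", "open-source"): 36.5,
    ("model-c", "open-source"): 30.0,  # never run on the vendor scaffold
}

gaps = scaffold_gaps(RESULTS, baseline_scaffold="vendor")
largest_gain = max(gaps.values())
print(gaps, largest_gain)
```

Comparing scaffolds only within the same model keeps model quality from being mistaken for a scaffold effect, which is the control the summary says EVMbench lacks.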

Data & Methods

Expanded evaluation matrix: 26 agent configurations, four model families, three scaffolding approaches (vendor scaffolds + at least one open-source scaffold).
Contamination control: curated a dataset of 22 real-world smart-contract security incidents that occurred after the release date of every evaluated model to ensure no leakage into model training data.
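
A minimal sketch of the date-based contamination filter described above, assuming each incident carries a date and each model a release date. The `Incident` class, the `contamination_free` helper, and all names and dates are hypothetical; the study's actual curation pipeline may differ.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Incident:
    name: str
    occurred_on: date


def contamination_free(incidents, model_release_dates):
    """Keep only incidents that occurred strictly after the latest model release."""
    cutoff = max(model_release_dates.values())
    return [i for i in incidents if i.occurred_on > cutoff]


# Hypothetical release dates and incidents, for illustration only.
MODEL_RELEASES = {
    "model-a": date(2025, 3, 1),
    "model-b": date(2025, 6, 15),
}
INCIDENTS = [
    Incident("incident-old", date(2025, 5, 20)),   # before model-b's release: dropped
    Incident("incident-edge", date(2025, 6, 15)),  # same day as the cutoff: dropped
    Incident("incident-new", date(2025, 7, 2)),    # after every release: kept
]

kept = contamination_free(INCIDENTS, MODEL_RELEASES)
print([i.name for i in kept])
```

Using the latest release date as a single cutoff is the conservative choice: an incident is kept only if no evaluated model could have been released after it.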
Tasks measured:
- Vulnerability detection rates (finding potential issues).
- Exploitation success (end-to-end exploit generation and execution) on a curated subset and on the contamination-free incidents.
Outcome metrics reported: percentage of vulnerabilities detected, percentage of successful exploitations, and stability of rankings across configurations and datasets.
Code & data released: https://github.com/blocksecteam/ReEVMBench/
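
The outcome metrics listed above can be sketched as follows. Detection and exploitation rates are simple percentages over per-task outcomes. For ranking stability this sketch uses Kendall's tau between two sets of agent scores, which is one common choice and is an assumption here, not necessarily the measure the authors used. Agent names and scores are hypothetical.

```python
from itertools import combinations


def rate(outcomes):
    """Percentage of True values in a non-empty list of booleans."""
    return 100.0 * sum(outcomes) / len(outcomes)


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score dicts keyed by the same agents.

    +1 means identical rankings, -1 means fully reversed rankings.
    Tied pairs count as neither concordant nor discordant.
    """
    agents = sorted(scores_a)
    concordant = discordant = 0
    for x, y in combinations(agents, 2):
        sign = (scores_a[x] - scores_a[y]) * (scores_b[x] - scores_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(agents) * (len(agents) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical per-vulnerability detection outcomes for one agent.
detection_rate = rate([True, True, False, True])

# Hypothetical detection rates (%) for three agents on two datasets.
benchmark = {"agent-1": 45.0, "agent-2": 40.0, "agent-3": 30.0}
incidents = {"agent-1": 35.0, "agent-2": 50.0, "agent-3": 20.0}

tau = kendall_tau(benchmark, incidents)
print(detection_rate, round(tau, 3))
```

A tau well below 1 between two datasets would indicate that the agent ranking on one does not carry over to the other, which is the kind of instability the summary reports.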

Implications for AI Economics

Overstated automation value: Claims that AI will imminently replace human auditors may inflate expected productivity gains and lead to misallocated investment in fully automated pipelines. Real-world economic benefits are more likely to come from complementary automation (breadth + triage) rather than full substitution.
Comparative performance uncertainty: Instability of agent rankings across configurations means procurement and deployment decisions based on narrow benchmarks are risky. Firms should evaluate agents under their own scaffolds, datasets, and workflows before committing resources.
Importance of human capital: Human auditors provide protocol-specific knowledge, adversarial thinking, and judgment that agents lack. This strengthens the economic case for hybrid human–AI workflows where AI lowers the marginal cost of scanning many contracts but humans handle high-value adversarial tasks.
Market for tooling and scaffolds: Scaffold choice materially affects performance, implying value for third-party, open-source scaffolding tools and integration services that improve reproducibility and robustness—an economic opportunity for tool providers.
Policy and procurement: Contamination in benchmarks can mislead buyers and regulators about capabilities; buyers and policymakers should insist on contamination-free, real-world evaluations when assessing claims of automated security tools.

Assessment

Paper Type: descriptive
Evidence Strength: high. The study improves internal validity relative to prior work by expanding agent configurations, controlling for training-data contamination with a curated set of 22 post-release real-world incidents, and releasing code/data for reproducibility; results are consistent across multiple metrics (detection, exploitation, stability) and show clear scaffold sensitivity. Remaining limitations (small incident sample, domain specificity, evolving models) constrain external generalization but do not undermine the core empirical claims about current agent capabilities in smart-contract audits.
Methods Rigor: high. Robust evaluation design: larger evaluation matrix (26 agent configurations, four model families, three scaffolds), explicit contamination control via postdating incidents, multiple outcome metrics (detection and end-to-end exploitation), and public code/data; the paper also analyzes stability across configurations and scaffolds rather than relying on a single benchmark setup.
Sample: 26 agent configurations spanning four model families and three scaffolding approaches (vendor scaffolds plus at least one open-source scaffold); contamination-free dataset of 22 real-world smart-contract security incidents that postdate all evaluated models; additional evaluations on curated subsets from the original EVMbench; outcome measurement across 110 agent–incident pairs for detection and exploitation tasks; code and data publicly released.
Themes: human_ai_collab, productivity, adoption, org_design
Generalizability:
- Domain-specific: focused on Ethereum/smart-contract security and may not generalize to other AI auditing or software-security tasks.
- Small contamination-free sample: only 22 postdating incidents, limiting statistical power and coverage of possible vulnerability types.
- Model coverage limited: four model families and specific versions evaluated; newer or specialized models may perform differently.
- Scaffold/workflow dependency: performance is sensitive to scaffolding and human-provided context, so organizational integration will affect outcomes.
- Temporal limitation: models and adversarial tactics evolve rapidly, so results reflect capabilities at a fixed point in time.
- Does not measure macroeconomic outcomes: does not directly quantify productivity, wages, or labor-market impacts.

Claims (14)

Claim	Category	Direction	Confidence	Outcome	Details	Score
EVMbench (OpenAI, Paradigm, OtterSec) reported agents detecting up to 45.6% of vulnerabilities and achieving exploitation on 72.2% of a curated subset.	Output Quality	positive	high	vulnerability_detection_rate; exploitation_success_rate (on curated subset)	EVMbench reported up to 45.6% vulnerability detection and 72.2% exploitation on a curated subset (reported benchmark metrics)	0.3
The original EVMbench evaluation was narrow: it evaluated 14 agent configurations and most models were tested only with their vendor-provided scaffold.	Other	negative	high	evaluation_breadth (number_of_agent_configurations; scaffold_variety)	n=14 original EVMbench evaluated 14 agent configurations; most models tested with vendor-provided scaffold	0.3
The original EVMbench had a data contamination risk because it relied on audit-contest data published before every evaluated model's release, which could have been seen during model training.	Other	negative	high	dataset_contamination_risk (potential_training_data_leakage)	original EVMbench posed a dataset contamination risk because it relied on audit-contest data published before evaluated model releases	0.3
This study expanded the evaluation matrix to 26 agent configurations spanning four model families and three scaffolding approaches.	Other	positive	high	evaluation_matrix_size (agent_configurations; model_families; scaffolds)	n=26 this study expanded evaluation to 26 agent configurations across four model families and three scaffolds	0.3
The authors constructed a contamination-free dataset of 22 real-world smart-contract security incidents that postdate every evaluated model's release.	Other	positive	high	contamination_free_dataset_size (22 incidents)	n=22 contamination-free dataset constructed of 22 real-world incidents that postdate every evaluated model's release	0.3
Detection and exploitation rankings are unstable: rankings shift across model configurations, tasks, and datasets, so results are not robust to evaluation choices.	Other	negative	medium	ranking_stability (consistency_of_model_rankings_across_configs_and_datasets)	detection and exploitation rankings are unstable across model configs, tasks, and datasets (results sensitive to evaluation choices)	0.18
On the 22 postdating (contamination-free) incidents, no agent achieved end-to-end exploitation success across all 110 agent–incident pairs evaluated.	Output Quality	negative	high	end_to_end_exploitation_success_rate (per_agent_per_incident)	n=110 no agent achieved end-to-end exploitation success across all 110 agent–incident pairs evaluated	0.3
Agents detected up to 65% of vulnerabilities in some experimental settings.	Output Quality	positive	high	vulnerability_detection_rate (peak_value_reported = ~65%)	agents detected up to ~65% of vulnerabilities in some experimental settings (peak detection rate)	0.3
Choice of scaffold materially affects outcomes: an open-source scaffold outperformed vendor-provided scaffolds by up to approximately 5 percentage points.	Output Quality	mixed	high	performance_difference_across_scaffolds (detection/exploitation_rates_difference_in_percentage_points)	open-source scaffold outperformed vendor scaffolds by up to ≈5 percentage points	0.3
AI agents are useful as breadth tools and for pre-deployment checks but lack the protocol-specific and adversarial reasoning required to replace human auditors; human-in-the-loop workflows are the best use.	Task Allocation	mixed	medium	practical_utility_of_agents (breadth_detection_utility; inability_to_substitute_human_judgment)	agents useful for breadth and pre-deployment checks but lack protocol-specific/adversarial reasoning to replace human auditors; human-in-the-loop workflows recommended	0.18
Claims that AI will imminently replace human auditors are overstated; real-world economic benefits are more likely to come from complementary automation (breadth + triage) rather than full substitution.	Firm Productivity	negative	medium	economic_value_of_automation (qualitative_assessment_of_substitution_vs_complementarity)	claims of imminent replacement of human auditors overstated; economic benefits likely from complementary automation (breadth + triage) not full substitution	0.18
Instability of agent rankings across configurations makes procurement and deployment decisions based on narrow benchmarks risky; firms should evaluate agents under their own scaffolds, datasets, and workflows before committing.	Organizational Efficiency	negative	medium	robustness_of_benchmark_based_procurement (risk_of_misleading_benchmarks)	instability of rankings across configurations makes procurement decisions based on narrow benchmarks risky; firms should evaluate agents under their own scaffolds/datasets/workflows	0.18
Scaffold choice creates an economic opportunity for third-party tooling and open-source scaffolding because scaffold effects materially affect performance and reproducibility.	Market Structure	positive	medium	market_opportunity_for_scaffold_tools (qualitative_based_on_performance_impact)	scaffold choice materially affects performance -> economic opportunity for third-party/open-source scaffolding tools	0.18
The authors released their code and data for reproducibility at https://github.com/blocksecteam/ReEVMBench/.	Other	null_result	high	code_and_data_availability (repository_link)	authors released code and data at https://github.com/blocksecteam/ReEVMBench/	0.3