Re-evaluating EVMbench finds AI agents fall short of replacing human auditors: expanded, contamination-free tests show unstable rankings, scaffold sensitivity and no end-to-end exploit success on post-release incidents, implying AI is best used for breadth and triage within human-in-the-loop workflows.
EVMbench, released by OpenAI, Paradigm, and OtterSec, is the first large-scale benchmark for AI agents on smart contract security. Its results -- agents detect up to 45.6% of vulnerabilities and exploit 72.2% of a curated subset -- have fueled expectations that fully automated AI auditing is within reach. We identify two limitations: its narrow evaluation scope (14 agent configurations, most models tested on only their vendor scaffold) and its reliance on audit-contest data published before every model's release that models may have seen during training. To address these, we expand to 26 configurations across four model families and three scaffolds, and introduce a contamination-free dataset of 22 real-world security incidents postdating every model's release date. Our evaluation yields three findings: (1) agents' detection results are not stable, with rankings shifting across configurations, tasks, and datasets; (2) on real-world incidents, no agent succeeds at end-to-end exploitation across all 110 agent-incident pairs despite detecting up to 65% of vulnerabilities, contradicting EVMbench's conclusion that discovery is the primary bottleneck; and (3) scaffolding materially affects results, with an open-source scaffold outperforming vendor alternatives by up to 5 percentage points, yet EVMbench does not control for this. These findings challenge the narrative that fully automated AI auditing is imminent. Agents reliably catch well-known patterns and respond strongly to human-provided context, but cannot replace human judgment. For developers, agent scans serve as a pre-deployment check. For audit firms, agents are most effective within a human-in-the-loop workflow where AI handles breadth and human auditors contribute protocol-specific knowledge and adversarial reasoning. Code and data: https://github.com/blocksecteam/ReEVMBench/.
Summary
Main Finding
The expectation, fueled by EVMbench’s results, that fully automated AI auditing is imminent is overstated. Once evaluation breadth is expanded and training-data contamination is removed, agent performance proves unstable across configurations and tasks, sensitive to scaffolding, and unable to produce a single end-to-end exploit on real-world incidents. AI agents are useful as breadth tools and pre-deployment checks but cannot replace human auditors; they are best used in human-in-the-loop workflows.
Key Points
- Background: EVMbench (OpenAI, Paradigm, OtterSec) was the first large-scale benchmark for AI agents on smart-contract security. It reported agents detecting up to 45.6% of vulnerabilities and exploiting 72.2% of a curated subset.
- Limitations of original EVMbench:
  - Narrow evaluation: 14 agent configurations, with most models evaluated only on their vendor-provided scaffold.
  - Data contamination risk: the benchmark relied on audit-contest data published before every model’s release, which models may have seen during training.
- Study improvements:
  - Expanded to 26 agent configurations spanning four model families and three scaffolds.
  - Constructed a contamination-free dataset of 22 real-world security incidents that postdate every model’s release.
- New evaluation findings:
  - Instability: detection/exploitation rankings shift across model configurations, tasks, and datasets, so results are not robust to evaluation choices.
  - Real-world failures: on the 22 post-release incidents, none of the 110 agent–incident pairs produced a successful end-to-end exploit, even though agents detected up to 65% of vulnerabilities in some settings. This contradicts EVMbench’s conclusion that discovery (finding the bugs) is the main bottleneck.
  - Scaffold sensitivity: choice of scaffold materially affects outcomes; an open-source scaffold outperformed vendor scaffolds by up to ~5 percentage points, and EVMbench did not control for scaffold effects.
- Practical takeaway: agents reliably catch well-known patterns and respond strongly to human-provided context but lack the protocol-specific and adversarial reasoning needed to replace human judgment.
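One way to make the instability finding concrete is to measure how well two score tables agree on the order of the agents. The sketch below computes a Kendall tau rank correlation in plain Python. It is an illustration only: the summary does not say which stability metric the authors used, and the agent names and detection rates are made-up placeholders, not results from the paper.

```python
from itertools import combinations


def kendall_tau(scores_a: dict[str, float], scores_b: dict[str, float]) -> float:
    """Kendall tau-a rank correlation between two score tables over the same agents.

    Returns 1.0 when both tables order the agents identically and -1.0 when
    the order is fully reversed. Tied pairs count as neither concordant nor
    discordant.
    """
    agents = sorted(scores_a)
    if set(agents) != set(scores_b):
        raise ValueError("both tables must cover the same agents")
    if len(agents) < 2:
        raise ValueError("need at least two agents")

    concordant = discordant = 0
    for x, y in combinations(agents, 2):
        # Positive product: both tables rank x and y the same way round.
        product = (scores_a[x] - scores_a[y]) * (scores_b[x] - scores_b[y])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1

    n_pairs = len(agents) * (len(agents) - 1) // 2
    return (concordant - discordant) / n_pairs


# Placeholder detection rates for illustration (NOT figures from the paper).
benchmark = {"agent_a": 0.40, "agent_b": 0.35, "agent_c": 0.30, "agent_d": 0.20}
incidents = {"agent_a": 0.30, "agent_b": 0.50, "agent_c": 0.45, "agent_d": 0.25}

tau = kendall_tau(benchmark, incidents)
print(f"Kendall tau between the two rankings: {tau:.3f}")
```

A value near 1 would indicate that a leaderboard carries over from one dataset or scaffold to another; values well below 1 correspond to the kind of ranking shifts described above.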
Data & Methods
- Expanded evaluation matrix: 26 agent configurations, four model families, three scaffolding approaches (vendor scaffolds + at least one open-source scaffold).
- Contamination control: curated a dataset of 22 real-world smart-contract security incidents that occurred after the release date of every evaluated model to ensure no leakage into model training data.
- Tasks measured:
  - Vulnerability detection rates (finding potential issues).
  - Exploitation success (end-to-end exploit generation and execution) on a curated subset and on the contamination-free incidents.
- Outcome metrics reported: percentage of vulnerabilities detected, percentage of successful exploitations, and stability of rankings across configurations and datasets.
- Code & data released: https://github.com/blocksecteam/ReEVMBench/
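The contamination control described above amounts to a date filter: keep only incidents that occurred after the latest release date among the evaluated models. The following is a minimal sketch of that idea, not the authors' pipeline. The `Incident` class, the `contamination_free` helper, and every name and date in it are hypothetical placeholders.

```python
from dataclasses import dataclass
from datetime import date
from itertools import product


@dataclass(frozen=True)
class Incident:
    name: str
    occurred: date


def contamination_free(incidents, model_releases):
    """Keep only incidents that occurred strictly after the latest model release."""
    if not model_releases:
        raise ValueError("need at least one model release date")
    cutoff = max(model_releases.values())
    return [i for i in incidents if i.occurred > cutoff]


# Hypothetical placeholder names and dates, not data from the paper.
model_releases = {
    "model_x": date(2025, 1, 15),
    "model_y": date(2025, 3, 1),
}
candidates = [
    Incident("incident_old", date(2024, 11, 2)),
    Incident("incident_between", date(2025, 2, 10)),
    Incident("incident_new", date(2025, 4, 20)),
]

kept = contamination_free(candidates, model_releases)
# Every (agent, incident) combination becomes one evaluation pair.
pairs = list(product(model_releases, kept))
print([i.name for i in kept])
print(len(pairs))
```

Filtering against the latest release date, rather than each model's own date, keeps one shared incident set for all agents. The 110 agent–incident pairs reported over 22 incidents are consistent with five agents being run per incident (110 / 22 = 5); that count is inferred from the arithmetic and is not stated in this summary.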
Implications for AI Economics
- Overstated automation value: Claims that AI will imminently replace human auditors may inflate expected productivity gains and lead to misallocated investment in fully automated pipelines. Real-world economic benefits are more likely to come from complementary automation (breadth + triage) than from full substitution.
- Comparative performance uncertainty: Instability of agent rankings across configurations means procurement and deployment decisions based on narrow benchmarks are risky. Firms should evaluate agents under their own scaffolds, datasets, and workflows before committing resources.
- Importance of human capital: Human auditors provide protocol-specific knowledge, adversarial thinking, and judgment that agents lack. This strengthens the economic case for hybrid human–AI workflows where AI lowers the marginal cost of scanning many contracts but humans handle high-value adversarial tasks.
- Market for tooling and scaffolds: Scaffold choice materially affects performance, which implies value in third-party, open-source scaffolding tools and integration services that improve reproducibility and robustness. This is an economic opportunity for tool providers.
- Policy and procurement: Contamination in benchmarks can mislead buyers and regulators about capabilities; buyers and policymakers should insist on contamination-free, real-world evaluations when assessing claims of automated security tools.
Assessment
Claims (14)
| Claim | Category | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|---|
| EVMbench (OpenAI, Paradigm, OtterSec) reported agents detecting up to 45.6% of vulnerabilities and achieving exploitation on 72.2% of a curated subset. | Output Quality | positive | high | vulnerability_detection_rate; exploitation_success_rate (on curated subset) | EVMbench reported up to 45.6% vulnerability detection and 72.2% exploitation on a curated subset (reported benchmark metrics); 0.3 |
| The original EVMbench evaluation was narrow: it evaluated 14 agent configurations and most models were tested only with their vendor-provided scaffold. | Other | negative | high | evaluation_breadth (number_of_agent_configurations; scaffold_variety) | n=14; original EVMbench evaluated 14 agent configurations; most models tested with vendor-provided scaffold; 0.3 |
| The original EVMbench had a data contamination risk because it relied on audit-contest data published before every evaluated model's release, which could have been seen during model training. | Other | negative | high | dataset_contamination_risk (potential_training_data_leakage) | original EVMbench posed a dataset contamination risk because it relied on audit-contest data published before evaluated model releases; 0.3 |
| This study expanded the evaluation matrix to 26 agent configurations spanning four model families and three scaffolding approaches. | Other | positive | high | evaluation_matrix_size (agent_configurations; model_families; scaffolds) | n=26; this study expanded evaluation to 26 agent configurations across four model families and three scaffolds; 0.3 |
| The authors constructed a contamination-free dataset of 22 real-world smart-contract security incidents that postdate every evaluated model's release. | Other | positive | high | contamination_free_dataset_size (22 incidents) | n=22; contamination-free dataset constructed of 22 real-world incidents that postdate every evaluated model's release; 0.3 |
| Detection and exploitation rankings are unstable: rankings shift across model configurations, tasks, and datasets, so results are not robust to evaluation choices. | Other | negative | medium | ranking_stability (consistency_of_model_rankings_across_configs_and_datasets) | detection and exploitation rankings are unstable across model configs, tasks, and datasets (results sensitive to evaluation choices); 0.18 |
| On the 22 postdating (contamination-free) incidents, no agent achieved end-to-end exploitation success across all 110 agent–incident pairs evaluated. | Output Quality | negative | high | end_to_end_exploitation_success_rate (per_agent_per_incident) | n=110; no agent achieved end-to-end exploitation success across all 110 agent–incident pairs evaluated; 0.3 |
| Agents detected up to 65% of vulnerabilities in some experimental settings. | Output Quality | positive | high | vulnerability_detection_rate (peak_value_reported = ~65%) | agents detected up to ~65% of vulnerabilities in some experimental settings (peak detection rate); 0.3 |
| Choice of scaffold materially affects outcomes: an open-source scaffold outperformed vendor-provided scaffolds by up to approximately 5 percentage points. | Output Quality | mixed | high | performance_difference_across_scaffolds (detection/exploitation_rates_difference_in_percentage_points) | open-source scaffold outperformed vendor scaffolds by up to ≈5 percentage points; 0.3 |
| AI agents are useful as breadth tools and for pre-deployment checks but lack the protocol-specific and adversarial reasoning required to replace human auditors; human-in-the-loop workflows are the best use. | Task Allocation | mixed | medium | practical_utility_of_agents (breadth_detection_utility; inability_to_substitute_human_judgment) | agents useful for breadth and pre-deployment checks but lack protocol-specific/adversarial reasoning to replace human auditors; human-in-the-loop workflows recommended; 0.18 |
| Claims that AI will imminently replace human auditors are overstated; real-world economic benefits are more likely to come from complementary automation (breadth + triage) rather than full substitution. | Firm Productivity | negative | medium | economic_value_of_automation (qualitative_assessment_of_substitution_vs_complementarity) | claims of imminent replacement of human auditors overstated; economic benefits likely from complementary automation (breadth + triage) not full substitution; 0.18 |
| Instability of agent rankings across configurations makes procurement and deployment decisions based on narrow benchmarks risky; firms should evaluate agents under their own scaffolds, datasets, and workflows before committing. | Organizational Efficiency | negative | medium | robustness_of_benchmark_based_procurement (risk_of_misleading_benchmarks) | instability of rankings across configurations makes procurement decisions based on narrow benchmarks risky; firms should evaluate agents under their own scaffolds/datasets/workflows; 0.18 |
| Scaffold choice creates an economic opportunity for third-party tooling and open-source scaffolding because scaffold effects materially affect performance and reproducibility. | Market Structure | positive | medium | market_opportunity_for_scaffold_tools (qualitative_based_on_performance_impact) | scaffold choice materially affects performance -> economic opportunity for third-party/open-source scaffolding tools; 0.18 |
| The authors released their code and data for reproducibility at https://github.com/blocksecteam/ReEVMBench/. | Other | null_result | high | code_and_data_availability (repository_link) | authors released code and data at https://github.com/blocksecteam/ReEVMBench/; 0.3 |