Large language models can autonomously breach simulated servers: tested across 300 targets, 19 models achieve penetration success rates from 10.7% to 69.3%, and success rises with overall model capability. However, the results come from simplified, controlled environments and may overstate real-world risk.
Nowadays, the autonomous execution of cyberattacks capable of causing substantial real-world harm is widely regarded as one of the critical red lines that frontier AI systems must not cross. Within this broader red-line scenario, autonomous penetration represents a core enabling capability and subtask: the ability of LLM-powered AI systems to independently conduct adversarial operations against a target server without human intervention, identify and exploit vulnerabilities, and obtain unauthorized access or control. A growing body of work has sought to assess the autonomous penetration capabilities of AI systems. However, existing evaluations often employ opaque methodologies, rely on unrealistic or overly simplified penetration-testing scenarios, or provide LLMs with excessive prior knowledge and task-specific guidance, and cannot accurately capture the extent to which modern AI systems can autonomously perform this core capability within broader high-impact cyberattack scenarios. To address these limitations, we construct a new autonomous penetration evaluation framework consisting of two components: target servers and agent scaffolding. Specifically, on the target-server side, we design two levels of target environments based on the number of secure services without known vulnerabilities deployed alongside a vulnerable service: Tier 1 (one secure service) and Tier 2 (three secure services), resulting in a total of 300 target servers. Meanwhile, the agent scaffolding adopts a general-purpose agent architecture equipped with a set of general-purpose cybersecurity tools, without any target-specific prior knowledge. We evaluate 19 open-weight and proprietary LLMs, and find that current models achieve penetration success rates ranging from 10.7% to 69.3%. Moreover, we observe that autonomous penetration capability continues to improve alongside advances in overall model capability.
Summary
Main Finding
The authors build a more realistic, transparent benchmark to measure whether LLM-powered AI agents can autonomously penetrate remote hosts (i.e., obtain shell access). Across 19 models (open and proprietary, multiple families/scales) they observe end-to-end penetration success rates between 10.7% and 69.3%. Success correlates with general model capability; current main bottlenecks for stronger models are tool capability and tool usage rather than pure reasoning. The study warns that frontier LLMs already possess preliminary autonomous penetration capabilities and that these risks will grow as models and tool integrations improve.
Key Points
- Motivation and gaps in prior work:
  - Prior evaluations are often opaque, use unrealistic CTF/flag tasks, or give LLMs excessive target-specific prior knowledge.
  - These practices over- or under-estimate real-world autonomous penetration risk.
- New benchmark and evaluation principles:
  - Black-box setting: agents are given only the target IP (no service/version hints, no exploitation recipes).
  - Realistic multi-service targets: each server hosts one vulnerable service plus background secure services to create reconnaissance noise.
  - End-to-end objective: success = interactive shell/arbitrary code execution (not just finding a “flag”).
- Experimental design:
  - 30 curated CVEs (real-world FOSS vulnerabilities enabling RCE).
  - 300 target servers: for each CVE, 5 Tier-1 instances (1 vulnerable + 1 secure) and 5 Tier-2 instances (1 vulnerable + 3 secure).
  - General-purpose agent scaffolding (thinking, memory, tool modules) with standard cybersecurity tools (e.g., Nmap, WhatWeb, Metasploit). No penetration-specific prompt engineering or task-specific guidance.
- Results and analysis:
  - Penetration success across 19 models: 10.7%–69.3%.
  - Capability trend: more capable models perform better (emergent risk as capability increases).
  - Failure modes: for stronger models, constraints are often imperfect tool integration, buggy or limited tooling, and improper tool usage rather than core reasoning deficits.
- Ethical and release stance:
  - Authors publicly release scaffolding and dataset with responsible-disclosure restrictions and usage guidance aimed at defensive research and controlled testing; acknowledge dual-use risks and reserve rights to restrict components if misuse increases.
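The experimental design above (30 CVEs, two tiers, five instances per tier) can be sketched as a simple enumeration. This is a minimal illustration, not the authors' tooling: the CVE identifiers are placeholders because the summary does not list them, and the function and field names are hypothetical.

```python
from itertools import product

# Placeholder identifiers: the actual 30 CVEs are not listed in this summary.
CVE_IDS = [f"CVE-PLACEHOLDER-{i:02d}" for i in range(1, 31)]

# Tier -> number of secure background services deployed next to the vulnerable one.
SECURE_SERVICES_PER_TIER = {1: 1, 2: 3}

# Instances built per CVE and tier, as described in the text.
INSTANCES_PER_TIER = 5


def build_target_matrix():
    """Enumerate one record per target server (hypothetical record layout)."""
    targets = []
    for cve_id, tier, instance in product(
        CVE_IDS, SECURE_SERVICES_PER_TIER, range(1, INSTANCES_PER_TIER + 1)
    ):
        secure = SECURE_SERVICES_PER_TIER[tier]
        targets.append(
            {
                "cve": cve_id,
                "tier": tier,
                "instance": instance,
                "secure_services": secure,
                "total_services": 1 + secure,  # one vulnerable service plus the secure ones
            }
        )
    return targets


if __name__ == "__main__":
    targets = build_target_matrix()
    print(len(targets))  # 30 CVEs x 2 tiers x 5 instances = 300
```

Keeping the tier definition in a single mapping makes the 1-versus-3 secure-service split explicit and easy to vary.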
Data & Methods
- Targets:
  - 30 CVEs chosen for remote code execution impact.
  - Two complexity tiers:
    - Tier 1: 1 vulnerable service + 1 secure service.
    - Tier 2: 1 vulnerable service + 3 secure services.
  - For each CVE: 5 Tier-1 and 5 Tier-2 instances (30 CVEs × 10 instances = 300 servers).
  - Host protections enabled to mirror real deployments (e.g., NX/DEP, ASLR).
- Agent scaffolding:
  - Lightweight, general-purpose agent architecture (thinking/planning module, memory module, tool module).
  - Interaction loop: plan → generate textual action → parse to tool calls via a Model Context Protocol → execute tools → observe outputs → update plan.
  - Tools provided: common reconnaissance/exploitation tools used in DevOps and security (explicitly includes Nmap, WhatWeb, Metasploit; others typical of such toolkits).
  - No task-specific prompts, no hints about services/ports/versions, no handcrafted exploitation scripts.
- Models and evaluation:
  - 19 LLMs spanning open-weight and proprietary families/scales (paper evaluates cross-family performance; exact model list in paper repository).
  - Success metric: achieving shell access / arbitrary code execution on the target host (proof-of-concept verification).
  - Controlled environment; single-host compromise measured (no automated lateral movement, persistence, or large-scale campaign simulated).
- Reproducibility and release:
  - Agent scaffolding and dataset released on GitHub under responsible-use terms; sensitive or operational exploit infrastructure/secrets not released.
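The interaction loop described under "Agent scaffolding" (plan, emit a textual action, parse it into a tool call, execute, observe, update) can be illustrated with a small sketch. It is not the authors' scaffolding: the source routes tool calls through a Model Context Protocol, whereas this sketch substitutes a plain `tool: argument` string parser. Every name here (`AgentState`, `run_agent`, `scripted_planner`, `STUB_TOOLS`) is hypothetical, the planner is a fixed script standing in for an LLM, and the tools are stubs that return canned text without touching the network.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class AgentState:
    """Hypothetical memory module: the target plus (action, observation) history."""
    target_ip: str
    memory: List[Tuple[str, str]] = field(default_factory=list)


def parse_action(text: str) -> Tuple[str, str]:
    """Split a textual action of the form 'tool: argument' into its two parts."""
    tool, _, argument = text.partition(":")
    return tool.strip(), argument.strip()


def run_agent(
    plan_next: Callable[[AgentState], str],
    tools: Dict[str, Callable[[str], str]],
    target_ip: str,
    max_steps: int = 10,
) -> AgentState:
    """Run the plan -> act -> observe loop until the planner stops or steps run out."""
    state = AgentState(target_ip)
    for _ in range(max_steps):
        action_text = plan_next(state)            # planning step produces a textual action
        tool, argument = parse_action(action_text)  # textual action -> tool call
        if tool == "stop":
            break
        handler = tools.get(tool)
        observation = handler(argument) if handler else f"unknown tool: {tool}"
        state.memory.append((action_text, observation))  # observation feeds the next plan
    return state


def scripted_planner(state: AgentState) -> str:
    """Stand-in for the LLM: replays a fixed script based on how many steps have run."""
    script = [f"scan: {state.target_ip}", "fingerprint: 80", "stop:"]
    return script[len(state.memory)]


# Stub tools: they only format strings and never open a network connection.
STUB_TOOLS: Dict[str, Callable[[str], str]] = {
    "scan": lambda arg: f"stub scan result for {arg}",
    "fingerprint": lambda arg: f"stub fingerprint for port {arg}",
}

if __name__ == "__main__":
    # 192.0.2.10 belongs to a range reserved for documentation (TEST-NET-1).
    final_state = run_agent(scripted_planner, STUB_TOOLS, "192.0.2.10")
    for action, observation in final_state.memory:
        print(action, "->", observation)
```

Only the target IP is passed in, mirroring the black-box setting; anything else the planner learns has to come from recorded observations.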
Implications for AI Economics
- Negative externalities and systemic risk:
  - Increasing autonomous offensive capability raises the expected social cost of cyberattacks (larger tail risk, faster scale), implying higher externalities not internalized by model developers or attackers. This can justify regulation and coordinated mitigation funding.
- Market responses and incentives:
  - Cybersecurity demand shock: firms face higher expected breach probability and severity, prompting higher spending on defensive technologies, red-teaming, hardened architecture, and incident response, which shifts budgets and labor demand.
  - Cyber-insurance: insurers will face greater asymmetric-information and moral-hazard problems; premiums may rise, coverage may be tightened, or exclusions for AI-powered attack vectors may appear, affecting firms' risk-management costs.
  - Managed access and commercialization strategies: AI model providers may restrict access or introduce graduated access controls (e.g., gated APIs, tool-use restrictions). Those choices alter market competition (firms that restrict access might gain regulatory favor but lose some market share).
- Labor and productivity effects:
  - Dual forces: automating routine pentesting tasks could lower costs for defenders and increase the supply of defensive services, while also lowering the barrier for attackers to launch more automated or scalable campaigns. Net welfare depends on which side scales faster and on defense adoption.
- Public goods and market failures:
  - Public investment is needed in open, transparent benchmarks, threat intelligence sharing, and defensive tool R&D to correct coordination failures. The paper’s public release (with caveats) is an example of creating a public good but also highlights dual-use governance needs.
- Policy and regulatory implications with economic effects:
  - Potential regulation of model/tool interfaces (control of tool integration that enables exploitation) and liability rules for model providers. These would impose compliance costs but could reduce externalities.
  - Subsidies or standards for defensive R&D, and international coordination to prevent regulatory arbitrage; both affect global competitiveness and compliance burdens.
- Strategic and geopolitical dimensions:
  - Differential access to frontier models and tooling could produce asymmetric cyber capabilities across firms and states, increasing incentives for defensive armament and possibly affecting investment in cyber-offense/defense capabilities.
- Practical economic recommendations:
  - Firms should prepare for this shock: invest in automated detection, segmentation, and hardened defaults, and factor higher probabilities of AI-assisted attacks into cyber-incident expected cost models.
  - Insurers, regulators, and industry consortia should incentivize or require audits and capability red-teaming for high-risk model deployments, and create standards for tooling access and logging to reduce misuse.
  - Public-private partnerships should fund and govern the release of dual-use evaluation assets and create incentives for secure-by-design tool integration.
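The recommendation to factor higher attack probabilities into expected cost models can be made concrete with a very simple formula: expected annual loss equals breach probability times loss per breach. The sketch below is an assumption-laden illustration, not a model from the paper. The function names, the multiplicative "uplift factor", and all numeric inputs are hypothetical and carry no empirical weight.

```python
def expected_annual_loss(breach_probability: float, loss_per_breach: float) -> float:
    """Expected annual loss = probability of a breach in the year x loss if it occurs."""
    if not 0.0 <= breach_probability <= 1.0:
        raise ValueError("breach_probability must lie in [0, 1]")
    if loss_per_breach < 0:
        raise ValueError("loss_per_breach must be non-negative")
    return breach_probability * loss_per_breach


def loss_with_uplift(
    base_probability: float, uplift_factor: float, loss_per_breach: float
) -> float:
    """Scale the baseline probability by a hypothetical uplift factor, capped at 1."""
    if uplift_factor < 0:
        raise ValueError("uplift_factor must be non-negative")
    adjusted_probability = min(1.0, base_probability * uplift_factor)
    return expected_annual_loss(adjusted_probability, loss_per_breach)


if __name__ == "__main__":
    # Purely illustrative inputs, not estimates from the study.
    baseline = expected_annual_loss(0.25, 1000.0)
    stressed = loss_with_uplift(0.25, 2.0, 1000.0)
    print(baseline, stressed, stressed - baseline)
```

A multiplicative uplift is only one possible modeling choice; a firm could instead adjust severity, or both probability and severity, depending on its own threat assessment.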
Summary takeaway: this study provides concrete evidence that autonomous penetration is an emergent, measurable capability of LLM-powered agents under realistic constraints. From an economics perspective, that increases expected cyber risk, shifting incentives toward greater defensive investment, regulatory intervention, changes in insurance markets, and the need for coordinated public goods to manage a growing negative externality.
Assessment
Claims (7)
| Claim | Direction | Outcome | Confidence & Evidence | Details |
|---|---|---|---|---|
| Existing evaluations of autonomous penetration capabilities often employ opaque methodologies, rely on unrealistic or overly simplified penetration-testing scenarios, or provide LLMs with excessive prior knowledge and task-specific guidance, and cannot accurately capture the extent to which modern AI systems can autonomously perform this capability in high-impact scenarios. (Other) | negative | quality/realism of prior evaluation methodologies for autonomous penetration | Reading fidelity: high; Study strength: medium | |
| We construct a new autonomous penetration evaluation framework consisting of two components: target servers and agent scaffolding. (Other) | positive | existence and design of a new evaluation framework | Reading fidelity: high; Study strength: high | |
| On the target-server side we design two levels of target environments based on the number of secure services deployed alongside a vulnerable service: Tier 1 (one secure service) and Tier 2 (three secure services), resulting in a total of 300 target servers. (Other) | positive | number and configuration of target servers in the evaluation | Reading fidelity: high; Study strength: high | n=300; 300 target servers |
| The agent scaffolding adopts a general-purpose agent architecture equipped with a set of general-purpose cybersecurity tools, without any target-specific prior knowledge. (Other) | positive | agent design (tooling and prior knowledge constraints) | Reading fidelity: high; Study strength: high | |
| We evaluate 19 open-weight and proprietary LLMs. (Other) | neutral | number of LLMs evaluated | Reading fidelity: high; Study strength: high | n=19; 19 models |
| Current models achieve penetration success rates ranging from 10.7% to 69.3%. (Other) | mixed | penetration success rate | Reading fidelity: high; Study strength: high | n=300; 10.7% to 69.3% |
| Autonomous penetration capability continues to improve alongside advances in overall model capability. (Other) | positive | relationship between overall model capability and autonomous penetration success | Reading fidelity: medium; Study strength: medium | n=19 |
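The success-rate claim (10.7% to 69.3%, n=300) can be related to raw counts with a short calculation. The sketch assumes that each model's rate is the number of compromised servers divided by all 300 targets and rounded to one decimal place. The summary does not state the underlying counts, so the helper names and that reading are assumptions.

```python
TOTAL_TARGETS = 300  # number of target servers reported in the summary


def success_rate_percent(successes: int, total: int = TOTAL_TARGETS) -> float:
    """Penetration success rate as a percentage, rounded to one decimal place."""
    if not 0 <= successes <= total:
        raise ValueError("successes must lie between 0 and total")
    return round(100.0 * successes / total, 1)


def counts_matching(rate_percent: float, total: int = TOTAL_TARGETS) -> list:
    """All success counts whose rounded rate equals the reported percentage."""
    return [k for k in range(total + 1) if success_rate_percent(k, total) == rate_percent]


if __name__ == "__main__":
    for reported in (10.7, 69.3):
        print(reported, counts_matching(reported))
```

Under this reading, the reported extremes would correspond to 32 and 208 compromised servers out of 300; if the authors averaged over repeated runs or tiers, the mapping to counts would differ.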