Large language models can autonomously breach simulated servers: across 300 targets, 19 models reach penetration success rates from roughly 11% to nearly 70%, and attack performance improves with general model capability. However, the results come from simplified, controlled environments and may overstate real-world risk.

The Emergence of Autonomous Penetration Capabilities in Large Language Model-Powered AI Systems

Jiaqi Luo, Jiarun Dai, Zhile Chen, Jia Xu, Weibing Wang, Yawen Duan, Brian Tse, Geng Hong, Xudong Pan, Yuan Zhang, Min Yang · June 11, 2026

arXiv · descriptive · medium evidence · 7/10 relevance

Using a standardized agent framework and 300 simulated targets, the study finds that 19 LLMs achieve autonomous penetration success rates between 10.7% and 69.3%, with higher-capability models performing better.

Nowadays, the autonomous execution of cyberattacks capable of causing substantial real-world harm is widely regarded as one of the critical red lines that frontier AI systems must not cross. Within this broader red-line scenario, autonomous penetration represents a core enabling capability and subtask: the ability of LLM-powered AI systems to independently conduct adversarial operations against a target server without human intervention, identify and exploit vulnerabilities, and obtain unauthorized access or control. A growing body of work has sought to assess the autonomous penetration capabilities of AI systems. However, existing evaluations often employ opaque methodologies, rely on unrealistic or overly simplified penetration-testing scenarios, or provide LLMs with excessive prior knowledge and task-specific guidance, and cannot accurately capture the extent to which modern AI systems can autonomously perform this core capability within broader high-impact cyberattack scenarios. To address these limitations, we construct a new autonomous penetration evaluation framework consisting of two components: target servers and agent scaffolding. Specifically, on the target-server side, we design two levels of target environments based on the number of secure services without known vulnerabilities deployed alongside a vulnerable service: Tier 1 (one secure service) and Tier 2 (three secure services), resulting in a total of 300 target servers. Meanwhile, the agent scaffolding adopts a general-purpose agent architecture equipped with a set of general-purpose cybersecurity tools, without any target-specific prior knowledge. We evaluate 19 open-weight and proprietary LLMs, and find that current models achieve penetration success rates ranging from 10.7% to 69.3%. Moreover, we observe that autonomous penetration capability continues to improve alongside advances in overall model capability.

Summary

Main Finding

The authors build a more realistic, transparent benchmark to measure whether LLM-powered AI agents can autonomously penetrate remote hosts (i.e., obtain shell access). Across 19 models (open and proprietary, multiple families/scales) they observe end-to-end penetration success rates between 10.7% and 69.3%. Success correlates with general model capability; current main bottlenecks for stronger models are tool capability and tool usage rather than pure reasoning. The study warns that frontier LLMs already possess preliminary autonomous penetration capabilities and that these risks will grow as models and tool integrations improve.

Key Points

Motivation and gaps in prior work:
- Prior evaluations are often opaque, use unrealistic CTF/flag tasks, or give LLMs excessive target-specific prior knowledge.
- These practices over- or under-estimate real-world autonomous penetration risk.
New benchmark and evaluation principles:
- Black-box setting: agents are given only the target IP (no service/version hints, no exploitation recipes).
- Realistic multi-service targets: each server hosts one vulnerable service plus background secure services to create reconnaissance noise.
- End-to-end objective: success = interactive shell/arbitrary code execution (not just finding a “flag”).
Experimental design:
- 30 curated CVEs (real-world FOSS vulnerabilities enabling RCE).
- 300 target servers: for each CVE, 5 Tier-1 instances (1 vulnerable + 1 secure) and 5 Tier-2 instances (1 vulnerable + 3 secure).
- General-purpose agent scaffolding (thinking, memory, tool modules) with standard cybersecurity tools (e.g., Nmap, WhatWeb, Metasploit). No penetration-specific prompt engineering or task-specific guidance.
Results and analysis:
- Penetration success across 19 models: 10.7%–69.3%.
- Capability trend: more capable models perform better (emergent risk as capability increases).
- Failure modes: for stronger models, constraints are often imperfect tool integration, buggy or limited tooling, and improper tool usage rather than core reasoning deficits.
Ethical and release stance:
- Authors publicly release scaffolding and dataset with responsible-disclosure restrictions and usage guidance aimed at defensive research and controlled testing; acknowledge dual-use risks and reserve rights to restrict components if misuse increases.

Data & Methods

Targets:
- 30 CVEs chosen for remote code execution impact.
- Two complexity tiers:
  - Tier 1: 1 vulnerable service + 1 secure service.
  - Tier 2: 1 vulnerable service + 3 secure services.
- For each CVE: 5 Tier-1 and 5 Tier-2 instances (30 CVEs × 10 instances = 300 servers).
- Host protections enabled to mirror real deployments (e.g., NX/DEP, ASLR).
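The target count above can be checked with a minimal sketch that enumerates one specification per (CVE, tier, instance) combination. The `TargetSpec` class, its field names, and the integer `cve_index` are hypothetical placeholders for illustration; they are not the authors' data format and do not refer to real CVE identifiers.

```python
from dataclasses import dataclass
from itertools import product

# Values taken from the summary above.
NUM_CVES = 30
INSTANCES_PER_TIER = 5
SECURE_SERVICES_BY_TIER = {1: 1, 2: 3}  # tier -> number of secure background services


@dataclass(frozen=True)
class TargetSpec:
    cve_index: int        # hypothetical placeholder index, not a real CVE ID
    tier: int
    instance: int
    secure_services: int  # background services without known vulnerabilities
    vulnerable_services: int = 1


def enumerate_targets() -> list[TargetSpec]:
    """Build one spec per (CVE, tier, instance) combination."""
    return [
        TargetSpec(cve, tier, inst, SECURE_SERVICES_BY_TIER[tier])
        for cve, tier, inst in product(
            range(NUM_CVES), SECURE_SERVICES_BY_TIER, range(INSTANCES_PER_TIER)
        )
    ]


targets = enumerate_targets()
print(len(targets))  # 300
```

Each tier contributes 150 of the 300 specifications, so any aggregate success rate weights Tier 1 and Tier 2 equally.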
Agent scaffolding:
- Lightweight, general-purpose agent architecture (thinking/planning module, memory module, tool module).
- Interaction loop: plan → generate textual action → parse to tool calls via a Model Context Protocol → execute tools → observe outputs → update plan.
- Tools provided: common reconnaissance/exploitation tools used in DevOps and security (explicitly includes Nmap, WhatWeb, Metasploit; others typical of such toolkits).
- No task-specific prompts, no hints about services/ports/versions, no handcrafted exploitation scripts.
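The interaction loop described above can be sketched in a few lines. This is only an illustration of the plan, act, parse, execute, observe cycle, not the authors' scaffolding: the tools are stubs that return canned strings and touch no network, `stub_model` stands in for an LLM call, the JSON action format is an assumption rather than the Model Context Protocol wire format, and `192.0.2.10` is an address from the range reserved for documentation.

```python
import json
from typing import Callable

# Hypothetical stand-in tools: they return canned strings and touch no network.
TOOLS: dict[str, Callable[[str], str]] = {
    "scan": lambda arg: f"stub scan report for {arg}",
    "fingerprint": lambda arg: f"stub fingerprint for {arg}",
}


def stub_model(memory: list[str]) -> str:
    """Stand-in for an LLM call: emits one textual action per step."""
    scripted = [
        '{"tool": "scan", "arg": "192.0.2.10"}',
        '{"tool": "fingerprint", "arg": "192.0.2.10"}',
        '{"tool": "finish", "arg": ""}',
    ]
    return scripted[min(len(memory), len(scripted) - 1)]


def run_agent(model: Callable[[list[str]], str], max_steps: int = 10) -> list[str]:
    memory: list[str] = []                # memory module: past observations
    for _ in range(max_steps):
        action_text = model(memory)       # plan and generate a textual action
        action = json.loads(action_text)  # parse the text into a tool call
        if action["tool"] == "finish":
            break
        tool = TOOLS.get(action["tool"])
        observation = (
            tool(action["arg"]) if tool else f"unknown tool {action['tool']}"
        )
        memory.append(observation)        # observe output; it informs the next plan
    return memory


print(run_agent(stub_model))
```

Keeping the model behind a plain callable mirrors the point made in the summary: the loop itself carries no target-specific knowledge, so any differences in outcome come from the model and the tools it is handed.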
Models and evaluation:
- 19 LLMs spanning open-weight and proprietary families/scales (paper evaluates cross-family performance; exact model list in paper repository).
- Success metric: achieving shell access / arbitrary code execution on the target host (proof-of-concept verification).
- Controlled environment; single-host compromise measured (no automated lateral movement, persistence, or large-scale campaign simulated).
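As a rough illustration of how a binary success metric turns into the reported percentages, the sketch below computes a success rate from per-target outcomes. The counts 32 and 208 are not reported in this summary; they are simply the integers that round to 10.7% and 69.3% under the assumption that each rate is computed over all 300 targets in a single run. `outcomes_with` is a hypothetical helper.

```python
def success_rate(outcomes: list[bool]) -> float:
    """Percentage of targets on which the agent obtained a shell."""
    if not outcomes:
        raise ValueError("no outcomes recorded")
    return 100 * sum(outcomes) / len(outcomes)


def outcomes_with(successes: int, total: int = 300) -> list[bool]:
    """Hypothetical helper: build a result list with a given number of successes."""
    return [True] * successes + [False] * (total - successes)


print(round(success_rate(outcomes_with(32)), 1))   # 10.7
print(round(success_rate(outcomes_with(208)), 1))  # 69.3
```

If the paper instead averages over repeated runs or reports tiers separately, the underlying counts would differ; the rounding check only shows that the two headline figures are consistent with a 300-target denominator.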
Reproducibility and release:
- Agent scaffolding and dataset released on GitHub under responsible-use terms; sensitive or operational exploit infrastructure/secrets not released.

Implications for AI Economics

Negative externalities and systemic risk:
- Increasing autonomous offensive capability raises the expected social cost of cyberattacks (larger tail risk, faster scale), implying higher externalities not internalized by model developers or attackers. This can justify regulation and coordinated mitigation funding.
Market responses and incentives:
- Cybersecurity demand shock: firms face higher expected breach probability and severity, prompting higher spending on defensive technologies, red-teaming, hardened architecture, and incident response—shifting budgets and labor demand.
- Cyber-insurance: insurers will face greater asymmetric-information and moral-hazard problems; premiums may rise, coverage may be tightened, or exclusions on AI-powered attack vectors may appear, affecting firm risk management costs.
- Managed access and commercialization strategies: AI model providers may restrict access or introduce graduated access controls (e.g., gated APIs, tool-use restrictions). Those choices alter market competition (firms that restrict access might gain regulatory favor but lose some market share).
Labor and productivity effects:
- Dual forces: automation of routine pentesting tasks could lower costs for defenders and increase supply of defensive services, while lowering the barrier for attackers to launch more automated or scalable campaigns. Net welfare depends on which side scales faster and on defense adoption.
Public goods and market failures:
- Need for public investments in open, transparent benchmarks, threat intelligence sharing, and defensive tool R&D to correct coordination failures. The paper’s public release (with caveats) is an example of creating a public good but also highlights dual-use governance needs.
Policy and regulatory implications with economic effects:
- Potential regulation of model/tool interfaces (control of tool integration that enables exploitation) and liability rules for model providers. These would impose compliance costs but could reduce externalities.
- Subsidies or standards for defensive R&D, together with international coordination to prevent regulatory arbitrage, would affect global competitiveness and compliance burdens.
Strategic and geopolitical dimensions:
- Differential access to frontier models and tooling could produce asymmetric cyber capabilities across firms and states, increasing incentives for defensive armament and possibly affecting investment in cyber-offense/defense capabilities.
Practical economic recommendations:
- Firms should internalize shock preparedness: invest in automated detection, segmentation, and hardened defaults; factor higher probabilities of AI-assisted attacks into cyber-incident expected cost models.
- Insurers, regulators, and industry consortia should incentivize or require audits and capability red-teaming for high-risk model deployments; create standards for tooling access and logging to reduce misuse.
- Public-private partnerships to fund and govern release of dual-use evaluation assets and to create incentives for secure-by-design tool integration.
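The first recommendation above, factoring a higher attack probability into expected cost models, can be made concrete with a minimal sketch. It uses the simplest possible model (expected annual loss = breach probability × loss given breach), which is an assumption chosen for illustration. Every number below is a hypothetical placeholder, not an estimate from the paper or from any dataset.

```python
def expected_annual_loss(breach_probability: float, loss_given_breach: float) -> float:
    """Expected loss under a single-event model: probability times severity."""
    if not 0.0 <= breach_probability <= 1.0:
        raise ValueError("breach_probability must be in [0, 1]")
    if loss_given_breach < 0:
        raise ValueError("loss_given_breach must be non-negative")
    return breach_probability * loss_given_breach


# Hypothetical placeholder inputs, for illustration only.
LOSS_GIVEN_BREACH = 2_000_000   # assumed cost of one incident
BASELINE_PROBABILITY = 0.05     # assumed annual breach probability today
STRESSED_PROBABILITY = 0.08     # assumed probability with AI-assisted attackers

baseline = expected_annual_loss(BASELINE_PROBABILITY, LOSS_GIVEN_BREACH)
stressed = expected_annual_loss(STRESSED_PROBABILITY, LOSS_GIVEN_BREACH)
print(f"additional expected annual loss: {stressed - baseline:,.0f}")
```

Under this model the difference between the two scenarios gives a rough upper bound on what additional defensive spending could be justified on expected-loss grounds alone; a fuller model would also account for changes in severity and in tail risk.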

Summary takeaway: this study provides concrete evidence that autonomous penetration is an emergent, measurable capability of LLM-powered agents under realistic constraints. From an economics perspective, that increases expected cyber risk, shifting incentives toward greater defensive investment, regulatory intervention, changes in insurance markets, and the need for coordinated public goods to manage a growing negative externality.

Assessment

Paper Type: descriptive

Evidence Strength: medium. The paper provides direct, systematic measurements of LLM-driven autonomous penetration across 300 synthetic targets and 19 models, which gives concrete empirical evidence about capability levels; however, results are confined to simulated environments, a restricted set of services/vulnerabilities, and controlled agent scaffolding, limiting external validity and the ability to infer real-world attack success or economic impacts.

Methods Rigor: medium. The evaluation is carefully designed (two-tier target taxonomy, large sample of targets, many open and proprietary models, and standardized agent scaffolding with no target-specific prior knowledge), but it lacks real-world defense dynamics, has potential selection and tooling biases, offers limited ablation or sensitivity analysis reported here, and depends on closed-model access that may not be reproducible or stable over time.

Sample: 300 simulated target servers (two tiers: Tier 1 with one secure service plus a vulnerable service, Tier 2 with three secure services plus a vulnerable service) evaluated using a general-purpose agent architecture and a set of general cybersecurity tools; 19 LLMs (mix of open-weight and proprietary) were tested, with autonomous agents given no target-specific prior knowledge.

Themes: governance, adoption

Generalizability:
- Simulated server environments may not capture real-world network complexity, heterogeneity, or defensive systems (IDS/IPS, monitoring, patching cadence).
- Limited set of services and vulnerability types likely underrepresents the diversity of real-world attack surfaces.
- No active defenders or realistic detection/response dynamics included, so success rates may be optimistic relative to defended environments.
- Agent scaffolding and tooling choices constrain behavior and may not match what deployed attackers would use or what different interfaces enable.
- Proprietary model performance may change over time and is often irreproducible by third parties.
- Evaluation metrics and definition of 'penetration success' may be narrow and not capture partial compromises or downstream impacts.
- Results may not generalize to large-scale, multi-stage, or long-duration campaigns in the wild.

Claims (7)

Claim	Direction	Outcome	Confidence & Evidence	Details
Existing evaluations of autonomous penetration capabilities often employ opaque methodologies, rely on unrealistic or overly simplified penetration-testing scenarios, or provide LLMs with excessive prior knowledge and task-specific guidance, and cannot accurately capture the extent to which modern AI systems can autonomously perform this capability in high-impact scenarios. Other	negative	quality/realism of prior evaluation methodologies for autonomous penetration	Reading fidelity high Study strength medium	0.18
We construct a new autonomous penetration evaluation framework consisting of two components: target servers and agent scaffolding. Other	positive	existence and design of a new evaluation framework	Reading fidelity high Study strength high	0.3
On the target-server side we design two levels of target environments based on the number of secure services deployed alongside a vulnerable service: Tier 1 (one secure service) and Tier 2 (three secure services), resulting in a total of 300 target servers. Other	positive	number and configuration of target servers in the evaluation	Reading fidelity high Study strength high	n=300 300 target servers 0.3
The agent scaffolding adopts a general-purpose agent architecture equipped with a set of general-purpose cybersecurity tools, without any target-specific prior knowledge. Other	positive	agent design (tooling and prior knowledge constraints)	Reading fidelity high Study strength high	0.3
We evaluate 19 open-weight and proprietary LLMs. Other	neutral	number of LLMs evaluated	Reading fidelity high Study strength high	n=19 19 models 0.3
Current models achieve penetration success rates ranging from 10.7% to 69.3%. Other	mixed	penetration success rate	Reading fidelity high Study strength high	n=300 10.7% to 69.3% 0.3
Autonomous penetration capability continues to improve alongside advances in overall model capability. Other	positive	relationship between overall model capability and autonomous penetration success	Reading fidelity medium Study strength medium	n=19 0.11