In a live capture‑the‑flag contest, autonomous agents that self‑direct prompting and tool use beat most human teams, exposing human prompting and context specification as the key bottleneck to effective human–AI collaboration.
Capture-the-Flag (CTF) competitions are increasingly becoming a testbed for evaluating AI capabilities on security tasks, thanks to their controlled environments and objective success criteria. Existing evaluations have focused on how well AI solves CTF challenges in isolation from human CTF players. As AI usage grows in both academic and industrial settings, human players are just as likely to collaborate with AI agents to solve challenges. This possibility exposes a key knowledge gap: how do humans perceive AI CTF assistance; when assistance is provided, how do they collaborate, and is the collaboration effective with respect to human performance; and how do AI-assisted humans compare to fully autonomous AI agents on the same challenges? We address this gap with the first empirical study of AI assistance in a live, onsite CTF. In a study with 41 participants, we qualitatively examine (i) how participants' perception, trust, and expectations shift before versus after hands-on AI use, and (ii) how participants collaborate with an instrumented AI agent. We also (iii) benchmark four autonomous AI agents on the same fresh challenge set to compare outcomes with human teams and analyze agent trajectories. We find that, as the competition progresses, teams increasingly delegate larger subtasks to the AI, granting it more agency. Interestingly, CTF challenge solve rates are often constrained not by the model's reasoning capabilities but by the human players: ineffective prompting and poor context specification become the primary bottleneck. Remarkably, autonomous agents that self-direct their prompting and tool use bypass this bottleneck and outperform most human teams, coming in second overall in the competition. We conclude with implications for the future design of CTF challenges and for building effective human-in-the-loop AI systems for security.
Summary
Main Finding
In a live onsite Capture-the-Flag (CTF) study (41 participants), human teams increasingly delegated larger subtasks to an instrumented AI during the event, but human limits—ineffective prompting and poor context specification—became the primary bottleneck to solving challenges. Autonomous agents that self-directed their prompting and tool use circumvented this human bottleneck and outperformed most human teams, placing second overall on the fresh challenge set.
Key Points
- Study scope: first empirical investigation of human–AI assistance in a live CTF setting and direct comparison to autonomous AI agents on the same fresh challenges.
- Human behavior:
  - As the competition progressed, teams delegated more agency to the AI and relied on it for larger subtasks.
  - Participants' perceptions, trust, and expectations shifted after hands-on AI use (observed qualitatively).
  - Primary failure mode: poor human prompting and insufficient context specification rather than model reasoning capability.
- Autonomous agents:
  - Four autonomous agents were benchmarked on the same challenge set.
  - Self-directed agents (those that autonomously generated prompts and selected tools) bypassed human prompting failures and outperformed most human teams.
  - One autonomous agent finished second overall in the competition.
- Practical takeaway: effectiveness of human–AI teaming in security tasks depends heavily on human ability to formulate context-rich prompts; autonomous workflows can be more effective when they can self-manage prompting and tool selection.
Data & Methods
- Setting: live, onsite CTF competition with a fresh set of challenges.
- Participants: 41 human competitors, organized into teams, working hands-on with an instrumented AI assistant.
- Qualitative measures: observations and analysis of how perceptions, trust, and collaboration patterns changed pre- vs. post-AI usage, plus instrumentation of AI interactions to study delegation and prompting behavior.
- Benchmarking: four autonomous AI agents evaluated on the same challenge set; agent trajectories, tool use, and success rates were analyzed and compared to human teams.
- Outcome metrics: challenge solving rates, agent/team rankings (notably, an autonomous agent finished second), qualitative analysis of failure modes (human prompting vs. model reasoning).
Implications for AI Economics
- Task automation and substitution: security tasks in controlled, objective environments (like CTFs) are susceptible to partial or full automation when agents can self-direct prompting and tool use—implying potential displacement pressure on routine cybersecurity roles.
- Value of human capital shifts: the economic value will move toward skills in prompt engineering, context formulation, and supervisory verification rather than raw manual exploitation skills. Training and hiring will likely emphasize these meta-skills.
- Complementarity vs. competition: human–AI complementarities exist but are fragile—poor human inputs can negate model advantages. Markets that optimize for human–AI complementarity should invest in tooling and workflows that reduce prompting friction (structured interfaces, better shared context).
- Product design and competition metrics: CTFs and similar benchmark tasks (and real-world procurement) should be redesigned to evaluate human–AI teams and autonomous agents separately and to measure autonomy and prompt robustness explicitly.
- Productivity and pricing: autonomous agents that reliably self-manage could lower the marginal cost of many security operations, compressing prices for commodity security services while increasing demand for higher-level oversight, tool integration, and adversarial evaluation services.
- Policy and labor market adjustment: policymakers and organizations should anticipate a reallocation of labor toward verification, governance, and advanced problem-design roles; support for reskilling in prompt engineering and human–AI interaction will have high returns.
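One way to reduce the prompting friction discussed above is a structured interface that forces context-rich prompts by construction. The sketch below is an assumption-laden illustration, not tooling from the study: every field name is hypothetical, and the idea is only that filling in challenge facts (category, goal, artifacts, prior attempts) yields a well-specified prompt by default.

```python
from dataclasses import dataclass, field

@dataclass
class ChallengeContext:
    category: str                       # e.g. "pwn", "web", "crypto"
    goal: str                           # what a solve looks like
    artifacts: list[str] = field(default_factory=list)  # files, URLs, ports
    attempts: list[str] = field(default_factory=list)   # what was already tried

    def to_prompt(self, question: str) -> str:
        """Render the structured context plus the player's question."""
        lines = [
            f"Challenge category: {self.category}",
            f"Goal: {self.goal}",
            "Artifacts: " + (", ".join(self.artifacts) or "none provided"),
            "Already tried: " + ("; ".join(self.attempts) or "nothing yet"),
            f"Question: {question}",
        ]
        return "\n".join(lines)

ctx = ChallengeContext(
    category="web",
    goal="read /flag.txt on the target server",
    artifacts=["http://target:8080", "app.py source"],
    attempts=["basic SQL injection on the login form"],
)
prompt = ctx.to_prompt("What should I probe next?")
```

A form like this trades flexibility for completeness: the player cannot forget to state the goal or prior attempts, which were exactly the omissions behind the observed failure mode.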
Assessment
Claims (10)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| In a live onsite Capture-the-Flag (CTF) study (41 participants), human teams increasingly delegated larger subtasks to an instrumented AI as the competition progressed. (Task Allocation) | positive | high | degree/size of subtasks delegated to the AI over time (delegation rate and subtask size) | n=41; teams delegated increasingly larger subtasks to AI over time; 0.18 |
| Human limits (ineffective prompting and poor context specification) became the primary bottleneck to solving challenges, rather than model reasoning capability. (Decision Quality) | negative | medium | attribution of challenge-solve failures to prompting/context issues versus model reasoning errors | n=41; failure-mode attribution: prompting/context issues were primary bottleneck rather than model reasoning; 0.11 |
| Four autonomous agents were benchmarked on the same fresh CTF challenge set alongside human teams. (Team Performance) | null_result | high | agent performance metrics on the fresh CTF challenge set (success rates, trajectories, tool use) | benchmarking of four autonomous agents on the same fresh CTF challenge set; 0.18 |
| Self-directed autonomous agents (those that autonomously generated prompts and selected tools) bypassed human prompting failures and outperformed most human teams on the challenge set. (Team Performance) | positive | medium | challenge solving rates and relative rankings of self-directed agents versus human teams | n=41; self-directed agents outperformed most human teams (bypassing prompting failures); 0.11 |
| One autonomous agent finished second overall on the fresh challenge set. (Team Performance) | positive | high | overall ranking (2nd place) on the challenge set | n=41; one autonomous agent finished 2nd overall; 0.18 |
| As the competition progressed, teams relied more on the AI for larger subtasks (increasing delegation and reliance). (Task Allocation) | positive | high | frequency of delegation and average scope/complexity of delegated tasks over competition time | n=41; increasing frequency and scope of delegated tasks as competition progressed; 0.18 |
| Participants' perceptions, trust, and expectations about the AI shifted after hands-on use (qualitative observation). (Worker Satisfaction) | mixed | medium | qualitative changes in participant perceptions, trust, and expectations after hands-on AI usage | n=41; qualitative shifts in participant perceptions, trust, and expectations after hands-on use; 0.11 |
| Primary failure mode for human–AI teams was poor human prompting/insufficient context specification rather than deficiencies in the model's reasoning. (Error Rate) | negative | medium | proportion of failed attempts attributable to human prompting/context issues vs. model reasoning failures | n=41; primary failure mode attributed to poor prompting/insufficient context vs. model reasoning; 0.11 |
| The study is the first empirical investigation of human–AI assistance in a live CTF setting with a direct comparison to autonomous AI agents on the same fresh challenges. (Other) | null_result | medium | novelty claim (existence of prior comparable live CTF human–AI empirical studies and direct agent comparisons) | n=41; author-stated novelty claim: first live CTF human–AI empirical investigation with agent comparisons; 0.11 |
| Practical takeaway: effectiveness of human–AI teaming in security tasks depends heavily on human ability to formulate context-rich prompts; autonomous workflows that self-manage prompting and tool selection can be more effective. (Team Performance) | mixed | medium | relative effectiveness (challenge solve rates/rankings) conditional on human prompt quality versus agent self-directed prompting | n=41; effectiveness depends on human prompt quality; autonomous self-managing prompting can be more effective; 0.11 |