In a live capture‑the‑flag contest, autonomous agents that self‑direct prompting and tool use beat most human teams, exposing human prompting and context specification as the key bottleneck to effective human–AI collaboration.
Capture-the-Flag (CTF) competitions are increasingly becoming a testbed for evaluating AI capabilities on security tasks, thanks to their controlled environments and objective success criteria. Existing evaluations have focused on how well AI solves CTF challenges in isolation from human CTF players. As AI usage grows in both academic and industrial settings, human players are just as likely to collaborate with AI agents to solve challenges. This possibility exposes a key knowledge gap: how do humans perceive AI CTF assistance; when assistance is provided, how do they collaborate, and is the collaboration effective with respect to human performance; and how do AI-assisted humans compare to fully autonomous AI agents on the same challenges? We address this gap with the first empirical study of AI assistance in a live, onsite CTF. In a study with 41 participants, we qualitatively examine (i) how participants' perception, trust, and expectations shift before versus after hands-on AI use, and (ii) how participants collaborate with an instrumented AI agent. We also (iii) benchmark four autonomous AI agents on the same fresh challenge set to compare outcomes with human teams and analyze agent trajectories. We find that, as the competition progresses, teams increasingly delegate larger subtasks to the AI, granting it more agency. Interestingly, CTF challenge solve rates are often constrained not by the model's reasoning capabilities but by the human players: ineffective prompting and poor context specification become the primary bottleneck. Remarkably, autonomous agents that self-direct their prompting and tool use bypass this bottleneck and outperform most human teams, coming in second overall in the competition. We conclude with implications for the future design of CTF challenges and for building effective human-in-the-loop AI systems for security.
Summary
Main Finding
In a live onsite Capture-the-Flag (CTF) study (41 participants), human teams increasingly delegated larger subtasks to an instrumented AI during the event, but human limits—ineffective prompting and poor context specification—became the primary bottleneck to solving challenges. Autonomous agents that self-directed their prompting and tool use circumvented this human bottleneck and outperformed most human teams, placing second overall on the fresh challenge set.
Key Points
- Study scope: first empirical investigation of human–AI assistance in a live CTF setting and direct comparison to autonomous AI agents on the same fresh challenges.
- Human behavior:
  - As the competition progressed, teams delegated more agency to the AI and relied on it for larger subtasks.
  - Participants' perceptions, trust, and expectations shifted after hands-on AI use (observed qualitatively).
  - Primary failure mode: poor human prompting and insufficient context specification rather than model reasoning capability.
- Autonomous agents:
  - Four autonomous agents were benchmarked on the same challenge set.
  - Self-directed agents (those that autonomously generated prompts and selected tools) bypassed human prompting failures and outperformed most human teams.
  - One autonomous agent finished second overall in the competition.
- Practical takeaway: effectiveness of human–AI teaming in security tasks depends heavily on human ability to formulate context-rich prompts; autonomous workflows can be more effective when they can self-manage prompting and tool selection.
Data & Methods
- Setting: live, onsite CTF competition with a fresh set of challenges.
- Participants: 41 human competitors, organized into teams, working hands-on with an instrumented AI assistant.
- Qualitative measures: observations and analysis of how perceptions, trust, and collaboration patterns changed pre- vs. post-AI usage, plus instrumentation of AI interactions to study delegation and prompting behavior.
- Benchmarking: four autonomous AI agents evaluated on the same challenge set; agent trajectories, tool use, and success rates were analyzed and compared to human teams.
- Outcome metrics: challenge solving rates, agent/team rankings (notably, an autonomous agent finished second), qualitative analysis of failure modes (human prompting vs. model reasoning).
Implications for AI Economics
- Task automation and substitution: security tasks in controlled, objective environments (like CTFs) are susceptible to partial or full automation when agents can self-direct prompting and tool use—implying potential displacement pressure on routine cybersecurity roles.
- Value of human capital shifts: the economic value will move toward skills in prompt engineering, context formulation, and supervisory verification rather than raw manual exploitation skills. Training and hiring will likely emphasize these meta-skills.
- Complementarity vs. competition: human–AI complementarities exist but are fragile—poor human inputs can negate model advantages. Markets that optimize for human–AI complementarity should invest in tooling and workflows that reduce prompting friction (structured interfaces, better shared context).
- Product design and competition metrics: CTFs and similar benchmark tasks (and real-world procurement) should be redesigned to evaluate human–AI teams and autonomous agents separately and to measure autonomy and prompt robustness explicitly.
- Productivity and pricing: autonomous agents that reliably self-manage could lower the marginal cost of many security operations, compressing prices for commodity security services while increasing demand for higher-level oversight, tool integration, and adversarial evaluation services.
- Policy and labor market adjustment: policymakers and organizations should anticipate a reallocation of labor toward verification, governance, and advanced problem-design roles; support for reskilling in prompt engineering and human–AI interaction will have high returns.
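One way to reduce the prompting friction discussed above is a structured interface that forces context-rich prompts by construction. The sketch below is an assumption-laden illustration, not tooling from the study: every field name is hypothetical, and the idea is only that filling in challenge facts (category, goal, artifacts, prior attempts) yields a well-specified prompt by default.

```python
from dataclasses import dataclass, field

@dataclass
class ChallengeContext:
    category: str                       # e.g. "pwn", "web", "crypto"
    goal: str                           # what a solve looks like
    artifacts: list[str] = field(default_factory=list)  # files, URLs, ports
    attempts: list[str] = field(default_factory=list)   # what was already tried

    def to_prompt(self, question: str) -> str:
        """Render the structured context plus the player's question."""
        lines = [
            f"Challenge category: {self.category}",
            f"Goal: {self.goal}",
            "Artifacts: " + (", ".join(self.artifacts) or "none provided"),
            "Already tried: " + ("; ".join(self.attempts) or "nothing yet"),
            f"Question: {question}",
        ]
        return "\n".join(lines)

ctx = ChallengeContext(
    category="web",
    goal="read /flag.txt on the target server",
    artifacts=["http://target:8080", "app.py source"],
    attempts=["basic SQL injection on the login form"],
)
prompt = ctx.to_prompt("What should I probe next?")
```

A form like this trades flexibility for completeness: the player cannot forget to state the goal or prior attempts, which were exactly the omissions behind the observed failure mode.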
Assessment
Claims (10)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| In a live onsite Capture-the-Flag (CTF) study (41 participants), human teams increasingly delegated larger subtasks to an instrumented AI as the competition progressed. (Task Allocation) | positive | high | degree/size of subtasks delegated to the AI over time (delegation rate and subtask size) | n=41; teams delegated increasingly larger subtasks to AI over time; 0.18 |
| Human limits (ineffective prompting and poor context specification) became the primary bottleneck to solving challenges, rather than model reasoning capability. (Decision Quality) | negative | medium | attribution of challenge-solve failures to prompting/context issues versus model reasoning errors | n=41; failure-mode attribution: prompting/context issues were primary bottleneck rather than model reasoning; 0.11 |
| Four autonomous agents were benchmarked on the same fresh CTF challenge set alongside human teams. (Team Performance) | null_result | high | agent performance metrics on the fresh CTF challenge set (success rates, trajectories, tool use) | benchmarking of four autonomous agents on the same fresh CTF challenge set; 0.18 |
| Self-directed autonomous agents (those that autonomously generated prompts and selected tools) bypassed human prompting failures and outperformed most human teams on the challenge set. (Team Performance) | positive | medium | challenge solving rates and relative rankings of self-directed agents versus human teams | n=41; self-directed agents outperformed most human teams (bypassing prompting failures); 0.11 |
| One autonomous agent finished second overall on the fresh challenge set. (Team Performance) | positive | high | overall ranking (2nd place) on the challenge set | n=41; one autonomous agent finished 2nd overall; 0.18 |
| As the competition progressed, teams relied more on the AI for larger subtasks (increasing delegation and reliance). (Task Allocation) | positive | high | frequency of delegation and average scope/complexity of delegated tasks over competition time | n=41; increasing frequency and scope of delegated tasks as competition progressed; 0.18 |
| Participants' perceptions, trust, and expectations about the AI shifted after hands-on use (qualitative observation). (Worker Satisfaction) | mixed | medium | qualitative changes in participant perceptions, trust, and expectations after hands-on AI usage | n=41; qualitative shifts in participant perceptions, trust, and expectations after hands-on use; 0.11 |
| Primary failure mode for human–AI teams was poor human prompting/insufficient context specification rather than deficiencies in the model's reasoning. (Error Rate) | negative | medium | proportion of failed attempts attributable to human prompting/context issues vs. model reasoning failures | n=41; primary failure mode attributed to poor prompting/insufficient context vs. model reasoning; 0.11 |
| The study is the first empirical investigation of human–AI assistance in a live CTF setting with a direct comparison to autonomous AI agents on the same fresh challenges. (Other) | null_result | medium | novelty claim (existence of prior comparable live CTF human–AI empirical studies and direct agent comparisons) | n=41; author-stated novelty claim: first live CTF human–AI empirical investigation with agent comparisons; 0.11 |
| Practical takeaway: effectiveness of human–AI teaming in security tasks depends heavily on human ability to formulate context-rich prompts; autonomous workflows that self-manage prompting and tool selection can be more effective. (Team Performance) | mixed | medium | relative effectiveness (challenge solve rates/rankings) conditional on human prompt quality versus agent self-directed prompting | n=41; effectiveness depends on human prompt quality; autonomous self-managing prompting can be more effective; 0.11 |