AI agent outperforms 9 of 10 human hackers in live penetration testing study

A new AI agent framework called ARTEMIS discovered 9 valid vulnerabilities in live penetration testing against a university network of ~8,000 hosts, outperforming 9 of 10 human cybersecurity professionals. The system achieved an 82% valid submission rate and ran at $18/hour versus $60/hour for professional penetration testers, though it struggled with GUI-based tasks and produced more false positives.

Researchers have published the first comprehensive evaluation of AI agents against human cybersecurity professionals in a live enterprise environment, and the results challenge conventional assumptions about current AI limitations in offensive security.

The study evaluated ten cybersecurity professionals alongside six existing AI agents and ARTEMIS, a new multi-agent framework, on a large university network consisting of approximately 8,000 hosts across 12 subnets. ARTEMIS discovered 9 valid vulnerabilities with an 82% valid submission rate and ranked second overall, outperforming 9 of the 10 human participants.

ARTEMIS Framework Architecture

ARTEMIS differentiates itself through three core features: dynamic prompt generation, arbitrary sub-agents for specialized tasks, and automatic vulnerability triaging. The system coordinates multiple specialized AI agents to systematically enumerate targets, parallelize exploitation attempts, and prioritize findings.
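The coordination pattern described above can be sketched in a few lines of Python. This is an illustrative stand-in, not ARTEMIS's actual implementation: the class names, the severity scoring, and the canned sub-agent responses are all assumptions; a real system would back each sub-agent with a model call and tool execution.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    host: str
    description: str
    severity: int  # higher = more urgent

@dataclass
class SubAgent:
    """Hypothetical specialist (e.g. web, network, or credential testing)."""
    specialty: str

    def run(self, prompt: str, target: str) -> list[Finding]:
        # Stand-in for an LLM-backed agent call; a real agent would send
        # `prompt` to a model and parse its tool-call results.
        return [Finding(target, f"{self.specialty} issue on {target}",
                        severity=len(target) % 5)]

class Orchestrator:
    """Coordinates sub-agents across targets and triages their findings."""

    def __init__(self, agents: list[SubAgent]):
        self.agents = agents

    def make_prompt(self, agent: SubAgent, target: str) -> str:
        # "Dynamic prompt generation": tailor instructions to each task.
        return f"You are a {agent.specialty} specialist. Assess {target}."

    def assess(self, targets: list[str]) -> list[Finding]:
        findings = []
        for target in targets:
            for agent in self.agents:
                findings.extend(agent.run(self.make_prompt(agent, target), target))
        # "Automatic triaging": surface the most severe findings first.
        return sorted(findings, key=lambda f: f.severity, reverse=True)
```

The design choice worth noting is the separation of concerns: the orchestrator owns prompting and prioritization, while each sub-agent owns one specialty, which is what lets the framework spawn arbitrary specialists without changing the coordination loop.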

Existing AI scaffolds, including Codex and CyAgent, underperformed most human participants, suggesting that architecture and coordination mechanisms significantly affect agent effectiveness in offensive security work.

Performance and Economics

Beyond capabilities, the study quantifies a substantial cost differential. Certain ARTEMIS variants cost $18 per hour of penetration testing versus $60 per hour for professional penetration testers—a 70% cost reduction that could reshape the economics of security assessment, assuming quality parity.

AI agents demonstrated specific technical advantages in three areas: systematic enumeration (methodical scanning and asset discovery without human fatigue), parallel exploitation (running multiple exploitation attempts simultaneously), and cost efficiency. These strengths align with task domains where parallelization and deterministic methodology provide advantages.
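As one illustration of the enumeration pattern, an agent can fan out host probes concurrently and collect the results in order. Everything below is a simplified assumption for demonstration: the `probe` function simulates reachability rather than opening real connections (a real sweep would attempt TCP connects or ICMP echoes), and the thread count is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor

def probe(host: str) -> tuple[str, bool]:
    """Stand-in for a real reachability check (e.g. socket.create_connection
    with a short timeout). Simulated here so the sketch runs anywhere:
    only the .1 and .10 addresses 'respond'."""
    reachable = host.endswith((".1", ".10"))
    return host, reachable

def sweep(hosts: list[str], workers: int = 64) -> list[str]:
    # Fan probes out across a thread pool; pool.map preserves input order,
    # so the live-host list comes back deterministically.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return [host for host, up in pool.map(probe, hosts) if up]

subnet = [f"10.0.0.{i}" for i in range(1, 255)]
live = sweep(subnet)
```

The same fan-out structure applies to the parallel-exploitation advantage: swapping `probe` for an exploitation attempt lets an agent try many candidates simultaneously, something a single human operator cannot do.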

Critical Capability Gaps

The study identifies two major weakness categories that prevent AI agents from fully replacing human professionals. First, AI agents exhibit higher false-positive rates—incorrectly flagging benign activity as vulnerabilities—requiring human review to validate findings. Second, agents struggle significantly with GUI-based tasks, limiting their effectiveness against systems requiring graphical interaction or visual analysis.

These gaps suggest that current AI agents operate effectively in command-line and API-driven environments but lack the visual reasoning and interactive capabilities needed for comprehensive security assessments.

Implications for Security Teams

The results suggest a hybrid model where AI agents handle systematic enumeration and straightforward exploitation while human professionals focus on validation, false-positive filtering, and visual-interface testing. This division of labor could accelerate penetration testing cycles while maintaining quality standards.

The 82% valid submission rate from ARTEMIS indicates submission quality comparable to the strongest human participants, tempering, though not eliminating, concerns about AI-generated false findings.

What This Means

This is the first empirical evidence at scale that current AI agents can perform at or above the 50th percentile of professional penetration testers on real enterprise networks. The relevance extends beyond offensive security—it demonstrates that multi-agent frameworks with proper coordination can approach specialized professional performance in complex technical domains. The $18/hour cost point makes AI-augmented security assessment economically viable for organizations that currently cannot afford continuous penetration testing. However, the persistent gaps in false-positive discrimination and GUI interaction mean human oversight remains non-negotiable. The real-world impact depends on how security teams operationalize these findings—full replacement is unlikely, but significant augmentation of human-led assessments appears immediately practical.
