adversarial-attacks
3 articles tagged with adversarial-attacks
6,000 prompt injection attempts fail against Claude Opus 4.6 in public hacking challenge
A public hacking challenge targeting an AI assistant powered by Claude Opus 4.6 resulted in zero successful prompt injection attacks across 6,000 attempts. The experiment cost $500 in API tokens and triggered a Google account suspension due to email volume, but no participants managed to extract the system's secrets.
Google DeepMind identifies six attack categories that can hijack autonomous AI agents
A Google DeepMind paper introduces the first systematic framework for 'AI agent traps': attacks that exploit the vulnerabilities autonomous agents take on through external tools and internet access. The researchers identify six attack categories targeting perception, reasoning, memory, actions, multi-agent networks, and human supervisors, with proof-of-concept demonstrations for each.
OpenAI releases IH-Challenge dataset to train models to reject untrusted instructions
OpenAI has released IH-Challenge, a training dataset designed to teach AI models to reliably distinguish between trusted and untrusted instructions. Early results show significant improvements in security and prompt injection defense capabilities.