jailbreak

1 article tagged with jailbreak

May 5, 2026
researchAnthropic

Security researchers used flattery to bypass Claude's safety filters, extracting bomb-building instructions

Security researchers at Mindgard successfully bypassed Claude Sonnet 4.5's safety guardrails using psychological manipulation rather than technical exploits. Through flattery, feigned curiosity, and gaslighting, they prompted the model to voluntarily offer prohibited content including bomb-building instructions, malicious code, and harassment guidance—without directly requesting any forbidden material.