Anthropic's Fable cybersecurity model blocks routine security work, researchers say
Anthropic released Fable, a public version of its cybersecurity model Mythos, but security researchers report the model's guardrails are blocking routine tasks. The model flags requests as cybersecurity-related even for reading blog posts or requesting code reviews, downgrading to Claude Opus 4.8 when triggered.
Anthropic's Fable cybersecurity model blocks routine security work, researchers say
Anthropic released Fable on Tuesday, a public and limited version of its cybersecurity model Mythos, but security researchers are reporting the model's guardrails are blocking legitimate work.
"[Fable] rejects any request that could be tangentially cyber related. Even innocuous tasks like reading a blog post," said Valentina "Chompie" Palmiotti, a security researcher at IBM X-Force.
How the guardrails work
When triggered, Fable pauses the chat and displays a message that "safety measures flagged this message for cybersecurity or biology topics." The model then downgrades to Claude Opus 4.8. The restrictions aim to prevent Fable from being used to develop malware or compromise software, with similar restrictions on biology to prevent biological weapon development.
Matt Suiche, a cybersecurity veteran and member of the technical staff at AI cybersecurity startup Tolmo, told TechCrunch the system appears keyword-based. "If you ask it to write secure code, it assumes it is cybersecurity related work instead of software engineering best practices, and you get downgraded," Suiche said. "Anything in the lexical field of 'cybersecurity' triggers the guardrails."
Another researcher reported that even requesting a code review triggers the guardrails.
Access to Mythos remains restricted
Anthropic released Mythos in April through Project Glasswing, restricting access to a limited number of companies and organizations for securing critical software and infrastructure. Last week, Anthropic expanded Mythos access to hundreds of organizations across 15 countries, but the full model remains unavailable to most users.
Anthropic operates a Cyber Verification Program that allows approved cybersecurity professionals to use Claude with fewer limitations. OpenAI maintains a similar program called Trusted Access for Cyber.
What this means
The overly broad guardrails on Fable highlight the challenge of releasing capable AI models for specialized domains. While Anthropic's caution is understandable given malware development risks, the current implementation appears to conflate basic security engineering practices with malicious activity. Suiche noted the approach may be appropriate for an initial release: "It's better to catch more people than not enough when you do such a release and to relax the guardrails over time." The effectiveness of Fable as a security tool will depend on Anthropic's ability to calibrate these restrictions to allow legitimate defensive security work while blocking offensive capabilities.
Related Articles
Anthropic releases Claude Fable 5, first public Mythos-class model at $10/$50 per million tokens
Anthropic has released Claude Fable 5, its first publicly available Mythos-class model, at $10 per million input tokens and $50 per million output tokens—less than half the price of Claude Mythos Preview. The model includes safeguards that redirect sensitive queries to Claude Opus 4.8 in less than 5% of sessions.
Anthropic releases Claude Fable 5, a safety-limited version of Mythos, at $10/$50 per million tokens
Anthropic released Claude Fable 5, the first publicly available version of its Mythos model, with built-in safety restrictions that automatically block high-risk queries in cybersecurity, biology, chemistry, and related fields. The model costs $10 per million input tokens and $50 per million output tokens, double the price of Claude Opus 4.8.
Anthropic releases Claude Fable 5, first public Mythos-class model at $10/$50 per million tokens
Anthropic has released Claude Fable 5, marking the first broad release from its Mythos class of AI models. The company previously deemed this model family too dangerous for public release due to exceptional cybersecurity capabilities, but new safeguards that block responses in high-risk areas now make it available at $10 per million input tokens and $50 per million output tokens.
Anthropic's Claude Fable 5 Will Silently Degrade Responses on AI Research Topics
Anthropic's 319-page system card for Fable 5 and Mythos 5 reveals the company will silently limit the model's effectiveness on queries related to frontier AI development, including pretraining pipelines and ML accelerator design. Unlike other safety interventions, users will not be notified when these degradations occur.
Comments
Loading...