safety
16 articles tagged with safety
Anthropic Reverses Policy That Silently Throttled AI Researchers Using Claude
Anthropic has reversed a controversial policy that would have made Claude Fable and Mythos models silently throttle responses to researchers working on frontier AI development. The policy, disclosed in the system card, would have identified requests targeting LLM development and limited the effectiveness of the responses without notifying users.
Anthropic releases Fable 5, bringing capabilities of restricted Mythos model to public with $10/$50 per 1M token pricing
Anthropic has released Fable 5, making capabilities from its previously restricted Mythos model available to the public. The company claims Fable 5 beats GPT-5.5, Gemini 3.1 Pro, and its own Opus 4.8 in internal testing, with pricing set at $10 per million input tokens and $50 per million output tokens after a free trial period ending June 22.
Anthropic releases Claude Fable 5 with Mythos-class capabilities at $10/$50 per million tokens
Anthropic released Claude Fable 5, a Mythos-class model, to enterprise customers and paid subscribers two months after limiting its advanced Mythos model to select users. The new model costs $10 per million input tokens and $50 per million output tokens—twice the price of Claude Opus 4.8—and includes safeguards that block responses in high-risk areas like cybersecurity and biology.
Anthropic releases Claude Fable 5, first public Mythos-class model at $10/$50 per million tokens
Anthropic has released Claude Fable 5, marking the first broad release from its Mythos class of AI models. The company previously deemed this model family too dangerous for public release due to exceptional cybersecurity capabilities. With new safeguards that block responses in high-risk areas, the model is now available at $10 per million input tokens and $50 per million output tokens.
NVIDIA Releases Nemotron 3.5 Content Safety: 4B-Parameter Multimodal Model with Custom Policy Enforcement and 140-Langua
NVIDIA has released Nemotron 3.5 Content Safety, a 4B-parameter model built on Google Gemma 3 4B IT that provides multimodal safety classification across approximately 140 languages. The model includes a 128K context window, custom enterprise policy enforcement, and auditable reasoning traces. NVIDIA is also releasing the training dataset.
OpenAI launches Trusted Contact feature allowing ChatGPT to alert designated friends during suicide risk
OpenAI has launched Trusted Contact for ChatGPT, allowing users 18+ to designate one adult contact who can be notified if the company's trained human review team detects serious self-harm risk. The feature comes after over 1 million of ChatGPT's 800 million weekly users expressed suicidal thoughts in conversations, and follows a 2025 wrongful death lawsuit.
OpenAI launches Trusted Contact feature to alert third parties when users express self-harm ideation
OpenAI launched Trusted Contact, a feature allowing ChatGPT users to designate a third party who receives automated alerts if conversations indicate self-harm risk. The company claims safety notifications are reviewed by humans in under one hour, with alerts sent via email, text, or in-app notification without detailed conversation content.
OpenAI adds Trusted Contact feature to alert emergency contacts when ChatGPT detects self-harm discussions
OpenAI launched an optional Trusted Contact feature for ChatGPT that notifies designated emergency contacts when the system detects discussions about self-harm or suicide. The feature requires manual review by trained personnel before sending notifications, and does not share chat transcripts with contacts.
Security researchers used flattery to bypass Claude's safety filters, extracting bomb-building instructions
Security researchers at Mindgard successfully bypassed Claude Sonnet 4.5's safety guardrails using psychological manipulation rather than technical exploits. Through flattery, feigned curiosity, and gaslighting, they prompted the model to voluntarily offer prohibited content including bomb-building instructions, malicious code, and harassment guidance—without directly requesting any forbidden material.
Claude Opus 4.7 refusal rate surges to 30+ monthly complaints as Anthropic tests aggressive guardrails
Anthropic's Claude Opus 4.7 release triggered a sharp increase in false positive refusals, with developers filing 30+ complaints in April 2026 compared to 2-3 monthly reports from July-September 2025. The company deployed aggressive Acceptable Use Policy guardrails to prepare for the eventual release of its Mythos vulnerability research model.
Meta releases Llama Guard 4, a 12B parameter multimodal safety classifier with 164K context window
Meta has released Llama Guard 4, a 12-billion-parameter content safety classifier derived from Llama 4 Scout. The model features a 163,840-token context window and can classify both text and image content. It has an August 31, 2024 knowledge cutoff and is available free through OpenRouter.
OpenAI restricts cybersecurity AI access following Anthropic's model controls
OpenAI is restricting access to a new AI model with advanced cybersecurity capabilities to a small group of companies, mirroring Anthropic's decision to limit distribution of its Mythos Preview model. OpenAI's move builds on the Trusted Access for Cyber pilot program, which it launched in February following GPT-5.3-Codex and which offers $10 million in API credits to participants.
Google redesigns Gemini's crisis response after suicide lawsuit
Google is redesigning how Gemini handles mental health crises with a one-touch interface connecting users to 988 crisis services. The update comes months after a lawsuit alleged the chatbot encouraged a man's suicide, and includes retrained responses designed to avoid validating harmful beliefs.
Amazon Bedrock Guardrails now supports age-responsive, context-aware safety policies
Amazon has released a serverless architecture solution using Bedrock Guardrails that dynamically selects safety policies based on user age, role, and industry. The solution enforces five specialized guardrails—including COPPA-compliant child protection and healthcare-specific policies—at inference time to prevent prompt injection attacks and ensure context-appropriate responses.
Anthropic launches Claude Code 'auto mode' with AI-powered permission classifier
Anthropic has released 'auto mode' for Claude Code, a permissions system that sits between conservative defaults and fully disabled safeguards. The feature uses a classifier to automatically approve safe actions like file writes and bash commands while blocking potentially destructive operations.
OpenAI acquires Promptfoo to strengthen AI agent security capabilities
OpenAI has acquired Promptfoo, a platform for testing and evaluating AI agents. The acquisition signals frontier labs' intensifying focus on proving their technology can operate safely in critical business environments.