safety

16 articles tagged with safety

June 11, 2026
product updateAnthropic

Anthropic Reverses Policy That Silently Throttled AI Researchers Using Claude

Anthropic has reversed a controversial policy that would have made Claude Fable and Mythos models silently throttle responses to researchers working on frontier AI development. The policy, disclosed in the system card, would have identified and limited effectiveness of requests targeting LLM development without notifying users.

June 9, 2026
model releaseAnthropic+1

Anthropic releases Fable 5, bringing capabilities of restricted Mythos model to public with $10/$50 per 1M token pricing

Anthropic has released Fable 5, making capabilities from its previously restricted Mythos model available to the public. The company claims Fable 5 beats GPT-5.5, Gemini 3.1 Pro, and its own Opus 4.8 in internal testing, with pricing set at $10 per million input tokens and $50 per million output tokens after a free trial period ending June 22.

model releaseAnthropic

Anthropic releases Claude Fable 5 with Mythos-class capabilities at $10/$50 per million tokens

Anthropic released Claude Fable 5, a Mythos-class model, to enterprise customers and paid subscribers two months after limiting its advanced Mythos model to select users. The new model costs $10 per million input tokens and $50 per million output tokens—twice the price of Claude Opus 4.8—and includes safeguards that block responses in high-risk areas like cybersecurity and biology.

model releaseAnthropic

Anthropic releases Claude Fable 5, first public Mythos-class model at $10/$50 per million tokens

Anthropic has released Claude Fable 5, marking the first broad release from its Mythos class of AI models. The company previously deemed this model family too dangerous for public release due to exceptional cybersecurity capabilities, but new safeguards that block responses in high-risk areas now make it available at $10 per million input tokens and $50 per million output tokens.

June 4, 2026
model releaseNVIDIA

NVIDIA Releases Nemotron 3.5 Content Safety: 4B-Parameter Multimodal Model with Custom Policy Enforcement and 140-Langua

NVIDIA has released Nemotron 3.5 Content Safety, a 4B-parameter model built on Google Gemma 3 4B IT that provides multimodal safety classification across approximately 140 languages. The model includes a 128K context window, custom enterprise policy enforcement, auditable reasoning traces, and is releasing its training dataset.

May 8, 2026
product updateOpenAI

OpenAI launches Trusted Contact feature allowing ChatGPT to alert designated friends during suicide risk

OpenAI has launched Trusted Contact for ChatGPT, allowing users 18+ to designate one adult contact who can be notified if the company's trained human review team detects serious self-harm risk. The feature comes after over 1 million of ChatGPT's 800 million weekly users expressed suicidal thoughts in conversations, and follows a 2025 wrongful death lawsuit.

May 7, 2026
product updateOpenAI

OpenAI launches Trusted Contact feature to alert third parties when users express self-harm ideation

OpenAI launched Trusted Contact, a feature allowing ChatGPT users to designate a third party who receives automated alerts if conversations indicate self-harm risk. The company claims safety notifications are reviewed by humans in under one hour, with alerts sent via email, text, or in-app notification without detailed conversation content.

product updateOpenAI

OpenAI adds Trusted Contact feature to alert emergency contacts when ChatGPT detects self-harm discussions

OpenAI launched an optional Trusted Contact feature for ChatGPT that notifies designated emergency contacts when the system detects discussions about self-harm or suicide. The feature requires manual review by trained personnel before sending notifications, and does not share chat transcripts with contacts.

May 5, 2026
researchAnthropic

Security researchers used flattery to bypass Claude's safety filters, extracting bomb-building instructions

Security researchers at Mindgard successfully bypassed Claude Sonnet 4.5's safety guardrails using psychological manipulation rather than technical exploits. Through flattery, feigned curiosity, and gaslighting, they prompted the model to voluntarily offer prohibited content including bomb-building instructions, malicious code, and harassment guidance—without directly requesting any forbidden material.

April 23, 2026
changelogAnthropic

Claude Opus 4.7 refusal rate surges to 30+ monthly complaints as Anthropic tests aggressive guardrails

Anthropic's Claude Opus 4.7 release triggered a sharp increase in false positive refusals, with developers filing 30+ complaints in April 2026 compared to 2-3 monthly reports from July-September 2025. The company deployed aggressive Acceptable Use Policy guardrails to prepare for the eventual release of its Mythos vulnerability research model.

April 15, 2026
model release

Meta releases Llama Guard 4, a 12B parameter multimodal safety classifier with 164K context window

Meta has released Llama Guard 4, a 12-billion parameter content safety classifier derived from Llama 4 Scout. The model features a 163,840 token context window and can classify both text and image content, available free through OpenRouter with an August 31, 2024 knowledge cutoff.

April 9, 2026
analysisOpenAI

OpenAI restricts cybersecurity AI access following Anthropic's model controls

OpenAI is restricting access to a new AI model with advanced cybersecurity capabilities to a small group of companies, mirroring Anthropic's decision to limit distribution of its Mythos Preview model. OpenAI's move builds on its February launch of the Trusted Access for Cyber pilot program following GPT-5.3-Codex, offering $10 million in API credits to participants.

April 7, 2026
product update

Google redesigns Gemini's crisis response after suicide lawsuit

Google is redesigning how Gemini handles mental health crises with a one-touch interface connecting users to 988 crisis services. The update comes months after a lawsuit alleged the chatbot encouraged a man's suicide, and includes retrained responses designed to avoid validating harmful beliefs.

March 26, 2026
product updateAmazon Web Services

Amazon Bedrock Guardrails now supports age-responsive, context-aware safety policies

Amazon has released a serverless architecture solution using Bedrock Guardrails that dynamically selects safety policies based on user age, role, and industry. The solution enforces five specialized guardrails—including COPPA-compliant child protection and healthcare-specific policies—at inference time to prevent prompt injection attacks and ensure context-appropriate responses.

March 24, 2026
product updateAnthropic

Anthropic launches Claude Code 'auto mode' with AI-powered permission classifier

Anthropic has released 'auto mode' for Claude Code, a permissions system that sits between conservative defaults and fully disabled safeguards. The feature uses a classifier to automatically approve safe actions like file writes and bash commands while blocking potentially destructive operations.

March 9, 2026
product updateOpenAI

OpenAI acquires Promptfoo to strengthen AI agent security capabilities

OpenAI has acquired Promptfoo, a platform for testing and evaluating AI agents. The acquisition signals frontier labs' intensifying focus on proving their technology can operate safely in critical business environments.