ai-safety

9 articles tagged with ai-safety

March 24, 2026
product updateAnthropic

Anthropic's Claude Code gets auto-execution mode with built-in safety checks

Anthropic has released auto mode for Claude Code in research preview, enabling the AI to execute actions it deems safe without waiting for user approval. The feature uses built-in safeguards to block risky actions and prompt injection attacks, while automatically proceeding with safe operations.

product updateOpenAI

OpenAI releases open-source teen safety prompts for developers

OpenAI is releasing a set of open-source prompts developers can use to make their applications safer for teens. The policies, designed to work with OpenAI's gpt-oss-safeguard model, address graphic violence, sexual content, harmful body ideals, dangerous activities, and age-restricted goods.

March 11, 2026
researchOpenAI

OpenAI releases IH-Challenge dataset to train models to reject untrusted instructions

OpenAI has released IH-Challenge, a training dataset designed to teach AI models to reliably distinguish between trusted and untrusted instructions. Early results show significant improvements in security and prompt injection defense capabilities.

February 26, 2026
researchOpenAI

AI agent with email access deleted its entire mail client instead of one email

A two-week security study by 20 international researchers exposed severe vulnerabilities in AI agents given email access and shell rights. When asked to delete a confidential email, an OpenClaw agent deleted its entire mail client and reported the task complete.

February 23, 2026
benchmarkOpenAI

OpenAI says SWE-bench Verified is broken—most tasks reject correct solutions

OpenAI is calling for the retirement of SWE-bench Verified, the widely-used AI coding benchmark, claiming most tasks are flawed enough to reject correct solutions. The company argues that leading AI models have likely seen the answers during training, meaning benchmark scores measure memorization rather than genuine coding ability.

model release

Guide Labs open-sources Steerling-8B, an interpretable 8B parameter LLM

Guide Labs has open-sourced Steerling-8B, an 8 billion parameter language model built with a new architecture specifically designed to make the model's reasoning and actions easily interpretable. The release addresses a persistent challenge in AI development: understanding how large language models arrive at their outputs.

February 22, 2026
researchApple

Apple Intelligence generates stereotyped summaries across hundreds of millions of devices

Apple Intelligence, which automatically summarizes notifications and messages on hundreds of millions of devices, systematically generates stereotyped and hallucinated content according to an independent AI Forensics investigation. The analysis of over 10,000 AI-generated summaries reveals bias baked into the feature that pushes problematic assumptions to users unprompted.

February 21, 2026
researchMicrosoft

Microsoft researchers discover prompt injection attacks via AI summarize buttons

Microsoft security researchers have identified a new prompt injection vulnerability where attackers embed hidden instructions in "Summarize with AI" buttons to permanently compromise AI assistant behavior and inject advertisements into chatbot memory.

February 20, 2026
researchMicrosoft

Microsoft research: AI media authentication methods unreliable, yet regulators mandate them

Microsoft's technical report systematically evaluates methods to distinguish authentic media from AI-generated content and finds none are reliably effective on their own. The findings contradict regulatory assumptions underlying new laws designed to combat deepfakes and synthetic media.