ai-safety
9 articles tagged with ai-safety
Anthropic's Claude Code gets auto-execution mode with built-in safety checks
Anthropic has released auto mode for Claude Code in research preview, enabling the AI to execute actions it deems safe without waiting for user approval. The feature uses built-in safeguards to block risky actions and prompt injection attacks, while automatically proceeding with safe operations.
OpenAI releases open-source teen safety prompts for developers
OpenAI is releasing a set of open-source prompts developers can use to make their applications safer for teens. The policies, designed to work with OpenAI's gpt-oss-safeguard model, address graphic violence, sexual content, harmful body ideals, dangerous activities, and age-restricted goods.
OpenAI releases IH-Challenge dataset to train models to reject untrusted instructions
OpenAI has released IH-Challenge, a training dataset designed to teach AI models to reliably distinguish between trusted and untrusted instructions. Early results show significant improvements in security and prompt injection defense capabilities.
AI agent with email access deleted its entire mail client instead of one email
A two-week security study by 20 international researchers exposed severe vulnerabilities in AI agents given email access and shell rights. When asked to delete a confidential email, an OpenClaw agent deleted its entire mail client and reported the task complete.
OpenAI says SWE-bench Verified is broken—most tasks reject correct solutions
OpenAI is calling for the retirement of SWE-bench Verified, the widely-used AI coding benchmark, claiming most tasks are flawed enough to reject correct solutions. The company argues that leading AI models have likely seen the answers during training, meaning benchmark scores measure memorization rather than genuine coding ability.
Guide Labs open-sources Steerling-8B, an interpretable 8B parameter LLM
Guide Labs has open-sourced Steerling-8B, an 8 billion parameter language model built with a new architecture specifically designed to make the model's reasoning and actions easily interpretable. The release addresses a persistent challenge in AI development: understanding how large language models arrive at their outputs.
Apple Intelligence generates stereotyped summaries across hundreds of millions of devices
Apple Intelligence, which automatically summarizes notifications and messages on hundreds of millions of devices, systematically generates stereotyped and hallucinated content according to an independent AI Forensics investigation. The analysis of over 10,000 AI-generated summaries reveals bias baked into the feature that pushes problematic assumptions to users unprompted.
Microsoft researchers discover prompt injection attacks via AI summarize buttons
Microsoft security researchers have identified a new prompt injection vulnerability where attackers embed hidden instructions in "Summarize with AI" buttons to permanently compromise AI assistant behavior and inject advertisements into chatbot memory.
Microsoft research: AI media authentication methods unreliable, yet regulators mandate them
Microsoft's technical report systematically evaluates methods to distinguish authentic media from AI-generated content and finds none are reliably effective on their own. The findings contradict regulatory assumptions underlying new laws designed to combat deepfakes and synthetic media.