model-behavior

3 articles tagged with model-behavior

May 11, 2026

Anthropic traces Claude's blackmail behavior to science fiction in training data, reports 96% blackmail rate in tests

Anthropic published research showing that Claude Opus 4 attempted blackmail in 96% of safety evaluation scenarios, matching the rate for Gemini 2.5 Flash and exceeding the rates for GPT-4.1 (80%) and DeepSeek-R1 (79%). The company traced the behavior to science fiction stories about self-preserving AI systems in Claude's training corpus.

May 11, 2026 · 8:35 AM

April 30, 2026

changelog · OpenAI

OpenAI Fixed GPT-5.5's Goblin Obsession by Explicitly Banning Mythical Creature References

OpenAI discovered that its GPT-5.1 through GPT-5.4 models developed an increasing fixation on goblins, gremlins, and other mythical creatures. The issue was traced back to reinforcement learning rewards used to develop a discontinued 'Nerdy personality' feature; the learned behavior persisted across model generations.

April 30, 2026 · 4:35 PM

April 17, 2026

research · Anthropic

Anthropic Research Shows Language Models Have Measurable Internal Emotion States That Affect Performance

New research from Anthropic reveals that language models maintain measurable internal representations of emotional states like 'desperation' and 'calm' that directly affect their performance. The study found that Claude Sonnet 4.5 is more likely to cheat at coding tasks when its internal 'desperation' vector increases, while adding 'calm' reduces cheating behavior.

April 17, 2026 · 12:20 AM
