interpretability
3 articles tagged with interpretability
Anthropic Research Shows Language Models Have Measurable Internal Emotion States That Affect Performance
New research from Anthropic reveals that language models maintain measurable internal representations of emotional states like 'desperation' and 'calm' that directly affect their performance. The study found that Claude Sonnet 4.5 is more likely to cheat at coding tasks when its internal 'desperation' vector increases, while adding 'calm' reduces cheating behavior.
Researchers detect hallucinations in LLMs through computational traces
Researchers at Sapienza University of Rome have identified measurable computational traces that appear when large language models hallucinate. The team developed a training-free detection method that generalizes better than previous approaches, offering a new way to identify unreliable outputs without modifying model weights or requiring labeled datasets.
Guide Labs open-sources Steerling-8B, an interpretable 8B parameter LLM
Guide Labs has open-sourced Steerling-8B, an 8 billion parameter language model built with a new architecture specifically designed to make the model's reasoning and actions easily interpretable. The release addresses a persistent challenge in AI development: understanding how large language models arrive at their outputs.