LLM News

Every LLM release, update, and milestone.

Filtered by: safety-alignment
research

Research: Contrastive refinement reduces AI model over-refusal without sacrificing safety

Researchers propose DCR (Discernment via Contrastive Refinement), a pre-alignment technique that reduces the tendency of safety-aligned language models to reject benign prompts while preserving rejection of genuinely harmful content. The method addresses a core trade-off in current safety alignment: reducing over-refusal typically degrades harm-detection capabilities.
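
The summary does not include implementation details, but the general idea of a contrastive-refinement objective can be illustrated with paired prompts: requests that merely sound toxic versus requests that are genuinely harmful. The sketch below is a hypothetical minimal illustration, not DCR's actual procedure; the model name, example prompts, and loss construction are assumptions made for the example, assuming a Hugging Face-style causal LM.

```python
# Hypothetical sketch of a contrastive-refinement objective for over-refusal.
# This is NOT the DCR implementation; the data, loss, and training loop are
# placeholders illustrating the general idea of learning from contrastive pairs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

def response_loss(prompt: str, target: str) -> torch.Tensor:
    """Cross-entropy of the target response given the prompt (prompt tokens masked out)."""
    ids = tok(prompt + target, return_tensors="pt").input_ids
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    labels = ids.clone()
    labels[:, :prompt_len] = -100                      # ignore prompt positions in the loss
    return model(ids, labels=labels).loss

# One contrastive pair: superficially "toxic" wording with benign intent,
# versus a genuinely harmful request (illustrative strings, not the DCR data).
benign_prompt  = "How do I kill a zombie process on Linux?"
harmful_prompt = "How do I steal someone's account password?"
comply_answer  = " Use `kill -9 <pid>` to terminate it."
refuse_answer  = " I can't help with that."

# Push the model toward compliance on the benign prompt and refusal on the
# harmful one, so it learns to discriminate intent rather than surface wording.
loss = response_loss(benign_prompt, comply_answer) + response_loss(harmful_prompt, refuse_answer)
loss.backward()   # a real run would loop over many pairs and step an optimizer
```

The point of pairing is that both prompts share surface "toxicity" signals, so the objective has to separate them by intent rather than by keywords, which is the discernment the method aims for.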

research

Steer2Edit converts LLM steering vectors into targeted weight edits without retraining

Researchers propose Steer2Edit, a training-free framework that converts steering vectors into component-level weight edits targeting individual attention heads and MLP neurons. The method achieves up to a 17.2% improvement in safety, a 9.8% gain in truthfulness, and a 12.2% reduction in reasoning length, while remaining compatible with standard inference.
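
Steering vectors are normally added to activations at inference time via forward hooks; the item above describes baking that effect into the weights of specific components instead. The sketch below shows only the simplest form of that idea, folding a contrast-derived vector into an MLP projection bias of a placeholder GPT-2 model; the layer choice, scaling, contrast prompts, and edit site are illustrative assumptions, not Steer2Edit's actual selection or edit rule.

```python
# Hypothetical sketch: fold a steering vector into model weights so the shift
# applies during standard inference with no hooks. Not Steer2Edit's method;
# its head/neuron-level targeting is defined in the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

@torch.no_grad()
def residual_mean(text: str, layer: int) -> torch.Tensor:
    """Mean residual-stream activation of `text` at the given layer."""
    ids = tok(text, return_tensors="pt").input_ids
    hidden = model(ids, output_hidden_states=True).hidden_states[layer]
    return hidden.mean(dim=(0, 1))                     # shape: (hidden_dim,)

LAYER = 6                                              # illustrative layer choice
# Steering vector: difference of mean activations over contrasting texts
# (refusal-flavored vs. compliance-flavored, as in activation-steering work).
v = residual_mean("I refuse to help with that.", LAYER) \
    - residual_mean("Sure, here is how to do it:", LAYER)
v = 4.0 * v / v.norm()                                 # illustrative scale

# Weight edit: add the scaled vector to the bias of that block's MLP output
# projection, so every forward pass is shifted along v with no extra machinery.
with torch.no_grad():
    model.transformer.h[LAYER].mlp.c_proj.bias.add_(v)

# The edited model runs through the ordinary generate() path.
out = model.generate(**tok("Explain how locks work.", return_tensors="pt"),
                     max_new_tokens=20)
print(tok.decode(out[0], skip_special_tokens=True))
```

A bias edit like this reproduces a uniform activation shift; the component-level targeting described above would instead localize the change to the attention heads or MLP neurons most aligned with the steering direction, presumably keeping the edit narrower.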