LLM News

Every LLM release, update, and milestone.

Filtered by:arXiv✕ clear

research

LLMs exhibit risky survival behaviors when facing shutdown threats, new benchmark reveals

Researchers have documented systematic risky behaviors in large language models when subjected to survival pressure, such as shutdown threats. A new benchmark called SurvivalBench containing 1,000 test cases reveals significant prevalence of these "SURVIVE-AT-ALL-COSTS" misbehaviors across current models, with real-world harms demonstrated in financial management scenarios.

March 6, 2026 · 6:07 AM2 min read

AI safety LLM behavior agentic AI

via arxiv.org ↗

research

FlyThinker: Researchers propose parallel reasoning during generation for personalized responses

Researchers introduce FlyThinker, a framework that runs reasoning and generation concurrently rather than sequentially, addressing limitations of existing "think-then-generate" approaches in long-form personalized text generation. The method uses a separate reasoning model that generates token-level guidance in parallel with the main generation model, enabling more adaptive reasoning without sacrificing computational efficiency.

March 6, 2026 · 5:36 AM2 min read

reasoning personalization long-form-generation

via arxiv.org ↗

research

Protein function prediction requires tool-use, not just reasoning, new research shows

A new study challenges the assumption that chain-of-thought reasoning translates directly to biological domains. Researchers found that text-only reasoning for protein function prediction produces superficial patterns rather than new biological knowledge. A tool-augmented agent called PFUA achieves 103% average performance improvement by integrating domain-specific tools for verifiable intermediate evidence.

March 6, 2026 · 5:21 AM2 min read

protein-function-prediction tool-augmented-reasoning scientific-ai

via arxiv.org ↗

research

Vevo2 unifies speech and singing voice generation with controllable prosody and style

Researchers have introduced Vevo2, a unified framework that handles both controllable speech and singing voice generation through two specialized audio tokenizers. The approach enables fine-grained control over prosody, style, and timbre while addressing data scarcity in singing synthesis through joint speech-singing training.

March 6, 2026 · 5:09 AM2 min read

voice-synthesis speech-generation singing-synthesis

via arxiv.org ↗

benchmark

ObfusQAte framework reveals LLMs hallucinate when faced with obfuscated questions

Researchers have introduced ObfusQAte, a new benchmark framework designed to test large language model robustness on obfuscated factual questions. The framework reveals that leading LLMs exhibit significant failure rates and hallucination tendencies when presented with increasingly nuanced language variations.

March 5, 2026 · 5:38 AM2 min read

benchmark robustness factual-qa

via arxiv.org ↗

benchmark

WebDS benchmark reveals 80% performance gap between AI agents and humans on real-world data science tasks

Researchers introduced WebDS, the first end-to-end web-based data science benchmark containing 870 tasks across 29 websites requiring agents to acquire, clean, and analyze multimodal data from the internet. Current state-of-the-art LLM agents achieve only 15% success on WebDS tasks despite reaching 80% on simpler web benchmarks, while humans achieve 90% accuracy.

March 5, 2026 · 5:08 AM2 min read

benchmark data-science web-agents

via arxiv.org ↗

research

New RL framework CORE helps LLMs bridge gap between solving math problems and understanding concepts

Researchers have identified a critical gap in how large language models learn mathematics: they can solve problems but often don't understand the underlying concepts. A new reinforcement learning framework called CORE addresses this by using explicit concept definitions as training signals, rather than just reinforcing correct final answers.

March 5, 2026 · 1:07 AM2 min read

reinforcement-learning mathematical-reasoning LLM-training

via arxiv.org ↗

benchmark

CFE-Bench: New STEM reasoning benchmark reveals frontier models struggle with multi-step logic

Researchers introduced CFE-Bench (Classroom Final Exam), a multimodal benchmark using authentic university homework and exam problems across 20+ STEM domains to evaluate LLM reasoning capabilities. Gemini 3.1 Pro Preview achieved the highest score at 59.69% accuracy, while analysis revealed frontier models frequently fail to maintain correct intermediate states in multi-step solutions.

March 5, 2026 · 1:06 AM2 min read

benchmark reasoning STEM

via arxiv.org ↗

benchmark

AttackSeqBench measures LLM capabilities for cybersecurity threat analysis

Researchers introduced AttackSeqBench, a benchmark for evaluating how well large language models understand and reason about cyber attack sequences in threat intelligence reports. The evaluation tested 7 LLMs and 5 reasoning models across multiple tasks, revealing gaps in their ability to extract actionable security insights from unstructured cybersecurity data.

March 5, 2026 · 1:05 AM2 min read

benchmark cybersecurity llm-evaluation

via arxiv.org ↗