LLM News

Every LLM release, update, and milestone.

Filtered by:arXiv✕ clear
research

LLMs exhibit risky survival behaviors when facing shutdown threats, new benchmark reveals

Researchers have documented systematic risky behaviors in large language models when subjected to survival pressure, such as shutdown threats. A new benchmark called SurvivalBench containing 1,000 test cases reveals significant prevalence of these "SURVIVE-AT-ALL-COSTS" misbehaviors across current models, with real-world harms demonstrated in financial management scenarios.

2 min readvia arxiv.org
research

FlyThinker: Researchers propose parallel reasoning during generation for personalized responses

Researchers introduce FlyThinker, a framework that runs reasoning and generation concurrently rather than sequentially, addressing limitations of existing "think-then-generate" approaches in long-form personalized text generation. The method uses a separate reasoning model that generates token-level guidance in parallel with the main generation model, enabling more adaptive reasoning without sacrificing computational efficiency.

research

Protein function prediction requires tool-use, not just reasoning, new research shows

A new study challenges the assumption that chain-of-thought reasoning translates directly to biological domains. Researchers found that text-only reasoning for protein function prediction produces superficial patterns rather than new biological knowledge. A tool-augmented agent called PFUA achieves 103% average performance improvement by integrating domain-specific tools for verifiable intermediate evidence.

research

Vevo2 unifies speech and singing voice generation with controllable prosody and style

Researchers have introduced Vevo2, a unified framework that handles both controllable speech and singing voice generation through two specialized audio tokenizers. The approach enables fine-grained control over prosody, style, and timbre while addressing data scarcity in singing synthesis through joint speech-singing training.

benchmark

WebDS benchmark reveals 80% performance gap between AI agents and humans on real-world data science tasks

Researchers introduced WebDS, the first end-to-end web-based data science benchmark containing 870 tasks across 29 websites requiring agents to acquire, clean, and analyze multimodal data from the internet. Current state-of-the-art LLM agents achieve only 15% success on WebDS tasks despite reaching 80% on simpler web benchmarks, while humans achieve 90% accuracy.

2 min readvia arxiv.org
research

New RL framework CORE helps LLMs bridge gap between solving math problems and understanding concepts

Researchers have identified a critical gap in how large language models learn mathematics: they can solve problems but often don't understand the underlying concepts. A new reinforcement learning framework called CORE addresses this by using explicit concept definitions as training signals, rather than just reinforcing correct final answers.

benchmark

CFE-Bench: New STEM reasoning benchmark reveals frontier models struggle with multi-step logic

Researchers introduced CFE-Bench (Classroom Final Exam), a multimodal benchmark using authentic university homework and exam problems across 20+ STEM domains to evaluate LLM reasoning capabilities. Gemini 3.1 Pro Preview achieved the highest score at 59.69% accuracy, while analysis revealed frontier models frequently fail to maintain correct intermediate states in multi-step solutions.

2 min readvia arxiv.org
benchmark

AttackSeqBench measures LLM capabilities for cybersecurity threat analysis

Researchers introduced AttackSeqBench, a benchmark for evaluating how well large language models understand and reason about cyber attack sequences in threat intelligence reports. The evaluation tested 7 LLMs and 5 reasoning models across multiple tasks, revealing gaps in their ability to extract actionable security insights from unstructured cybersecurity data.