LLM News

Every LLM release, update, and milestone.

research

Protein function prediction requires tool use, not just reasoning, new research shows

A new study challenges the assumption that chain-of-thought reasoning translates directly to biological domains. Researchers found that text-only reasoning for protein function prediction reproduces superficial patterns rather than producing new biological knowledge. A tool-augmented agent called PFUA achieves a 103% average performance improvement by integrating domain-specific tools that supply verifiable intermediate evidence.
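The core idea, collecting tool evidence before reasoning, can be sketched in a few lines. Everything below is hypothetical: the tool names, prompt format, and `reason` stub are illustrations, not PFUA's actual pipeline.

```python
from typing import Callable

# Minimal sketch of a tool-augmented prediction loop. Tool names and
# prompt wording are stand-ins, not the paper's implementation.
def predict_function(sequence: str,
                     tools: dict[str, Callable[[str], str]],
                     reason: Callable[[str], str]) -> str:
    """Gather verifiable intermediate evidence from domain tools,
    then let the model reason over that evidence rather than over
    text patterns alone."""
    evidence = [f"[{name}] {tool(sequence)}" for name, tool in tools.items()]
    prompt = ("Protein sequence: " + sequence + "\n"
              + "\n".join(evidence)
              + "\nPredict the protein's function from the evidence above.")
    return reason(prompt)

# Stand-in tools and an echoing stub in place of a real LLM call:
tools = {
    "homology_search": lambda s: f"top hit: kinase-like domain ({len(s)} aa query)",
    "motif_scan": lambda s: "ATP-binding motif found" if "GK" in s else "no motif",
}
answer = predict_function("MSGKLT", tools, reason=lambda prompt: prompt)
```

The point of the shape: each tool output is labeled and verifiable on its own, so the final reasoning step is grounded in evidence rather than free-form text generation.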

research

Study shows RL training enables LLMs to abstain on unanswerable temporal questions, outperforming GPT-4o

A new arXiv study presents the first systematic evaluation of training large language models to abstain—refuse to answer—on temporal questions they cannot reliably answer. Using reinforcement learning with abstention-aware rewards, researchers achieved 3.46-5.80% higher accuracy on temporal QA benchmarks than GPT-4o, while improving true positive rates on unanswerable questions by 20%.

2 min read · via arxiv.org
research

Reasoning models fail at theory of mind tasks despite math excellence

A systematic study of nine advanced language models reveals that large reasoning models, designed to excel at step-by-step math and coding, underperform or merely match non-reasoning models on theory of mind tasks. The research identifies a critical weakness: longer reasoning chains actively harm social-reasoning performance, suggesting that current reasoning architectures do not transfer to socio-cognitive skills.

research

LaDiR uses latent diffusion to improve LLM reasoning beyond autoregressive limits

Researchers propose LaDiR, a framework that replaces traditional autoregressive decoding with latent diffusion models to improve LLM reasoning. The approach encodes reasoning steps into compressed latent representations and uses bidirectional attention to refine solutions iteratively, enabling parallel exploration of diverse reasoning paths.

2 min read · via arxiv.org
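A toy illustration of the iterative-refinement idea (not LaDiR itself): instead of committing to tokens one at a time, a whole candidate latent is denoised in repeated passes, so every position can be revised in parallel. The denoiser below is a stand-in, not a trained diffusion model.

```python
# Toy illustration of iterative latent refinement. The denoiser and
# latent codes are stand-ins; LaDiR's actual VAE/diffusion components
# are not reproduced here.
def refine(latent, denoise, steps=20):
    """Apply a denoising update to every latent dimension at once,
    `steps` times; each pass can revise earlier decisions, unlike
    left-to-right autoregressive decoding."""
    for _ in range(steps):
        latent = [denoise(x) for x in latent]
    return latent

# Stand-in denoiser: pulls each coordinate halfway toward the nearest
# integer, mimicking diffusion toward a clean latent code.
denoise = lambda x: x + 0.5 * (round(x) - x)
noisy = [0.2, 0.9, 1.6]
clean = refine(noisy, denoise)
```

The contrast with autoregressive decoding is that no coordinate is ever frozen: every refinement pass sees and can adjust the full solution.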
research

Alignment tuning shrinks LLM output diversity by 2-5x, new research shows

A new arXiv paper introduces the Branching Factor (BF), a metric quantifying output diversity in large language models, and finds that alignment tuning reduces this diversity by 2-5x overall—and up to 10x at early generation positions. The research suggests alignment doesn't fundamentally change model behavior but instead steers outputs toward lower-entropy token sequences already present in base models.
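One natural way to compute an effective branching factor is as the exponential of the mean next-token entropy, a perplexity-style estimate; the paper's exact BF definition may differ from this sketch.

```python
import math

def branching_factor(prob_dists):
    """Effective number of next-token choices: exp of the mean Shannon
    entropy across generation positions (a perplexity-style estimate;
    the paper's exact BF definition may differ)."""
    entropies = [-sum(p * math.log(p) for p in dist if p > 0)
                 for dist in prob_dists]
    return math.exp(sum(entropies) / len(entropies))

# A uniform 4-way distribution has an effective branching factor of 4;
# a peaked (aligned-model-like) distribution collapses toward 1.
diverse = [[0.25] * 4, [0.25] * 4]
peaked = [[0.97, 0.01, 0.01, 0.01], [0.97, 0.01, 0.01, 0.01]]
```

Under this formulation, the paper's finding reads as: alignment tuning sharpens the next-token distributions (lowers their entropy) without introducing behaviors absent from the base model.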


research · ByteDance

ByteDance study: reasoning models know when to stop, but sampling methods force continued thinking

A new ByteDance study reveals that large reasoning models actually know when they have reached the correct answer, but common sampling methods prevent them from stopping. The models keep cross-checking and reformulating even after solving the problem correctly.
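The finding suggests a simple decoding fix: stop once the model's extracted answer stabilizes instead of sampling to the length limit. The sketch below is an assumed mechanism consistent with that finding, not ByteDance's actual method; `step_fn` and `extract_answer` are hypothetical hooks.

```python
# Sketch of answer-stability early stopping (assumed mechanism, not the
# paper's method).
def decode_with_early_stop(step_fn, extract_answer, max_steps=50, patience=2):
    """step_fn(trace) -> next reasoning step (string).
    extract_answer(trace) -> current best answer, or None.
    Stop once the extracted answer is unchanged for `patience`
    consecutive steps, instead of sampling until max length."""
    trace, last, stable = [], None, 0
    for _ in range(max_steps):
        trace.append(step_fn(trace))
        ans = extract_answer(trace)
        if ans is not None and ans == last:
            stable += 1
            if stable >= patience:
                return ans, len(trace)   # model already converged: stop
        else:
            stable = 0
        last = ans
    return last, len(trace)

# Toy model: reaches "42" at step 3 and never changes its mind.
steps = iter(["let x ...", "so x = 42?", "answer: 42",
              "check: 42", "recheck: 42", "again: 42"])
ans, n = decode_with_early_stop(
    lambda t: next(steps),
    lambda t: t[-1].split(":")[-1].strip() if ":" in t[-1] else None)
```

Here decoding halts after five steps rather than running out the budget, cutting exactly the kind of redundant cross-checking the study describes.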