LLM News

Every LLM release, update, and milestone.

Filtered by: reasoning

Video AI models hit reasoning ceiling despite 1000x larger dataset, researchers find

An international research team released the largest video reasoning dataset to date—roughly 1,000 times larger than previous alternatives. Testing reveals that state-of-the-art models including Sora 2 and Veo 3.1 substantially underperform humans on reasoning tasks, suggesting the limitation isn't data scarcity but architectural constraints.

2 min read via the-decoder.com
research

FlyThinker: Researchers propose parallel reasoning during generation for personalized responses

Researchers introduce FlyThinker, a framework that runs reasoning and generation concurrently rather than sequentially, addressing limitations of existing "think-then-generate" approaches in long-form personalized text generation. The method uses a separate reasoning model that generates token-level guidance in parallel with the main generation model, enabling more adaptive reasoning without sacrificing computational efficiency.
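The paper's implementation isn't shown here, but the "think-while-generating" idea can be sketched as below. This is a hedged toy illustration, not FlyThinker's actual API: `reasoner_guidance` and `generator_step` are hypothetical stand-ins for the separate reasoning model and the main generation model.

```python
# Toy sketch of concurrent reasoning and generation (hypothetical stand-ins,
# not FlyThinker's actual code). A separate "reasoner" emits token-level
# guidance at each step, instead of producing a full plan before generation.

def reasoner_guidance(context):
    # Stand-in for the parallel reasoning model: emit a per-step hint.
    return "elaborate" if len(context) % 2 else "introduce"

def generator_step(context, hint):
    # Stand-in for the main generation model, conditioned on this step's hint.
    return f"<{hint}:{len(context)}>"

def generate(n_tokens):
    context = []
    for _ in range(n_tokens):
        hint = reasoner_guidance(context)            # guidance for this step
        context.append(generator_step(context, hint))  # generation uses it now
    return context
```

The key contrast with "think-then-generate" is that guidance is produced per token, alongside generation, rather than as a completed prior phase.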

benchmark

OmniVideoBench: New 1,000-question benchmark exposes gaps in audio-visual AI reasoning

Researchers have introduced OmniVideoBench, a large-scale evaluation framework comprising 1,000 manually verified question-answer pairs derived from 628 videos (ranging from a few seconds to 30 minutes), designed to measure synergistic audio-visual reasoning in multimodal large language models. Testing reveals a significant performance gap between open-source and closed-source MLLMs on genuine cross-modal reasoning tasks.

research

New test-time training method improves LLM reasoning through self-reflection

Researchers propose TTSR, a test-time training framework where a single LLM alternates between Student and Teacher roles to improve its own reasoning. The method generates targeted variant questions based on analyzed failure patterns, showing consistent improvements across mathematical reasoning benchmarks without relying on unreliable pseudo-labels.

research

T2S-Bench benchmark reveals text-to-structure reasoning gap across 45 AI models

Researchers introduced T2S-Bench, a new benchmark with 1,800 samples across 6 scientific domains and 32 structural types, evaluating text-to-structure reasoning in 45 mainstream models. The benchmark reveals substantial capability gaps: average accuracy on multi-hop reasoning tasks is only 52.1%, while Structure-of-Thought (SoT) prompting alone yields +5.7% improvement on average across eight text-processing tasks.

research

Researchers map LLM reasoning as geometric flows in representation space

A new geometric framework models how large language models reason through embedding trajectories that evolve like physical flows. Researchers tested whether LLMs internalize logic beyond surface form by using identical logical propositions with varied semantic content, finding evidence that next-token prediction training leads models to encode logical invariants as higher-order geometry.

research

Knowledge graphs enable smaller models to outperform GPT-5.2 on complex reasoning

A new training approach using knowledge graphs as implicit reward models enables a 14-billion-parameter model to outperform much larger systems like GPT-5.2 and Gemini 3 Pro on complex multi-hop reasoning tasks. Researchers combined supervised fine-tuning and reinforcement learning with knowledge graph path signals to ground models in verifiable domain facts.

2 min readvia arxiv.org
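One plausible reading of "knowledge graph path signals as implicit rewards" is scoring a model's multi-hop chain by how many of its steps are verifiable graph edges. The sketch below is a hedged illustration with invented triples, not the paper's reward model.

```python
# Hypothetical knowledge-graph path reward: reasoning steps that correspond
# to verifiable graph edges score higher. The triples are invented examples.
KG_EDGES = {
    ("aspirin", "inhibits", "COX-1"),
    ("COX-1", "produces", "thromboxane A2"),
}

def path_reward(chain):
    """Fraction of a chain's (head, relation, tail) steps found in the graph."""
    if not chain:
        return 0.0
    verified = sum(1 for step in chain if step in KG_EDGES)
    return verified / len(chain)
```

A reward of this shape grounds the RL signal in verifiable domain facts rather than in the model's own confidence.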
benchmark

CareMedEval benchmark reveals LLMs struggle with biomedical critical appraisal despite reasoning improvements

Researchers introduced CareMedEval, a 534-question benchmark derived from French medical student exams, to evaluate LLMs on biomedical critical appraisal and reasoning tasks. Testing state-of-the-art models reveals none exceed 50% exact match accuracy, with particular weakness in evaluating study limitations and statistical analysis.

research

REFLEX framework gives LLMs metacognitive reasoning for zero-shot robot planning

Researchers present REFLEX, a framework that equips LLM-powered robotic agents with metacognitive capabilities—skill decomposition, failure reflection, and solution synthesis—to perform complex tasks in zero-shot and few-shot settings. The system significantly outperforms existing baselines and demonstrates that LLMs can generate creative solutions that diverge from ground truth while still completing tasks successfully.

research

LaDiR uses latent diffusion to improve LLM reasoning beyond autoregressive limits

Researchers propose LaDiR, a framework that replaces traditional autoregressive decoding with latent diffusion models to improve LLM reasoning. The approach encodes reasoning steps into compressed latent representations and uses bidirectional attention to refine solutions iteratively, enabling parallel exploration of diverse reasoning paths.

2 min read via arxiv.org
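The contrast with autoregressive decoding can be shown numerically: every dimension of a reasoning latent is refined in parallel at each step, rather than tokens being emitted left to right. This is a hedged toy sketch; `score_grad` stands in for the learned denoiser, not LaDiR's actual model.

```python
# Toy iterative latent refinement: all dimensions update in parallel each step.
def refine(latent, score_grad, steps=10, lr=0.3):
    for _ in range(steps):
        grad = score_grad(latent)          # "bidirectional": sees every dim
        latent = [x + lr * g for x, g in zip(latent, grad)]
    return latent

# Invented usage: a stand-in denoiser that pulls the latent toward a target.
target = [1.0, -2.0]
score = lambda z: [t - x for t, x in zip(target, z)]
refined = refine([0.0, 0.0], score)
```

Each step shrinks the error by a constant factor, so the whole latent converges jointly; an autoregressive decoder would instead fix earlier positions before later ones are generated.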
benchmark

CFE-Bench: New STEM reasoning benchmark reveals frontier models struggle with multi-step logic

Researchers introduced CFE-Bench (Classroom Final Exam), a multimodal benchmark using authentic university homework and exam problems across 20+ STEM domains to evaluate LLM reasoning capabilities. Gemini 3.1 Pro Preview achieved the highest score at 59.69% accuracy, while analysis revealed frontier models frequently fail to maintain correct intermediate states in multi-step solutions.

2 min read via arxiv.org
benchmark

HSSBench: New benchmark reveals MLLMs struggle with humanities and social sciences reasoning

Researchers have released HSSBench, a new benchmark designed to evaluate multimodal large language models on humanities and social sciences tasks—areas where current benchmarks are sparse. The benchmark contains over 13,000 samples across six key categories in multiple languages, and testing shows even state-of-the-art models struggle significantly with cross-disciplinary reasoning required for HSS domains.

research

Perception-R1 uses visual reward signals to improve multimodal AI reasoning

Researchers propose Perception-R1, a method that adds visual perception reward signals to reinforcement learning training for multimodal AI models. The approach achieves state-of-the-art results on multiple reasoning benchmarks using just 1,442 training examples by explicitly teaching models to accurately perceive visual content before reasoning about it.
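One simple way to picture "adding a visual perception reward to RL training" is shaping the outcome reward with a recall term over the objects the model described before reasoning. The weighting and function below are invented for illustration, not the paper's actual reward.

```python
# Hypothetical reward shaping: outcome reward plus a visual-perception term
# (recall of ground-truth objects the model named before reasoning).
def combined_reward(answer_correct, described, ground_truth, alpha=0.5):
    outcome = 1.0 if answer_correct else 0.0
    if ground_truth:
        perception = len(set(described) & set(ground_truth)) / len(set(ground_truth))
    else:
        perception = 0.0
    return (1 - alpha) * outcome + alpha * perception
```

With a term like this, a correct answer built on a misperceived image earns less reward than one grounded in accurate perception, which is the incentive the method describes.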
