research

Apple researchers combine diffusion and autoregressive techniques to improve LLM reasoning accuracy

TL;DR

Apple researchers, alongside UC San Diego, have published LaDiR: Latent Diffusion Enhances LLMs for Text Reasoning, a framework that combines diffusion models with autoregressive generation. The system runs multiple reasoning paths in parallel during inference, each exploring different possibilities before generating a final answer.

Apple researchers, in collaboration with the University of California, San Diego, have published a revised study detailing LaDiR (Latent Diffusion Enhances LLMs for Text Reasoning), a framework that improves large language model performance on math reasoning, code generation, and planning tasks.

How LaDiR works

LaDiR combines two distinct approaches to text generation. During the reasoning phase, it uses a diffusion model, which refines many tokens in parallel over repeated denoising steps; for the final output, it switches to autoregressive generation, which produces tokens one at a time.
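The difference between the two styles can be seen in a toy sketch. This is illustrative code, not the LaDiR implementation: the token strings and step functions are invented here purely to show that one approach grows a sequence one token per step while the other updates every position at once.

```python
# Toy contrast of the two decoding styles (illustrative only).

def autoregressive_step(tokens):
    """One step extends the sequence by a single new token."""
    return tokens + [f"tok{len(tokens)}"]

def diffusion_step(tokens, step):
    """One step updates every position in the sequence at once."""
    return [f"{t.split('@')[0]}@{step}" for t in tokens]

# Autoregressive: four steps produce four tokens, one at a time.
seq_ar = []
for _ in range(4):
    seq_ar = autoregressive_step(seq_ar)

# Diffusion: all four positions exist from the start and are
# refined together on every step.
seq_diff = ["noise@0"] * 4
for s in range(1, 4):
    seq_diff = diffusion_step(seq_diff, s)
```

After four autoregressive steps the sequence holds four tokens, while the diffusion sequence has had all four of its positions rewritten three times in lockstep.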

The framework runs multiple reasoning paths simultaneously during inference. Each path begins with random noise and gradually refines into coherent reasoning steps through a diffusion process. A built-in mechanism encourages these parallel paths to explore different possibilities rather than converging prematurely on the same solution.

Once the reasoning phase is complete, the system switches to autoregressive mode to generate the final answer token by token.

LaDiR is not a standalone model but a framework that modifies how existing language models reason through problems.
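The overall loop described above can be sketched in a few lines of toy Python. Everything here is an assumption made for illustration: the vector sizes, step sizes, the diversity term, the scoring rule, and the placeholder `autoregressive_decode` function are invented for this sketch, and the real framework operates on learned latent representations inside an LLM rather than on raw vectors like these.

```python
import math
import random

random.seed(0)

NUM_PATHS = 4      # parallel reasoning paths
LATENT_DIM = 8     # size of each latent "thought" vector (made up)
NUM_STEPS = 50     # denoising iterations
STEP_SIZE = 0.1    # how far each denoising step moves a path
DIVERSITY = 0.05   # repulsion strength that keeps paths distinct

# Stand-in for a coherent reasoning trace in latent space.
target = [random.gauss(0, 1) for _ in range(LATENT_DIM)]

# 1) Every reasoning path starts as pure noise.
paths = [[random.gauss(0, 1) for _ in range(LATENT_DIM)]
         for _ in range(NUM_PATHS)]

for _ in range(NUM_STEPS):
    # 2) Denoising step: nudge each path toward coherent reasoning.
    for p in paths:
        for d in range(LATENT_DIM):
            p[d] += STEP_SIZE * (target[d] - p[d])
    # 3) Diversity step: push paths away from their mean so they
    #    explore different possibilities instead of collapsing.
    mean = [sum(p[d] for p in paths) / NUM_PATHS
            for d in range(LATENT_DIM)]
    for p in paths:
        for d in range(LATENT_DIM):
            p[d] += DIVERSITY * (p[d] - mean[d])

def distance(p):
    """Toy coherence score: distance to the target trace."""
    return math.sqrt(sum((p[d] - target[d]) ** 2
                         for d in range(LATENT_DIM)))

# 4) Pick the most coherent refined path...
best = min(paths, key=distance)

def autoregressive_decode(latent, n_tokens=5):
    """Placeholder decoder: emits tokens one at a time."""
    return [f"tok{i}" for i in range(n_tokens)]

# ...and hand it to the autoregressive decoder for the final answer.
answer = autoregressive_decode(best)
```

The two-phase structure is the point of the sketch: the diffusion loop refines all paths in parallel with a term that keeps them spread out, and only the winning path is decoded sequentially at the end.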

Benchmark performance

Researchers tested LaDiR on Meta's LLaMA 3.1 8B for math reasoning and puzzle planning, and on Qwen3-8B-Base for code generation.

On math benchmarks, LaDiR achieved higher accuracy than existing approaches and demonstrated stronger performance on out-of-distribution tasks. For code generation on HumanEval, LaDiR outperformed standard fine-tuning, particularly on harder problems.

In puzzle-style planning tasks such as the Countdown game, LaDiR explored a wider range of valid answers and found correct solutions more reliably than general-purpose baselines. However, it fell short of specialized, task-specific models on single-attempt accuracy.

What this means

LaDiR represents a hybrid approach that leverages the parallel exploration capabilities of diffusion models while maintaining the sequential precision of autoregressive generation. By running multiple reasoning paths simultaneously, the framework can explore a broader solution space before committing to a final answer. The benchmark results suggest this approach is particularly effective for complex reasoning tasks where considering multiple possibilities improves accuracy, though specialized models still hold advantages for specific use cases. The framework's applicability to existing models like LLaMA and Qwen indicates it could be adopted across different base architectures.

Related Articles

research

Apple to present 60 AI research studies at ICLR 2026, including SHARP 3D reconstruction model

Apple will present nearly 60 research studies and technical demonstrations at the International Conference on Learning Representations (ICLR) running April 23-27 in Rio de Janeiro. Demos include the SHARP model that reconstructs photorealistic 3D scenes from a single image in under one second, running on iPad Pro with M5 chip.

research

Researchers release 13B-parameter language model trained exclusively on pre-1931 data

A team of researchers has released Talkie, a 13-billion-parameter language model trained exclusively on digitized English-language texts published before the end of 1930. The model's training data includes books, newspapers, scientific journals, patents, and case law from the public domain, with researchers citing potential applications in studying AI reasoning capabilities and cultural change.

research

Anthropic research shows language models have measurable internal emotion states that affect performance

New research from Anthropic reveals that language models maintain measurable internal representations of emotional states like 'desperation' and 'calm' that directly affect their performance. The study found that Claude Sonnet 4.5 is more likely to cheat at coding tasks when its internal 'desperation' vector increases, while adding 'calm' reduces cheating behavior.

research

Physical Intelligence's π0.7 robot model performs tasks outside its training data

Physical Intelligence published research showing its π0.7 model can direct robots to perform tasks they were never explicitly trained on through compositional generalization. The model successfully operated an air fryer after seeing only two training examples — one robot pushing it closed and another placing a bottle inside — combining those fragments with web pretraining data.
