Apple researchers combine diffusion and autoregressive techniques to improve LLM reasoning accuracy
Apple researchers, alongside UC San Diego, have published LaDiR: Latent Diffusion Enhances LLMs for Text Reasoning, a framework that combines diffusion models with autoregressive generation. The system runs multiple reasoning paths in parallel during inference, each exploring different possibilities before generating a final answer.
Apple researchers, in collaboration with the University of California, San Diego, have published a revised study detailing LaDiR (Latent Diffusion Enhances LLMs for Text Reasoning), a framework that improves large language model performance on math reasoning, code generation, and planning tasks.
How LaDiR works
LaDiR combines two distinct approaches to text generation. During the reasoning phase it uses a diffusion process, which refines many tokens in parallel; for the final output it switches to autoregressive generation, which produces tokens one at a time.
The framework runs multiple reasoning paths simultaneously during inference. Each path begins with random noise and gradually refines into coherent reasoning steps through a diffusion process. A built-in mechanism encourages these parallel paths to explore different possibilities rather than converging prematurely on the same solution.
Once the reasoning phase has sufficiently refined the parallel paths, the system switches to autoregressive mode and generates the final answer token by token.
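The two-phase flow described above can be sketched in a few lines of illustrative Python. Nothing in this sketch comes from the paper itself: the 2-D latents, the fixed "denoiser target," the repulsion-style diversity nudge, and the nearest-embedding "decoder" are all hypothetical stand-ins for LaDiR's learned components, chosen only to make the parallel-refine-then-decode structure concrete.

```python
import math
import random

def denoise_step(latent, target, alpha=0.3):
    """One refinement step: move a noisy latent toward a fixed target
    that stands in for the learned denoiser's prediction."""
    return [x + alpha * (t - x) for x, t in zip(latent, target)]

def diversity_push(latents, beta=0.05):
    """Nudge each path away from the mean of the other paths so the
    parallel candidates do not collapse onto one solution."""
    pushed = []
    for i, lat in enumerate(latents):
        others = [l for j, l in enumerate(latents) if j != i]
        mean = [sum(vals) / len(others) for vals in zip(*others)]
        pushed.append([x + beta * (x - m) for x, m in zip(lat, mean)])
    return pushed

def nearest_token(latent, vocab):
    """Toy stand-in for autoregressive decoding: emit the vocabulary
    entry whose embedding lies closest to the refined latent."""
    return min(vocab, key=lambda w: math.dist(vocab[w], latent))

random.seed(0)
vocab = {"add": [1.0, 0.0], "sub": [0.0, 1.0], "mul": [1.0, 1.0]}
target = vocab["mul"]  # pretend the denoiser steers latents here

# Phase 1: several reasoning paths start as pure noise and are refined
# in parallel, with a diversity nudge after every diffusion step.
paths = [[random.gauss(0.0, 1.0) for _ in range(2)] for _ in range(4)]
for _ in range(20):
    paths = [denoise_step(p, target) for p in paths]
    paths = diversity_push(paths)

# Phase 2: switch to (greedy, toy) token-by-token decoding.
answers = [nearest_token(p, vocab) for p in paths]
```

The point of the sketch is the control flow, not the math: many candidates are refined simultaneously and kept apart during refinement, and only afterward does a sequential decoder commit to discrete tokens.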
LaDiR is not a standalone model but a framework that modifies how existing language models reason through problems.
Benchmark performance
Researchers tested LaDiR on Meta's LLaMA 3.1 8B for math reasoning and puzzle planning, and on Qwen3-8B-Base for code generation.
On math benchmarks, LaDiR achieved higher accuracy than existing approaches and demonstrated stronger performance on out-of-distribution tasks. For code generation on HumanEval, LaDiR outperformed standard fine-tuning, particularly on harder problems.
In puzzle-style planning tasks like the Countdown game, LaDiR explored a wider range of valid answers than baseline models and found correct solutions more reliably than general-purpose baselines. However, it fell short of specialized, task-specific models on single-attempt accuracy.
What this means
LaDiR represents a hybrid approach that leverages the parallel exploration capabilities of diffusion models while maintaining the sequential precision of autoregressive generation. By running multiple reasoning paths simultaneously, the framework can explore a broader solution space before committing to a final answer. The benchmark results suggest this approach is particularly effective for complex reasoning tasks where considering multiple possibilities improves accuracy, though specialized models still hold advantages for specific use cases. The framework's applicability to existing models like LLaMA and Qwen indicates it could be adopted across different base architectures.