LLM News

Every LLM release, update, and milestone.

research

Researchers propose Mixture of Universal Experts to scale MoE models via depth-width transformation

Researchers have introduced Mixture of Universal Experts (MoUE), a generalization of Mixture-of-Experts architectures that adds a new scaling dimension called virtual width. The approach reuses a shared expert pool across layers while maintaining fixed per-token computation, achieving up to 1.3% improvements over standard MoE baselines and enabling 4.2% gains when converting existing MoE checkpoints.
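The core idea of reusing one expert pool across depth can be sketched as follows. This is a minimal illustration, not the paper's implementation: the expert shapes, ReLU FFN form, and top-k softmax gating are assumptions; the point shown is that every layer routes into the same shared experts, so per-token compute stays fixed while depth is reused.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, n_layers, top_k = 16, 4, 3, 2

# One shared pool of expert FFNs, reused by every layer (hypothetical shapes).
expert_up = rng.normal(size=(n_experts, d, 4 * d)) / np.sqrt(d)
expert_down = rng.normal(size=(n_experts, 4 * d, d)) / np.sqrt(4 * d)
# Each layer keeps its own router over the same shared pool.
routers = rng.normal(size=(n_layers, d, n_experts)) / np.sqrt(d)

def moe_layer(x, router):
    """Top-k routing into the shared pool; per-token compute is fixed at top_k experts."""
    logits = x @ router                              # (tokens, n_experts)
    chosen = np.argsort(-logits, axis=1)[:, :top_k]  # top-k experts per token
    gates = np.exp(logits - logits.max(1, keepdims=True))
    gates /= gates.sum(1, keepdims=True)             # softmax gate weights
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for e in chosen[t]:
            h = np.maximum(x[t] @ expert_up[e], 0.0)        # expert FFN (ReLU)
            out[t] += gates[t, e] * (h @ expert_down[e])
    return x + out                                   # residual connection

tokens = rng.normal(size=(5, d))
for layer in range(n_layers):                  # depth reuses the same experts;
    tokens = moe_layer(tokens, routers[layer]) # only the per-layer routers differ
print(tokens.shape)                            # (5, 16)
```

Because only the routers are layer-specific, adding depth here grows parameters far more slowly than duplicating the expert pool per layer would.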

research

ms-Mamba outperforms Transformer models on time-series forecasting with fewer parameters

Researchers introduced ms-Mamba, a multi-scale Mamba architecture for time-series forecasting that outperforms recent Transformer and Mamba-based models while using significantly fewer parameters. On the Solar-Energy dataset, ms-Mamba achieved 0.229 mean-squared error versus 0.240 for S-Mamba while using only 3.53M parameters compared to 4.77M.
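"Multi-scale" processing of a time series can be illustrated with a toy sketch. This is only a stand-in under assumptions: real ms-Mamba applies sequence (Mamba) blocks at each scale, whereas here simple average-pooling and mean summaries are used to show the general idea of extracting features at several temporal resolutions.

```python
import numpy as np

def avg_pool(x, scale):
    """Downsample a (time, features) series by non-overlapping averaging."""
    t = (x.shape[0] // scale) * scale
    return x[:t].reshape(-1, scale, x.shape[1]).mean(axis=1)

def multi_scale_features(x, scales=(1, 2, 4)):
    """Summarize the series at several temporal resolutions and concatenate.
    The per-scale mean is a stand-in for a scale-specific sequence block."""
    feats = [avg_pool(x, s).mean(axis=0) for s in scales]  # one vector per scale
    return np.concatenate(feats)

series = np.sin(np.linspace(0, 8 * np.pi, 96))[:, None]   # toy univariate series
print(multi_scale_features(series).shape)                  # (3,)
```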

research

FLoC reduces video AI token load by 50%+ without retraining using a facility location algorithm

Researchers propose FLoC, a training-free visual token compression framework that selects representative subsets of video tokens using facility location functions and lazy greedy optimization. The method works with any video-based large multimodal model without retraining, selecting near-optimal token subsets at high compression ratios on benchmarks including Video-MME, MLVU, LongVideoBench, and EgoSchema.
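Lazy greedy maximization of a facility location function can be sketched as below. This is a generic illustration of the named technique, not FLoC's code: the cosine-similarity kernel and toy token embeddings are assumptions. The objective F(S) = Σᵢ maxⱼ∈S sim(i, j) rewards keeping a subset whose members are each close to many of the original tokens, and laziness avoids recomputing every marginal gain at every step.

```python
import heapq
import numpy as np

def facility_location_select(tokens, k):
    """Greedily pick k tokens maximizing F(S) = sum_i max_{j in S} sim(i, j),
    re-evaluating stale gains lazily, only when they reach the top of the heap."""
    x = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = x @ x.T                                  # cosine similarities
    n = len(tokens)
    best = np.zeros(n)                             # current max sim to the selected set
    # Initial marginal gains (best = 0), stored negated for Python's min-heap.
    heap = [(-np.maximum(sim[:, j], 0.0).sum(), j) for j in range(n)]
    heapq.heapify(heap)
    selected = []
    while len(selected) < k and heap:
        _, j = heapq.heappop(heap)
        gain = np.maximum(sim[:, j] - best, 0.0).sum()  # recompute true marginal gain
        if heap and gain < -heap[0][0] - 1e-12:
            heapq.heappush(heap, (-gain, j))       # stale entry: push back, try next
            continue
        selected.append(j)                         # gain still beats the runner-up
        best = np.maximum(best, sim[:, j])
    return selected

rng = np.random.default_rng(0)
video_tokens = rng.normal(size=(40, 8))            # toy "video token" embeddings
kept = facility_location_select(video_tokens, 20)  # 50% compression
print(len(kept))                                   # 20
```

Because facility location is monotone submodular, this greedy selection carries the standard (1 − 1/e) approximation guarantee, which is what makes training-free near-optimal subset selection feasible.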

research

FreeAct framework relaxes quantization constraints for multimodal and diffusion LLMs

Researchers propose FreeAct, a quantization framework that abandons static one-to-one transformation constraints to handle dynamic activation patterns in multimodal and diffusion LLMs. The method assigns token-specific transformation matrices to activations while keeping weights unified, demonstrating up to 5.3% performance improvements over existing approaches.
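Why static, one-size-fits-all quantization struggles with dynamic activations can be shown with the simplest token-specific transform: a per-token (diagonal) scale. This is a motivating sketch, not FreeAct's actual transformation matrices; the outlier-token setup and symmetric int8 fake-quantization are assumptions.

```python
import numpy as np

def quantize(x, scale):
    """Symmetric int8 fake-quantization with a given scale (per-tensor or per-token)."""
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale

rng = np.random.default_rng(0)
acts = rng.normal(size=(8, 64))
acts[2] *= 30.0                        # one outlier token, as in dynamic activations

per_tensor = np.abs(acts).max() / 127                      # single static scale
per_token = np.abs(acts).max(axis=1, keepdims=True) / 127  # token-specific scales

err_static = np.abs(acts - quantize(acts, per_tensor)).mean()
err_dynamic = np.abs(acts - quantize(acts, per_token)).mean()
print(err_dynamic < err_static)        # True: token-specific scaling absorbs outliers
```

A single static scale must stretch to cover the outlier token, coarsening every ordinary token; a token-specific transform adapts per token while the weight matrix stays unified, which is the trade-off the summary describes.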

research

New pruning technique cuts diffusion language model inference costs by identifying unstable attention sinks

Researchers have identified a fundamental difference in how attention mechanisms work in diffusion language models versus traditional autoregressive LLMs, enabling a new pruning strategy that removes unstable attention sinks without retraining. The finding challenges existing pruning assumptions inherited from autoregressive models and promises better quality-efficiency trade-offs during inference.
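One plausible way to operationalize "unstable attention sinks" is to track how much incoming attention each key position collects at each denoising step and measure how that mass fluctuates across steps. The scoring criterion below is a hypothetical sketch, not the paper's method; the toy attention maps and the mean/std statistics are assumptions.

```python
import numpy as np

def sink_stability(attn_maps):
    """attn_maps: (steps, queries, keys) attention weights collected across
    diffusion denoising steps. Returns per-key sink strength and a
    hypothetical instability score (step-to-step variability)."""
    per_step_mass = attn_maps.mean(axis=1)   # (steps, keys): avg incoming attention
    strength = per_step_mass.mean(axis=0)    # average sink strength per key
    instability = per_step_mass.std(axis=0)  # how much the sink shifts across steps
    return strength, instability

rng = np.random.default_rng(0)
steps, q, k = 6, 10, 10
logits = rng.normal(size=(steps, q, k))
logits[..., 0] += 4.0                   # key 0: a sink present in every step
logits[3:, :, 5] += 4.0                 # key 5: a sink that appears only late
attn = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)

strength, instability = sink_stability(attn)
print(instability[5] > instability[1])  # True: key 5 fluctuates, key 1 never sinks
```

Under this reading, a sink whose attention mass comes and goes across denoising steps is a candidate for pruning, whereas the persistent sink (key 0 above) would be kept.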