LLM News

Every LLM release, update, and milestone.

research

Researchers propose Mixture of Universal Experts to scale MoE models via depth-width transformation

Researchers have introduced Mixture of Universal Experts (MoUE), a generalization of Mixture-of-Experts architectures that adds a new scaling dimension called virtual width. The approach reuses a shared expert pool across layers while maintaining fixed per-token computation, achieving up to 1.3% improvements over standard MoE baselines and enabling 4.2% gains when converting existing MoE checkpoints.
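The core idea of reusing one expert pool across depth can be sketched as follows. This is a minimal illustration, not the paper's implementation: the expert shapes, ReLU FFN form, and top-k softmax gating are assumptions; the point shown is that every layer routes into the same shared experts, so per-token compute stays fixed while depth is reused.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, n_layers, top_k = 16, 4, 3, 2

# One shared pool of expert FFNs, reused by every layer (hypothetical shapes).
expert_up = rng.normal(size=(n_experts, d, 4 * d)) / np.sqrt(d)
expert_down = rng.normal(size=(n_experts, 4 * d, d)) / np.sqrt(4 * d)
# Each layer keeps its own router over the same shared pool.
routers = rng.normal(size=(n_layers, d, n_experts)) / np.sqrt(d)

def moe_layer(x, router):
    """Top-k routing into the shared pool; per-token compute is fixed at top_k experts."""
    logits = x @ router                              # (tokens, n_experts)
    chosen = np.argsort(-logits, axis=1)[:, :top_k]  # top-k experts per token
    gates = np.exp(logits - logits.max(1, keepdims=True))
    gates /= gates.sum(1, keepdims=True)             # softmax gate weights
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for e in chosen[t]:
            h = np.maximum(x[t] @ expert_up[e], 0.0)        # expert FFN (ReLU)
            out[t] += gates[t, e] * (h @ expert_down[e])
    return x + out                                   # residual connection

tokens = rng.normal(size=(5, d))
for layer in range(n_layers):                  # depth reuses the same experts;
    tokens = moe_layer(tokens, routers[layer]) # only the per-layer routers differ
print(tokens.shape)                            # (5, 16)
```

Because only the routers are layer-specific, adding depth here grows parameters far more slowly than duplicating the expert pool per layer would.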

research

ms-Mamba outperforms Transformer models on time-series forecasting with fewer parameters

Researchers introduced ms-Mamba, a multi-scale Mamba architecture for time-series forecasting that outperforms recent Transformer and Mamba-based models while using significantly fewer parameters. On the Solar-Energy dataset, ms-Mamba achieved 0.229 mean-squared error versus 0.240 for S-Mamba while using only 3.53M parameters compared to 4.77M.
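"Multi-scale" processing of a time series can be illustrated with a toy sketch. This is only a stand-in under assumptions: real ms-Mamba applies sequence (Mamba) blocks at each scale, whereas here simple average-pooling and mean summaries are used to show the general idea of extracting features at several temporal resolutions.

```python
import numpy as np

def avg_pool(x, scale):
    """Downsample a (time, features) series by non-overlapping averaging."""
    t = (x.shape[0] // scale) * scale
    return x[:t].reshape(-1, scale, x.shape[1]).mean(axis=1)

def multi_scale_features(x, scales=(1, 2, 4)):
    """Summarize the series at several temporal resolutions and concatenate.
    The per-scale mean is a stand-in for a scale-specific sequence block."""
    feats = [avg_pool(x, s).mean(axis=0) for s in scales]  # one vector per scale
    return np.concatenate(feats)

series = np.sin(np.linspace(0, 8 * np.pi, 96))[:, None]   # toy univariate series
print(multi_scale_features(series).shape)                  # (3,)
```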

research

FLoC reduces video AI token load by 50%+ without retraining using a facility location algorithm

Researchers propose FLoC, a training-free visual token compression framework that selects representative subsets of video tokens using facility location functions and lazy greedy optimization. The method works with any video-based large multimodal model without retraining, selecting near-optimal token subsets at high compression ratios on benchmarks including Video-MME, MLVU, LongVideoBench, and EgoSchema.
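Lazy greedy maximization of a facility location function can be sketched as below. This is a generic illustration of the named technique, not FLoC's code: the cosine-similarity kernel and toy token embeddings are assumptions. The objective F(S) = Σᵢ maxⱼ∈S sim(i, j) rewards keeping a subset whose members are each close to many of the original tokens, and laziness avoids recomputing every marginal gain at every step.

```python
import heapq
import numpy as np

def facility_location_select(tokens, k):
    """Greedily pick k tokens maximizing F(S) = sum_i max_{j in S} sim(i, j),
    re-evaluating stale gains lazily, only when they reach the top of the heap."""
    x = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = x @ x.T                                  # cosine similarities
    n = len(tokens)
    best = np.zeros(n)                             # current max sim to the selected set
    # Initial marginal gains (best = 0), stored negated for Python's min-heap.
    heap = [(-np.maximum(sim[:, j], 0.0).sum(), j) for j in range(n)]
    heapq.heapify(heap)
    selected = []
    while len(selected) < k and heap:
        _, j = heapq.heappop(heap)
        gain = np.maximum(sim[:, j] - best, 0.0).sum()  # recompute true marginal gain
        if heap and gain < -heap[0][0] - 1e-12:
            heapq.heappush(heap, (-gain, j))       # stale entry: push back, try next
            continue
        selected.append(j)                         # gain still beats the runner-up
        best = np.maximum(best, sim[:, j])
    return selected

rng = np.random.default_rng(0)
video_tokens = rng.normal(size=(40, 8))            # toy "video token" embeddings
kept = facility_location_select(video_tokens, 20)  # 50% compression
print(len(kept))                                   # 20
```

Because facility location is monotone submodular, this greedy selection carries the standard (1 − 1/e) approximation guarantee, which is what makes training-free near-optimal subset selection feasible.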

research

FreeAct framework relaxes quantization constraints for multimodal and diffusion LLMs

Researchers propose FreeAct, a quantization framework that abandons static one-to-one transformation constraints to handle dynamic activation patterns in multimodal and diffusion LLMs. The method assigns token-specific transformation matrices to activations while keeping weights unified, demonstrating up to 5.3% performance improvements over existing approaches.
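Why static, one-size-fits-all quantization struggles with dynamic activations can be shown with the simplest token-specific transform: a per-token (diagonal) scale. This is a motivating sketch, not FreeAct's actual transformation matrices; the outlier-token setup and symmetric int8 fake-quantization are assumptions.

```python
import numpy as np

def quantize(x, scale):
    """Symmetric int8 fake-quantization with a given scale (per-tensor or per-token)."""
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale

rng = np.random.default_rng(0)
acts = rng.normal(size=(8, 64))
acts[2] *= 30.0                        # one outlier token, as in dynamic activations

per_tensor = np.abs(acts).max() / 127                      # single static scale
per_token = np.abs(acts).max(axis=1, keepdims=True) / 127  # token-specific scales

err_static = np.abs(acts - quantize(acts, per_tensor)).mean()
err_dynamic = np.abs(acts - quantize(acts, per_token)).mean()
print(err_dynamic < err_static)        # True: token-specific scaling absorbs outliers
```

A single static scale must stretch to cover the outlier token, coarsening every ordinary token; a token-specific transform adapts per token while the weight matrix stays unified, which is the trade-off the summary describes.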

research

New pruning technique cuts diffusion language model inference costs by identifying unstable attention sinks

Researchers have identified a fundamental difference in how attention mechanisms work in diffusion language models versus traditional autoregressive LLMs, enabling a new pruning strategy that removes unstable attention sinks without retraining. The finding challenges existing pruning assumptions inherited from autoregressive models and promises better quality-efficiency trade-offs during inference.
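One plausible way to operationalize "unstable attention sinks" is to track how much incoming attention each key position collects at each denoising step and measure how that mass fluctuates across steps. The scoring criterion below is a hypothetical sketch, not the paper's method; the toy attention maps and the mean/std statistics are assumptions.

```python
import numpy as np

def sink_stability(attn_maps):
    """attn_maps: (steps, queries, keys) attention weights collected across
    diffusion denoising steps. Returns per-key sink strength and a
    hypothetical instability score (step-to-step variability)."""
    per_step_mass = attn_maps.mean(axis=1)   # (steps, keys): avg incoming attention
    strength = per_step_mass.mean(axis=0)    # average sink strength per key
    instability = per_step_mass.std(axis=0)  # how much the sink shifts across steps
    return strength, instability

rng = np.random.default_rng(0)
steps, q, k = 6, 10, 10
logits = rng.normal(size=(steps, q, k))
logits[..., 0] += 4.0                   # key 0: a sink present in every step
logits[3:, :, 5] += 4.0                 # key 5: a sink that appears only late
attn = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)

strength, instability = sink_stability(attn)
print(instability[5] > instability[1])  # True: key 5 fluctuates, key 1 never sinks
```

Under this reading, a sink whose attention mass comes and goes across denoising steps is a candidate for pruning, whereas the persistent sink (key 0 above) would be kept.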