LLM News

Every LLM release, update, and milestone.

Filtered by: model-compression
research

1.58-bit BitNet models naturally support structured sparsity with minimal accuracy loss

Researchers have demonstrated that 1.58-bit quantized language models are naturally more compatible with semi-structured N:M sparsity than full-precision models. The Sparse-BitNet framework combines the two techniques, achieving up to 1.30× speedups in training and inference while incurring less accuracy degradation than full-precision baselines at equivalent sparsity levels.
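The two operations being combined can be sketched in a few lines of numpy. This is an illustrative reconstruction, not the Sparse-BitNet implementation: `ternary_quantize` follows the BitNet-style absmean recipe for mapping weights to {-1, 0, +1}, and `apply_nm_sparsity` keeps the N largest-magnitude weights in every group of M (here 2:4). Function names are assumptions for illustration.

```python
import numpy as np

def ternary_quantize(w):
    """1.58-bit (ternary) quantization: map weights to {-1, 0, +1},
    scaled by the mean absolute value (BitNet-style absmean)."""
    scale = np.mean(np.abs(w)) + 1e-8
    return np.clip(np.round(w / scale), -1, 1), scale

def apply_nm_sparsity(w, n=2, m=4):
    """Semi-structured N:M sparsity: in every group of m consecutive
    weights, keep only the n largest by magnitude."""
    flat = w.reshape(-1, m)
    # indices of the (m - n) smallest-magnitude weights per group
    drop = np.argsort(np.abs(flat), axis=1)[:, : m - n]
    mask = np.ones_like(flat)
    np.put_along_axis(mask, drop, 0.0, axis=1)
    return (flat * mask).reshape(w.shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8))
q, scale = ternary_quantize(w)
sparse_q = apply_nm_sparsity(q)
# every group of 4 now holds at most 2 nonzero ternary weights
assert all(np.count_nonzero(g) <= 2 for g in sparse_q.reshape(-1, 4))
```

The intuition behind the compatibility result is visible here: ternary weights already contain many exact zeros, so an N:M mask often removes weights that quantization had zeroed anyway.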

2 min read · via arxiv.org
research

ButterflyMoE achieves 150× memory reduction for mixture-of-experts models via geometric rotations

Researchers introduce ButterflyMoE, a technique that replaces independent expert weight matrices with learned geometric rotations applied to a shared quantized substrate. The method reduces memory scaling from linear to sub-linear in the number of experts, achieving 150× compression at 256 experts with negligible accuracy loss on language modeling tasks.
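The memory argument can be sketched as follows. The paper's exact butterfly parameterization is not reproduced here; as a stand-in, cheap Givens (planar) rotations play the role of the learned factors, and each expert stores only a handful of angles rather than a full weight matrix. All names and sizes are illustrative assumptions.

```python
import numpy as np

def givens_rotation(d, i, j, theta):
    """Orthogonal rotation in the (i, j) coordinate plane of R^d."""
    r = np.eye(d)
    c, s = np.cos(theta), np.sin(theta)
    r[i, i] = c
    r[j, j] = c
    r[i, j] = -s
    r[j, i] = s
    return r

def expert_weight(shared, angles):
    """Hypothetical reconstruction: an expert's weight matrix is the
    shared substrate transformed by a product of planar rotations,
    so each expert adds O(#angles) parameters instead of O(d^2)."""
    d = shared.shape[0]
    w = shared
    for k, theta in enumerate(angles):
        i, j = (2 * k) % d, (2 * k + 1) % d
        w = givens_rotation(d, i, j, theta) @ w
    return w

d, n_experts, n_angles = 16, 256, 8
shared = np.random.default_rng(2).normal(size=(d, d))
# memory comparison: independent experts vs. shared substrate + angles
dense_params = n_experts * d * d            # linear in expert count
rotated_params = d * d + n_experts * n_angles  # sub-linear growth
assert dense_params / rotated_params > 25
```

Because per-expert storage shrinks from a matrix to a short angle vector, total memory grows only marginally as experts are added, which is the scaling behavior the headline 150× figure reflects.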

research

StructLens reveals hidden structural patterns across language model layers

Researchers introduce StructLens, an interpretability framework that analyzes language models by constructing maximum spanning trees from residual streams to uncover inter-layer structural relationships. The approach reveals similarity patterns distinct from conventional cosine similarity and demonstrates practical benefits for layer pruning optimization.
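The core construction, a maximum spanning tree over pairwise layer similarities, can be sketched directly. The similarity measure below (correlation of flattened activations) is an illustrative placeholder, not the metric StructLens uses; the MST step itself is standard Prim's algorithm.

```python
import numpy as np

def max_spanning_tree(sim):
    """Prim's algorithm on a dense similarity matrix: returns the
    edge list of the maximum spanning tree over layers."""
    n = sim.shape[0]
    in_tree = [0]
    edges = []
    while len(in_tree) < n:
        best = (-np.inf, None, None)
        for u in in_tree:
            for v in range(n):
                if v not in in_tree and sim[u, v] > best[0]:
                    best = (sim[u, v], u, v)
        edges.append((best[1], best[2]))
        in_tree.append(best[2])
    return edges

# toy "residual streams": one activation matrix per layer
rng = np.random.default_rng(1)
layers = [rng.normal(size=(16, 8)) for _ in range(5)]
# illustrative layer-pair similarity: correlation of flattened activations
sim = np.array([[np.corrcoef(a.ravel(), b.ravel())[0, 1]
                 for b in layers] for a in layers])
np.fill_diagonal(sim, -np.inf)  # ignore self-edges
tree = max_spanning_tree(sim)
assert len(tree) == 4  # a spanning tree over n layers has n - 1 edges
```

Layers that end up as leaves of such a tree, weakly coupled to the rest, are natural candidates for pruning, which is the practical benefit the summary points to.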

research

SiNGER framework improves vision transformer distillation by suppressing high-norm artifacts

Researchers introduce SiNGER (Singular Nullspace-Guided Energy Reallocation), a knowledge distillation framework that improves how Vision Transformer features transfer to smaller student models. The method suppresses high-norm artifacts that degrade representation quality while preserving informative signals from teacher models.
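A minimal sketch of the general idea, flagging norm-outlier tokens and projecting out their dominant direction while leaving other tokens untouched, is below. This is an illustrative stand-in, not SiNGER's actual nullspace-guided reallocation; the threshold and SVD step are assumptions.

```python
import numpy as np

def suppress_high_norm(feats, z=2.5):
    """Illustrative artifact suppression: flag tokens whose feature
    norm is a z-score outlier, estimate the dominant direction of
    those tokens via SVD, and project it out of the outlier tokens
    only, so informative tokens pass through unchanged."""
    norms = np.linalg.norm(feats, axis=1)
    outliers = norms > norms.mean() + z * norms.std()
    if not outliers.any():
        return feats
    _, _, vt = np.linalg.svd(feats[outliers], full_matrices=False)
    d = vt[0]                      # leading artifact direction
    out = feats.copy()
    out[outliers] -= np.outer(out[outliers] @ d, d)  # remove component
    return out

rng = np.random.default_rng(3)
feats = rng.normal(size=(32, 16))   # toy teacher features (tokens x dim)
feats[5, 0] += 50.0                 # inject one high-norm artifact token
cleaned = suppress_high_norm(feats)
assert np.linalg.norm(cleaned[5]) < np.linalg.norm(feats[5])
```

The design point the sketch captures is selectivity: suppression acts only on the artifact tokens, which is how such methods avoid degrading the informative signal a student model distills from.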

research

Researchers develop pruning method that challenges attention-sink assumptions in diffusion language models

A new pruning method challenges the conventional wisdom, inherited from autoregressive LLMs, that attention-sink tokens must be preserved. Researchers demonstrate that attention sinks in diffusion language models are substantially less stable than in autoregressive models, enabling more aggressive pruning without retraining.

research

New pruning technique cuts diffusion language model inference costs by identifying unstable attention sinks

Researchers have identified a fundamental difference in how attention mechanisms work in diffusion language models versus traditional autoregressive LLMs, enabling a new pruning strategy that removes unstable attention sinks without retraining. The finding challenges existing pruning assumptions inherited from autoregressive models and promises better quality-efficiency trade-offs during inference.