LLM News

Every LLM release, update, and milestone.

Filtered by: quantization
research

1.58-bit BitNet models naturally support structured sparsity with minimal accuracy loss

Researchers have demonstrated that 1.58-bit quantized language models are naturally more compatible with semi-structured N:M sparsity than full-precision models. The Sparse-BitNet framework applies both techniques simultaneously, achieving up to 1.30× speedups in training and inference while incurring less accuracy degradation than full-precision baselines at equivalent sparsity levels.
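The blurb names the two ingredients but not their mechanics. A minimal sketch of each, assuming BitNet-style absmean ternary quantization and magnitude-based 2:4 pruning; the function names are hypothetical, not from the Sparse-BitNet paper:

```python
import numpy as np

def quantize_ternary(w, eps=1e-8):
    """1.58-bit (ternary) quantization: scale by the mean absolute
    value, then round each weight to {-1, 0, +1}."""
    scale = np.mean(np.abs(w)) + eps
    return np.clip(np.round(w / scale), -1, 1), scale

def apply_nm_sparsity(w, n=2, m=4):
    """Semi-structured N:M sparsity: in every group of m consecutive
    weights, keep the n largest-magnitude entries and zero the rest."""
    flat = w.reshape(-1, m)
    # indices of the (m - n) smallest-magnitude entries per group
    drop = np.argsort(np.abs(flat), axis=1)[:, : m - n]
    mask = np.ones_like(flat)
    np.put_along_axis(mask, drop, 0.0, axis=1)
    return (flat * mask).reshape(w.shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8))
q, scale = quantize_ternary(w)        # entries in {-1, 0, +1}
sparse_q = apply_nm_sparsity(q)       # at most 2 nonzeros per group of 4
```

Intuitively, ternary weights already contain many exact zeros, so forcing a 2:4 pattern discards less information than it would in a full-precision matrix.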

2 min read · via arxiv.org
research

ButterflyMoE achieves 150× memory reduction for mixture-of-experts models via geometric rotations

Researchers introduce ButterflyMoE, a technique that replaces independent expert weight matrices with learned geometric rotations applied to a shared quantized substrate. The method reduces memory scaling from linear to sub-linear in the number of experts, achieving 150× compression at 256 experts with negligible accuracy loss on language modeling tasks.
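The blurb's key idea, experts as learned rotations of one shared quantized substrate, can be sketched as follows. This toy version builds each rotation from dense Givens factors for clarity; a real butterfly factorization would store only the rotation angles (far fewer parameters than a full expert matrix), and all class and function names here are hypothetical:

```python
import numpy as np

def givens_rotation(d, i, j, theta):
    """Dense d x d rotation in the (i, j) plane. Shown dense for
    clarity; only the angle theta would actually be stored."""
    g = np.eye(d)
    c, s = np.cos(theta), np.sin(theta)
    g[i, i] = c; g[j, j] = c
    g[i, j] = -s; g[j, i] = s
    return g

class RotatedExperts:
    """Toy sketch: every expert is a learned rotation of one shared
    (notionally quantized) weight matrix, so per-expert storage is a
    small vector of angles instead of a full weight matrix."""
    def __init__(self, d_in, d_out, n_experts, rng):
        self.shared = rng.normal(size=(d_in, d_out))  # shared substrate
        # one angle per adjacent (i, i+1) plane, per expert
        self.angles = rng.uniform(-np.pi, np.pi, size=(n_experts, d_in - 1))

    def expert_weight(self, e):
        w = self.shared
        for i, theta in enumerate(self.angles[e]):
            w = givens_rotation(w.shape[0], i, i + 1, theta) @ w
        return w

rng = np.random.default_rng(0)
moe = RotatedExperts(d_in=8, d_out=4, n_experts=16, rng=rng)
w3 = moe.expert_weight(3)
# rotations are orthogonal, so they preserve column norms of the substrate
assert np.allclose(np.linalg.norm(w3, axis=0),
                   np.linalg.norm(moe.shared, axis=0))
```

Because the shared substrate is stored once, adding experts only adds angle vectors, which is how memory scaling drops below linear in the expert count.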

research

FreeAct framework relaxes quantization constraints for multimodal and diffusion LLMs

Researchers propose FreeAct, a quantization framework that abandons static one-to-one transformation constraints to handle dynamic activation patterns in multimodal and diffusion LLMs. The method assigns token-specific transformation matrices to activations while keeping weights unified, demonstrating up to 5.3% performance improvements over existing approaches.

model release

Alibaba releases Qwen3.5-35B-A3B-FP8, a quantized multimodal model for efficient deployment

Alibaba's Qwen team released Qwen3.5-35B-A3B-FP8 on Hugging Face, a quantized version of their 35-billion parameter multimodal model. The FP8 quantization reduces model size and memory requirements while maintaining the base model's image-text-to-text capabilities. The model is compatible with standard Transformers endpoints and Azure deployment.

1 min read · via huggingface.co