LLM News

Every LLM release, update, and milestone.

research

Progressive Residual Warmup improves LLM pretraining stability and convergence speed

Researchers propose Progressive Residual Warmup (ProRes), a pretraining technique that staggers layer learning by gradually warming residual connections from 0 to 1, with deeper layers taking longer to activate. The method demonstrates faster convergence, stronger generalization, and improved downstream performance across multiple model scales and initialization schemes.
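The staggered schedule can be sketched as a per-layer gate that scales the residual branch and reaches 1 later for deeper layers. The schedule shape, the `base_warmup` value, and the linear depth scaling below are illustrative assumptions, not details from the paper:

```python
def residual_warmup_alpha(step: int, layer_idx: int, num_layers: int,
                          base_warmup: int = 10_000) -> float:
    """Gate in [0, 1] applied to one layer's residual branch.

    Assumption: deeper layers warm up over proportionally more
    training steps, so shallow layers start learning first. With
    alpha = 0 the layer reduces to the identity map.
    """
    warmup_steps = base_warmup * (layer_idx + 1) / num_layers
    return min(1.0, step / warmup_steps)
```

In the forward pass the block would compute `x + alpha * f(x)`, so every layer begins as an identity mapping and is blended in on its own schedule; under these assumptions, layer 0 of a 12-layer model is fully active after roughly 833 steps, while the last layer only activates after the full 10,000.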

research · NVIDIA

POET-X reduces LLM training memory by 40%, enables billion-parameter models on single H100

Researchers introduce POET-X, a memory-efficient variant of the Reparameterized Orthogonal Equivalence Training framework that reduces the computational overhead of LLM training. The method enables pretraining of billion-parameter models on a single NVIDIA H100 GPU, a regime in which standard optimizers such as AdamW run out of memory.

research

DiaBlo: diagonal block fine-tuning matches full-model performance at lower cost

Researchers introduce DiaBlo, a parameter-efficient fine-tuning method that updates only the diagonal blocks of model weight matrices instead of the full parameters. The approach matches full-model fine-tuning performance across reasoning, code generation, and safety tasks while keeping memory usage and training speed comparable to LoRA's.
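A minimal sketch of the diagonal-block idea, assuming a square weight matrix partitioned into equal-size blocks (the partitioning scheme and update rule here are assumptions, not the paper's exact formulation): only the diagonal blocks carry trainable parameters, and the effective weight is the frozen matrix plus their block-diagonal embedding.

```python
import numpy as np

def diagonal_block_delta(blocks: list) -> np.ndarray:
    """Embed trainable square blocks on the diagonal of a zero matrix.

    With k blocks of size b x b, only k * b^2 of the (k*b)^2 entries
    are trainable (a 1/k fraction); the effective weight would be
    W_frozen + delta.
    """
    b = blocks[0].shape[0]
    n = len(blocks) * b
    delta = np.zeros((n, n))
    for i, blk in enumerate(blocks):
        delta[i * b:(i + 1) * b, i * b:(i + 1) * b] = blk
    return delta


# Two trainable 2x2 blocks inside a 4x4 weight update;
# all off-diagonal blocks stay zero and are never updated.
delta = diagonal_block_delta([np.ones((2, 2)), 2 * np.ones((2, 2))])
```

Unlike LoRA, which adds a low-rank product of two thin matrices, this keeps the update full-rank within each block while still training only a small fraction of the entries.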

research

xLLM: Open-source inference framework claims 2.2x vLLM throughput on Ascend accelerators

Researchers have released xLLM, an open-source Large Language Model inference framework designed for enterprise-scale serving. The framework claims to achieve up to 2.2x higher throughput than vLLM-Ascend when serving Qwen-series models under identical latency constraints, using a novel decoupled architecture that separates service scheduling from engine optimization.

2 min read · via arxiv.org