inference-optimization
4 articles tagged with inference-optimization
NVIDIA Releases Nemotron-Labs Diffusion Models With 6.4× Faster Token Generation Than Autoregressive Decoding
NVIDIA has released Nemotron-Labs Diffusion, a family of diffusion language models at 3B, 8B, and 14B scales that generate multiple tokens in parallel rather than one at a time. The 8B model achieves 6.4× higher tokens per forward pass than autoregressive models in self-speculation mode while maintaining comparable accuracy.
DeepSeek V4 cuts inference costs with 1.6T parameter model using 13.7x less memory than V3
DeepSeek released V4 in two versions: a 284 billion parameter Flash model and a 1.6 trillion parameter Pro model with 49 billion active parameters. According to DeepSeek, the models use 9.5x-13.7x less memory than V3 through compressed attention mechanisms and FP4/FP8 mixed precision, while supporting a 1 million token context window.
NVIDIA releases gpt-oss-puzzle-88B, 88B-parameter reasoning model with 1.63× throughput gains
NVIDIA released gpt-oss-puzzle-88B on March 26, 2026, a 88-billion parameter mixture-of-experts model optimized for inference efficiency on H100 hardware. Built using the Puzzle post-training neural architecture search framework, the model achieves 1.63× throughput improvement in long-context (64K/64K) scenarios and up to 2.82× improvement on single H100 GPUs compared to its parent gpt-oss-120B, while matching or exceeding accuracy across reasoning effort levels.
Stability AI and NVIDIA launch Stable Diffusion 3.5 NIM for faster image generation
Stability AI and NVIDIA have launched Stable Diffusion 3.5 NIM, a microservice designed to accelerate image generation performance and simplify enterprise deployment. The collaboration packages Stable Diffusion 3.5 as an NVIDIA NIM (NVIDIA Inference Microservice) for optimized inference.