inference-optimization

4 articles tagged with inference-optimization

May 23, 2026

NVIDIA Releases Nemotron-Labs Diffusion Models With 6.4× Faster Token Generation Than Autoregressive Decoding

NVIDIA has released Nemotron-Labs Diffusion, a family of diffusion language models at 3B, 8B, and 14B scales that generate multiple tokens in parallel rather than one at a time. The 8B model achieves 6.4× higher tokens per forward pass than autoregressive models in self-speculation mode while maintaining comparable accuracy.

May 23, 2026 · 12:21 AM

April 24, 2026

model releaseDeepSeek

DeepSeek V4 cuts inference costs with 1.6T parameter model using 13.7x less memory than V3

DeepSeek released V4 in two versions: a 284 billion parameter Flash model and a 1.6 trillion parameter Pro model with 49 billion active parameters. According to DeepSeek, the models use 9.5x-13.7x less memory than V3 through compressed attention mechanisms and FP4/FP8 mixed precision, while supporting a 1 million token context window.

April 24, 2026 · 9:36 PM

March 28, 2026

model releaseNVIDIA

NVIDIA releases gpt-oss-puzzle-88B, 88B-parameter reasoning model with 1.63× throughput gains

NVIDIA released gpt-oss-puzzle-88B on March 26, 2026, a 88-billion parameter mixture-of-experts model optimized for inference efficiency on H100 hardware. Built using the Puzzle post-training neural architecture search framework, the model achieves 1.63× throughput improvement in long-context (64K/64K) scenarios and up to 2.82× improvement on single H100 GPUs compared to its parent gpt-oss-120B, while matching or exceeding accuracy across reasoning effort levels.

March 28, 2026 · 2:20 PM

March 24, 2026

model releaseStability AI

Stability AI and NVIDIA launch Stable Diffusion 3.5 NIM for faster image generation

Stability AI and NVIDIA have launched Stable Diffusion 3.5 NIM, a microservice designed to accelerate image generation performance and simplify enterprise deployment. The collaboration packages Stable Diffusion 3.5 as an NVIDIA NIM (NVIDIA Inference Microservice) for optimized inference.

March 24, 2026 · 5:22 PM

← Back to all news