model release

NVIDIA releases gpt-oss-puzzle-88B, 88B-parameter reasoning model with 1.63× throughput gains

TL;DR

On March 26, 2026, NVIDIA released gpt-oss-puzzle-88B, an 88-billion-parameter mixture-of-experts model optimized for inference efficiency on H100 hardware. Built with the Puzzle post-training neural architecture search framework, the model achieves a 1.63× throughput improvement in long-context (64K/64K) scenarios and up to a 2.82× improvement on a single H100 GPU compared to its parent gpt-oss-120b, while matching or exceeding its accuracy across reasoning effort levels.


NVIDIA Releases gpt-oss-puzzle-88B: Inference-Optimized Reasoning Model

On March 26, 2026, NVIDIA released gpt-oss-puzzle-88B, a deployment-optimized large language model derived from OpenAI's gpt-oss-120b. The model cuts the parameter count to 88B (73% of the parent) while improving inference throughput for reasoning workloads, particularly on NVIDIA H100-class hardware.

Performance Metrics

Compared to its 120B parent model:

  • Long-context (64K/64K) throughput: 1.63× improvement on 8×H100 node
  • Short-context (4K/4K) throughput: 1.22× improvement on 8×H100 node
  • Single H100 GPU throughput: Up to 2.82× improvement
  • Accuracy: Matches or slightly exceeds parent across reasoning effort budgets

The model targets inference bottlenecks in KV-cache bandwidth and memory capacity rather than raw compute; these are the primary constraints for reasoning models on H100s.

Architecture and Optimizations

gpt-oss-puzzle-88B uses a mixture-of-experts decoder-only transformer with three key architectural optimizations:

Heterogeneous MoE Expert Pruning: Each MoE layer retains a different number of experts via activation-based importance scoring. Early layers preserve more experts; later layers are aggressively pruned.
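Activation-based importance scoring of this kind can be sketched in a few lines. The scoring rule below (mean routed probability mass per expert over a calibration batch) and the per-layer budgets are illustrative assumptions, not NVIDIA's published recipe:

```python
import numpy as np

rng = np.random.default_rng(0)

def expert_importance(router_probs: np.ndarray) -> np.ndarray:
    """Score each expert by the mean probability mass the router
    assigns to it over a calibration batch (shape: tokens x experts)."""
    return router_probs.mean(axis=0)

def prune_experts(router_probs: np.ndarray, keep: int) -> np.ndarray:
    """Indices of the `keep` highest-scoring experts, in ascending order."""
    scores = expert_importance(router_probs)
    return np.sort(np.argsort(scores)[-keep:])

# Toy setting: 4 MoE layers with 32 experts each; early layers keep
# more experts, later layers are pruned harder (budgets are made up).
keep_per_layer = [28, 24, 16, 12]
kept_sets = []
for keep in keep_per_layer:
    probs = rng.dirichlet(np.ones(32), size=1024)  # stand-in router outputs
    kept_sets.append(prune_experts(probs, keep))

print([len(k) for k in kept_sets])  # → [28, 24, 16, 12]
```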

Selective Window Attention: A subset of global attention layers is replaced with 8K-window attention, reducing KV-cache footprint by ~40% in long-context scenarios while preserving long-range reasoning capability.
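The arithmetic behind that footprint reduction is straightforward. The sketch below assumes a 36-layer model with an even split between global and 8K-window layers; both are guesses, since the announcement only gives the ~40% figure:

```python
def kv_cache_tokens(global_layers: int, window_layers: int,
                    seq_len: int, window: int) -> int:
    """Total cached tokens across layers: global layers cache the full
    sequence, windowed layers cache at most `window` tokens."""
    return global_layers * seq_len + window_layers * min(seq_len, window)

# All-global baseline vs. a half-windowed stack at 64K context.
full_global = kv_cache_tokens(36, 0, 64_000, 8_000)
half_window = kv_cache_tokens(18, 18, 64_000, 8_000)
print(f"{1 - half_window / full_global:.0%}")  # → 44%
```

Under these assumed layer counts the reduction lands at 44%, in the neighborhood of the quoted ~40%.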

RoPE Scaling Adjustment: The YaRN RoPE scaling factor is increased to improve stability at 128K context length.

Training Pipeline

The model underwent three-stage optimization:

  1. Knowledge Distillation (84B tokens, 128K sequence length): Restored inter-block compatibility and recovered quality lost during architecture search, using the Megatron-LM framework

  2. Reinforcement Learning: Post-distillation RL phase applied across math, coding, and reasoning domains with two complementary policies, one high-effort-focused (maximum accuracy) and one mixed-effort (length-regularized), combined via checkpoint weight averaging

  3. Quantization: MoE weights quantized to MXFP4; KV-cache quantized to FP8 with calibrated scales, achieving ~2× KV-cache token capacity and faster attention kernels while preserving accuracy
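The checkpoint weight averaging in step 2 amounts to an elementwise interpolation of parameters. The sketch below uses NumPy arrays in place of real model tensors, and the 50/50 mixing ratio is an assumption:

```python
import numpy as np

def average_checkpoints(ckpt_a: dict, ckpt_b: dict, alpha: float = 0.5) -> dict:
    """Elementwise weighted average of two state dicts with matching keys:
    merged = alpha * a + (1 - alpha) * b."""
    if ckpt_a.keys() != ckpt_b.keys():
        raise ValueError("checkpoints have mismatched parameter names")
    return {k: alpha * ckpt_a[k] + (1.0 - alpha) * ckpt_b[k] for k in ckpt_a}

# Toy "policies": one tuned for max accuracy, one length-regularized.
high_effort = {"w": np.array([1.0, 2.0]), "b": np.array([0.0])}
mixed_effort = {"w": np.array([3.0, 0.0]), "b": np.array([1.0])}
merged = average_checkpoints(high_effort, mixed_effort)
print(merged["w"], merged["b"])  # → [2. 1.] [0.5]
```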

Reasoning Effort Control

The model supports three configurable reasoning effort modes (Low, Medium, High) that reliably control generation length and accuracy, enabling cost-aware deployment for different use cases.
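In the gpt-oss family, reasoning effort is typically selected through the system prompt rather than a dedicated API parameter. The sketch below builds an OpenAI-compatible chat payload; the model identifier and the `Reasoning: <level>` system-prompt convention are assumptions carried over from the parent gpt-oss models:

```python
def build_chat_request(prompt: str, effort: str = "medium") -> dict:
    """Build an OpenAI-compatible chat payload that requests a
    given reasoning effort via the system prompt."""
    if effort not in {"low", "medium", "high"}:
        raise ValueError(f"unknown reasoning effort: {effort!r}")
    return {
        "model": "nvidia/gpt-oss-puzzle-88B",  # hypothetical model id
        "messages": [
            {"role": "system", "content": f"Reasoning: {effort}"},
            {"role": "user", "content": prompt},
        ],
    }

req = build_chat_request("Prove that sqrt(2) is irrational.", effort="high")
print(req["messages"][0]["content"])  # → Reasoning: high
```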

Specs and Deployment

  • Context window: 128K tokens
  • Architecture: Mixture-of-Experts decoder-only transformer
  • Parameter count: 88B (Hugging Face Hub may display ~91B including MXFP4 quantization scales for MoE experts)
  • Supported hardware: NVIDIA H100-80GB, B200
  • Runtime: vLLM
  • Operating system: Linux
  • License: NVIDIA Open Model License
  • Use case: Production deployment, cost-efficient reasoning, long-context inference
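A deployment matching these specs might look like the following `vllm serve` invocation; the tensor-parallel degree, context length, and FP8 KV-cache flag mirror the figures above, but the model identifier and the exact recommended flags are assumptions until NVIDIA publishes a model card:

```shell
# Serve across an 8×H100 node with an FP8 KV-cache (hypothetical model id).
vllm serve nvidia/gpt-oss-puzzle-88B \
  --tensor-parallel-size 8 \
  --max-model-len 131072 \
  --kv-cache-dtype fp8
```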

The model is ready for commercial use and was trained on text data spanning 2013 to May 1, 2025, across seven datasets including competitive programming, mathematics, instruction-following, and multi-choice QA domains. No personal data was used in training.

What This Means

NVIDIA's release demonstrates the efficiency gains possible through post-training neural architecture search combined with knowledge distillation and RL optimization. By pruning parameters (73% of parent) while improving throughput via architectural heterogeneity and selective attention, gpt-oss-puzzle-88B targets the inference economics problem for reasoning models: cost per token and latency on production hardware matter more than raw capability. The three reasoning effort modes enable operators to trade accuracy for cost per request, a pattern increasingly important as reasoning models become production infrastructure. This positions NVIDIA's Puzzle framework as a differentiation vector for inference-optimized model development.

Related Articles

model release

Stability AI and NVIDIA launch Stable Diffusion 3.5 NIM for faster image generation

Stability AI and NVIDIA have launched Stable Diffusion 3.5 NIM, a microservice designed to accelerate image generation performance and simplify enterprise deployment. The collaboration packages Stable Diffusion 3.5 as an NVIDIA NIM (NVIDIA Inference Microservice) for optimized inference.

model release

Cohere releases 2B open-source speech model with 5.42% word error rate

Cohere has released Transcribe, a 2 billion parameter open-source automatic speech recognition model that the company claims tops the Hugging Face Open ASR Leaderboard with a 5.42% word error rate. The model supports 14 languages and is available under Apache 2.0 license, outperforming OpenAI's Whisper Large v3 and competing models on both accuracy and throughput metrics.

model release

Anthropic confirms leaked model represents major reasoning advance after security breach

A data breach at Anthropic exposed internal documents detailing an unreleased AI model the company describes as its most powerful to date. Anthropic confirmed it is already testing the model with select customers, claiming significant advances in reasoning, coding, and cybersecurity. The breach resulted from a misconfiguration in Anthropic's content management system that automatically made ~3,000 uploaded files publicly accessible.

model release

Gemini 3.1 Flash Live scores 95.9% on Big Bench Audio, Google's fastest voice model

Google has released Gemini 3.1 Flash Live, its new voice and audio AI model, scoring 95.9% on the Big Bench Audio Benchmark at high thinking levels—second only to Step-Audio R1.1 Realtime at 97.0%. Response times range from 0.96 seconds at minimal thinking to 2.98 seconds at high thinking, with pricing held at $0.35 per hour of audio input and $1.40 per hour of audio output.
