NVIDIA releases gpt-oss-puzzle-88B, 88B-parameter reasoning model with 1.63× throughput gains
NVIDIA released gpt-oss-puzzle-88B on March 26, 2026, an 88-billion-parameter mixture-of-experts model optimized for inference efficiency on H100 hardware. Built with the Puzzle post-training neural architecture search framework, the model achieves a 1.63× throughput improvement in long-context (64K/64K) scenarios and up to 2.82× on a single H100 GPU compared to its parent, gpt-oss-120B, while matching or exceeding its accuracy across reasoning effort levels.
NVIDIA Releases gpt-oss-puzzle-88B: Inference-Optimized Reasoning Model
NVIDIA released gpt-oss-puzzle-88B on March 26, 2026, a deployment-optimized large language model derived from OpenAI's gpt-oss-120b. The model trims the parameter count to 88B (73% of the parent) while improving inference throughput for reasoning workloads, particularly on NVIDIA H100-class hardware.
Performance Metrics
Compared to its 120B parent model:
- Long-context (64K/64K) throughput: 1.63× improvement on 8×H100 node
- Short-context (4K/4K) throughput: 1.22× improvement on 8×H100 node
- Single H100 GPU throughput: Up to 2.82× improvement
- Accuracy: Matches or slightly exceeds parent across reasoning effort budgets
The model targets inference bottlenecks in KV-cache bandwidth and memory capacity rather than raw compute—the primary constraints for reasoning models on H100s.
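To make that bottleneck concrete, the sketch below estimates per-sequence KV-cache size for a 64K/64K workload. The layer count, KV-head count, head dimension, and the half-global/half-window split are illustrative assumptions rather than the published configuration; the point is only that cache capacity, not FLOPs, dominates at these context lengths.

```python
# Back-of-envelope KV-cache sizing for a long-context reasoning workload.
# All architecture numbers below are illustrative assumptions, not the
# published gpt-oss-puzzle-88B configuration.

BYTES_BF16 = 2         # unquantized baseline
BYTES_FP8 = 1          # FP8 KV-cache, as in the released model

layers = 36            # assumed decoder layer count
kv_heads = 8           # assumed grouped-query KV heads
head_dim = 64          # assumed head dimension
ctx = 64 * 1024        # 64K-token context (prompt plus generation)

def kv_bytes(global_layers: int, window_layers: int, window: int,
             dtype_bytes: int) -> int:
    """KV bytes for one sequence: global layers cache the full context,
    window layers cache at most `window` tokens. The 2x covers K and V."""
    per_token = 2 * kv_heads * head_dim * dtype_bytes
    cached = global_layers * ctx + window_layers * min(window, ctx)
    return cached * per_token

full = kv_bytes(layers, 0, 0, BYTES_BF16)                      # all-global, BF16
mixed = kv_bytes(layers // 2, layers - layers // 2, 8 * 1024, BYTES_FP8)

print(f"all-global BF16 cache : {full / 2**30:.2f} GiB per sequence")
print(f"windowed + FP8 cache  : {mixed / 2**30:.2f} GiB per sequence")
```

Even under these modest assumptions, the all-global cache runs to several GiB per sequence, which is why windowed layers and FP8 quantization translate directly into larger batches and higher throughput.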
Architecture and Optimizations
gpt-oss-puzzle-88B uses a mixture-of-experts decoder-only transformer with three key architectural optimizations:
Heterogeneous MoE Expert Pruning: Each MoE layer retains a different number of experts, selected via activation-based importance scoring. Early layers preserve more experts, while later layers are pruned more aggressively (a minimal scoring sketch follows this list).
Selective Window Attention: A subset of the global attention layers is replaced with 8K-window attention, reducing the KV-cache footprint by ~40% in long-context scenarios while preserving long-range reasoning capability.
RoPE Scaling Adjustment: The YaRN RoPE scaling factor is increased to improve stability at the 128K context length.
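The article does not disclose Puzzle's exact pruning criterion, so the following is a minimal sketch of one common activation-based scoring rule: rank each layer's experts by their mean routing probability over calibration data, then keep a per-layer budget that shrinks with depth. The scoring rule and budget schedule are assumptions for illustration only.

```python
import torch

# Hypothetical activation-based expert importance scoring for
# heterogeneous MoE pruning; not the Puzzle framework's actual criterion.

def expert_importance(router_logits: torch.Tensor) -> torch.Tensor:
    """router_logits: [tokens, num_experts], collected on calibration data.
    Scores each expert by its mean routing probability across tokens."""
    return torch.softmax(router_logits, dim=-1).mean(dim=0)

def keep_mask(router_logits: torch.Tensor, keep: int) -> torch.Tensor:
    """Boolean mask selecting the `keep` highest-scoring experts in a layer."""
    scores = expert_importance(router_logits)
    mask = torch.zeros_like(scores, dtype=torch.bool)
    mask[torch.topk(scores, keep).indices] = True
    return mask

# Heterogeneous budgets: early layers keep more experts than late layers.
num_experts, tokens = 32, 10_000
budgets = [32, 28, 24, 16]  # assumed per-layer keep counts, shrinking with depth
for layer, keep in enumerate(budgets):
    logits = torch.randn(tokens, num_experts)  # stand-in for real router activations
    mask = keep_mask(logits, keep)
    print(f"layer {layer}: keeping {int(mask.sum())}/{num_experts} experts")
```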
Training Pipeline
The model underwent three-stage optimization:
- Knowledge Distillation (84B tokens, 128K sequence length): Restored inter-block compatibility and recovered quality lost during architecture search, using the Megatron-LM framework
- Reinforcement Learning: Post-distillation RL applied across math, coding, and reasoning domains with two complementary policies, one high-effort-focused (maximum accuracy) and one mixed-effort (length-regularized), combined via checkpoint weight averaging (sketched after this list)
- Quantization: MoE weights quantized to MXFP4; KV-cache quantized to FP8 with calibrated scales, yielding ~2× KV-cache token capacity and faster attention kernels while preserving accuracy
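Of these stages, the checkpoint combination step is simple enough to sketch directly. The snippet below assumes two RL checkpoints saved as state dicts with identical keys and floating-point tensors, and uses a uniform 50/50 mix; the article does not give the actual mixing weights or file layout.

```python
import torch

# Sketch of combining two RL policies by checkpoint weight averaging.
# File names and the 50/50 mix are illustrative assumptions.

def average_checkpoints(path_a: str, path_b: str, alpha: float = 0.5) -> dict:
    """Elementwise interpolation of two state dicts with identical keys.
    Assumes floating-point tensors throughout."""
    sd_a = torch.load(path_a, map_location="cpu")
    sd_b = torch.load(path_b, map_location="cpu")
    return {k: alpha * sd_a[k] + (1.0 - alpha) * sd_b[k] for k in sd_a}

# merged = average_checkpoints("rl_high_effort.pt", "rl_mixed_effort.pt")
# torch.save(merged, "merged_policy.pt")
```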
Reasoning Effort Control
The model supports three configurable reasoning effort modes (Low, Medium, High) that reliably control generation length and accuracy, enabling cost-aware deployment for different use cases.
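In deployment, the effort level is a per-request knob. The sketch below shows what that could look like against a vLLM OpenAI-compatible endpoint; the base URL and model ID are placeholders, and whether effort is passed as a `reasoning_effort` field or as a `Reasoning: low|medium|high` line in the system prompt depends on the serving stack's chat template.

```python
from openai import OpenAI

# Hedged sketch: per-request reasoning effort against a vLLM
# OpenAI-compatible server. Endpoint, model ID, and the effort-passing
# mechanism are assumptions; check the model card for the supported API.

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="gpt-oss-puzzle-88b",     # placeholder model ID
    reasoning_effort="low",         # cheap mode for latency-sensitive traffic
    messages=[{"role": "user", "content": "What is 17 * 24?"}],
)
print(response.choices[0].message.content)
```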
Specs and Deployment
- Context window: 128K tokens
- Architecture: Mixture-of-Experts decoder-only transformer
- Parameter count: 88B (Hugging Face Hub may display ~91B including MXFP4 quantization scales for MoE experts)
- Supported hardware: NVIDIA H100-80GB, B200
- Runtime: vLLM
- Operating system: Linux
- License: NVIDIA Open Model License
- Use case: Production deployment, cost-efficient reasoning, long-context inference
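For a first local test, a minimal vLLM offline-inference sketch follows. The Hugging Face model ID, parallelism degree, and context cap are placeholders; consult the model card for the actual repository name and recommended serving flags.

```python
from vllm import LLM, SamplingParams

# Minimal vLLM offline-inference sketch. The model ID below is a
# placeholder, not a confirmed Hugging Face repository name.

llm = LLM(
    model="nvidia/gpt-oss-puzzle-88b",  # placeholder; check the Hub listing
    tensor_parallel_size=8,             # e.g. one 8xH100 node
    max_model_len=32_768,               # below the 128K max to limit KV memory
)

params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Explain KV-cache quantization in two sentences."], params)
print(outputs[0].outputs[0].text)
```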
The model is ready for commercial use and was trained on text data spanning 2013 to May 1, 2025, across seven datasets covering competitive programming, mathematics, instruction-following, and multiple-choice QA domains. No personal data was used in training.
What This Means
NVIDIA's release demonstrates the efficiency gains possible through post-training neural architecture search combined with knowledge distillation and RL optimization. By pruning the model to 73% of its parent's parameters while improving throughput via architectural heterogeneity and selective attention, gpt-oss-puzzle-88B targets the inference economics of reasoning models: cost per token and latency on production hardware matter more than raw capability. The three reasoning effort modes let operators trade accuracy against cost per request, a pattern that grows more important as reasoning models become production infrastructure. This positions NVIDIA's Puzzle framework as a differentiation vector for inference-optimized model development.
Related Articles
Tencent Releases Hy3 Preview: Mixture-of-Experts Model with 262K Context and Configurable Reasoning
Tencent has released Hy3 preview, a Mixture-of-Experts model with a 262,144 token context window priced at $0.066 per million input tokens and $0.26 per million output tokens. The model features three configurable reasoning modes—disabled, low, and high—designed for agentic workflows and production environments.
Allen Institute releases EMO, 14B parameter MoE model with selective 12.5% expert use
Allen Institute for AI released EMO, a 1B-active, 14B-total-parameter mixture-of-experts model trained on 1 trillion tokens. The model uses 8 active experts per token from a pool of 128 total experts, and can maintain near full-model performance while using just 12.5% of its experts for specific tasks.
Google releases Gemini 3.1 Flash Lite with 1M context at $0.25 per million input tokens
Google has released Gemini 3.1 Flash Lite, a high-efficiency multimodal model with a 1,048,576 token context window priced at $0.25 per million input tokens and $1.50 per million output tokens. The model supports text, image, video, audio, and PDF inputs with four thinking levels for cost-performance optimization.
Zyphra Releases ZAYA1-8B: 8.4B Parameter MoE Model with 760M Active Parameters Matches 80B+ Models on Math Benchmarks
Zyphra has released ZAYA1-8B, a mixture-of-experts language model with 760M active parameters and 8.4B total parameters. The model scores 89.1% on AIME 2026, competitive with models exceeding 100B parameters, while maintaining efficiency for on-device deployment.