model release

NVIDIA releases gpt-oss-puzzle-88B, 88B-parameter reasoning model with 1.63× throughput gains

TL;DR

On March 26, 2026, NVIDIA released gpt-oss-puzzle-88B, an 88-billion-parameter mixture-of-experts model optimized for inference efficiency on H100 hardware. Built with the Puzzle post-training neural architecture search framework, the model achieves a 1.63× throughput improvement in long-context (64K/64K) scenarios and up to a 2.82× improvement on a single H100 GPU compared to its parent gpt-oss-120b, while matching or exceeding its accuracy across reasoning effort levels.


NVIDIA Releases gpt-oss-puzzle-88B: Inference-Optimized Reasoning Model

On March 26, 2026, NVIDIA released gpt-oss-puzzle-88B, a deployment-optimized large language model derived from OpenAI's gpt-oss-120b. The model cuts the parameter count to 88B (73% of the parent) while improving inference throughput for reasoning workloads, particularly on NVIDIA H100-class hardware.

Performance Metrics

Compared to its 120B parent model:

  • Long-context (64K/64K) throughput: 1.63× improvement on 8×H100 node
  • Short-context (4K/4K) throughput: 1.22× improvement on 8×H100 node
  • Single H100 GPU throughput: Up to 2.82× improvement
  • Accuracy: Matches or slightly exceeds parent across reasoning effort budgets

The model targets inference bottlenecks in KV-cache bandwidth and memory capacity rather than raw compute; these are the primary constraints for reasoning models on H100s.

Architecture and Optimizations

gpt-oss-puzzle-88B uses a mixture-of-experts decoder-only transformer with three key architectural optimizations:

Heterogeneous MoE Expert Pruning: Each MoE layer retains a different number of experts via activation-based importance scoring. Early layers preserve more experts; later layers are aggressively pruned.
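Activation-based importance scoring of this kind can be sketched in a few lines. The scoring rule below (mean routed probability mass per expert over a calibration batch) and the per-layer budgets are illustrative assumptions, not NVIDIA's published recipe:

```python
import numpy as np

rng = np.random.default_rng(0)

def expert_importance(router_probs: np.ndarray) -> np.ndarray:
    """Score each expert by the mean probability mass the router
    assigns to it over a calibration batch (shape: tokens x experts)."""
    return router_probs.mean(axis=0)

def prune_experts(router_probs: np.ndarray, keep: int) -> np.ndarray:
    """Indices of the `keep` highest-scoring experts, in ascending order."""
    scores = expert_importance(router_probs)
    return np.sort(np.argsort(scores)[-keep:])

# Toy setting: 4 MoE layers with 32 experts each; early layers keep
# more experts, later layers are pruned harder (budgets are made up).
keep_per_layer = [28, 24, 16, 12]
kept_sets = []
for keep in keep_per_layer:
    probs = rng.dirichlet(np.ones(32), size=1024)  # stand-in router outputs
    kept_sets.append(prune_experts(probs, keep))

print([len(k) for k in kept_sets])  # → [28, 24, 16, 12]
```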

Selective Window Attention: A subset of global attention layers is replaced with 8K-window attention, reducing KV-cache footprint by ~40% in long-context scenarios while preserving long-range reasoning capability.
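The arithmetic behind that footprint reduction is straightforward. The sketch below assumes a 36-layer model with an even split between global and 8K-window layers; both are guesses, since the announcement only gives the ~40% figure:

```python
def kv_cache_tokens(global_layers: int, window_layers: int,
                    seq_len: int, window: int) -> int:
    """Total cached tokens across layers: global layers cache the full
    sequence, windowed layers cache at most `window` tokens."""
    return global_layers * seq_len + window_layers * min(seq_len, window)

# All-global baseline vs. a half-windowed stack at 64K context.
full_global = kv_cache_tokens(36, 0, 64_000, 8_000)
half_window = kv_cache_tokens(18, 18, 64_000, 8_000)
print(f"{1 - half_window / full_global:.0%}")  # → 44%
```

Under these assumed layer counts the reduction lands at 44%, in the neighborhood of the quoted ~40%.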

RoPE Scaling Adjustment: The YaRN RoPE scaling factor is increased to improve stability at 128K context length.

Training Pipeline

The model underwent three-stage optimization:

  1. Knowledge Distillation (84B tokens, 128K sequence length): Restored inter-block compatibility and recovered quality lost during architecture search, using the Megatron-LM framework

  2. Reinforcement Learning: Post-distillation RL phase applied across math, coding, and reasoning domains with two complementary policies, one high-effort-focused (maximum accuracy) and one mixed-effort (length-regularized), combined via checkpoint weight averaging

  3. Quantization: MoE weights quantized to MXFP4; KV-cache quantized to FP8 with calibrated scales, achieving ~2× KV-cache token capacity and faster attention kernels while preserving accuracy
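The checkpoint weight averaging in step 2 amounts to an elementwise interpolation of parameters. The sketch below uses NumPy arrays in place of real model tensors, and the 50/50 mixing ratio is an assumption:

```python
import numpy as np

def average_checkpoints(ckpt_a: dict, ckpt_b: dict, alpha: float = 0.5) -> dict:
    """Elementwise weighted average of two state dicts with matching keys:
    merged = alpha * a + (1 - alpha) * b."""
    if ckpt_a.keys() != ckpt_b.keys():
        raise ValueError("checkpoints have mismatched parameter names")
    return {k: alpha * ckpt_a[k] + (1.0 - alpha) * ckpt_b[k] for k in ckpt_a}

# Toy "policies": one tuned for max accuracy, one length-regularized.
high_effort = {"w": np.array([1.0, 2.0]), "b": np.array([0.0])}
mixed_effort = {"w": np.array([3.0, 0.0]), "b": np.array([1.0])}
merged = average_checkpoints(high_effort, mixed_effort)
print(merged["w"], merged["b"])  # → [2. 1.] [0.5]
```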

Reasoning Effort Control

The model supports three configurable reasoning effort modes (Low, Medium, High) that reliably control generation length and accuracy, enabling cost-aware deployment for different use cases.
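In the gpt-oss family, reasoning effort is typically selected through the system prompt rather than a dedicated API parameter. The sketch below builds an OpenAI-compatible chat payload; the model identifier and the `Reasoning: <level>` system-prompt convention are assumptions carried over from the parent gpt-oss models:

```python
def build_chat_request(prompt: str, effort: str = "medium") -> dict:
    """Build an OpenAI-compatible chat payload that requests a
    given reasoning effort via the system prompt."""
    if effort not in {"low", "medium", "high"}:
        raise ValueError(f"unknown reasoning effort: {effort!r}")
    return {
        "model": "nvidia/gpt-oss-puzzle-88B",  # hypothetical model id
        "messages": [
            {"role": "system", "content": f"Reasoning: {effort}"},
            {"role": "user", "content": prompt},
        ],
    }

req = build_chat_request("Prove that sqrt(2) is irrational.", effort="high")
print(req["messages"][0]["content"])  # → Reasoning: high
```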

Specs and Deployment

  • Context window: 128K tokens
  • Architecture: Mixture-of-Experts decoder-only transformer
  • Parameter count: 88B (Hugging Face Hub may display ~91B including MXFP4 quantization scales for MoE experts)
  • Supported hardware: NVIDIA H100-80GB, B200
  • Runtime: vLLM
  • Operating system: Linux
  • License: NVIDIA Open Model License
  • Use case: Production deployment, cost-efficient reasoning, long-context inference
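A deployment matching these specs might look like the following `vllm serve` invocation; the tensor-parallel degree, context length, and FP8 KV-cache flag mirror the figures above, but the model identifier and the exact recommended flags are assumptions until NVIDIA publishes a model card:

```shell
# Serve across an 8×H100 node with an FP8 KV-cache (hypothetical model id).
vllm serve nvidia/gpt-oss-puzzle-88B \
  --tensor-parallel-size 8 \
  --max-model-len 131072 \
  --kv-cache-dtype fp8
```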

The model is ready for commercial use and was trained on text data spanning 2013 to May 1, 2025, across seven datasets including competitive programming, mathematics, instruction-following, and multi-choice QA domains. No personal data was used in training.

What This Means

NVIDIA's release demonstrates the efficiency gains possible through post-training neural architecture search combined with knowledge distillation and RL optimization. By pruning parameters (73% of parent) while improving throughput via architectural heterogeneity and selective attention, gpt-oss-puzzle-88B targets the inference economics problem for reasoning models: cost per token and latency on production hardware matter more than raw capability. The three reasoning effort modes enable operators to trade accuracy for cost per request, a pattern increasingly important as reasoning models become production infrastructure. This positions NVIDIA's Puzzle framework as a differentiation vector for inference-optimized model development.

Related Articles

model release

Stability AI and NVIDIA launch Stable Diffusion 3.5 NIM for faster image generation

Stability AI and NVIDIA have launched Stable Diffusion 3.5 NIM, a microservice designed to accelerate image generation performance and simplify enterprise deployment. The collaboration packages Stable Diffusion 3.5 as an NVIDIA NIM (NVIDIA Inference Microservice) for optimized inference.

model release

Cohere releases 2B open-source speech model with 5.42% word error rate

Cohere has released Transcribe, a 2 billion parameter open-source automatic speech recognition model that the company claims tops the Hugging Face Open ASR Leaderboard with a 5.42% word error rate. The model supports 14 languages and is available under Apache 2.0 license, outperforming OpenAI's Whisper Large v3 and competing models on both accuracy and throughput metrics.

model release

Anthropic confirms leaked model represents major reasoning advance after security breach

A data breach at Anthropic exposed internal documents detailing an unreleased AI model the company describes as its most powerful to date. Anthropic confirmed it is already testing the model with select customers, claiming significant advances in reasoning, coding, and cybersecurity. The breach resulted from a misconfiguration in Anthropic's content management system that automatically made ~3,000 uploaded files publicly accessible.

model release

Gemini 3.1 Flash Live scores 95.9% on Big Bench Audio, Google's fastest voice model

Google has released Gemini 3.1 Flash Live, its new voice and audio AI model, scoring 95.9% on the Big Bench Audio Benchmark at high thinking levels—second only to Step-Audio R1.1 Realtime at 97.0%. Response times range from 0.96 seconds at minimal thinking to 2.98 seconds at high thinking, with pricing held at $0.35 per hour of audio input and $1.40 per hour of audio output.
