NVIDIA releases Nemotron-3-Super-120B, a 120B parameter model with latent MoE architecture
NVIDIA has released Nemotron-3-Super-120B-A12B-BF16, a 120 billion parameter text generation and conversational model, now available on Hugging Face. The latest entry in the Nemotron-3 family, it uses a latent mixture-of-experts (MoE) architecture optimized for inference efficiency and supports multiple languages, including English, French, Spanish, Italian, German, Japanese, and Chinese.
Model Specifications
The Nemotron-3-Super-120B-A12B variant is distributed in BF16 (bfloat16) precision format. The model uses a latent MoE design, which dynamically routes each token to a small subset of specialized expert networks rather than running all parameters for every computation; following common model-naming conventions, the "A12B" suffix suggests roughly 12 billion parameters are active per token. This approach typically reduces computational requirements during inference compared to dense 120B models of equivalent capability.
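The routing idea behind MoE can be illustrated with a minimal top-k sketch. This is a generic example, not NVIDIA's latent-MoE implementation (which is detailed in the linked papers); the expert functions and logits here are toy placeholders:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def route_token(router_logits, experts, x, top_k=2):
    """Send one token's hidden state x to its top-k experts and combine
    their outputs, weighted by renormalized router probabilities.
    Generic top-k MoE routing sketch -- not Nemotron-3's exact design."""
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    total = sum(probs[i] for i in ranked)
    # Only the top_k expert networks execute; the rest stay idle,
    # which is why active parameters are far fewer than total parameters.
    return sum(probs[i] / total * experts[i](x) for i in ranked)

# Toy experts: each just scales its input by a constant.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
out = route_token([0.1, 2.0, 0.3, 1.5], experts, 10.0, top_k=2)
```

With four experts but `top_k=2`, only half the expert networks run per token, mirroring how a 120B-parameter MoE can activate only a ~12B-parameter slice per step.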
The model supports text generation and conversational workloads across seven languages: English, French, Spanish, Italian, German, Japanese, and Chinese. NVIDIA trained the model using its Nemotron post-training and pre-training datasets, with technical details available in two research papers (arXiv:2512.20848 and arXiv:2512.20856).
Training and Architecture
Nemotron-3-Super-120B incorporates multi-token prediction (MTP), a technique in which the model is trained to predict several upcoming tokens at each position rather than only the next one, which can improve training efficiency and accelerate generation. The model is compatible with the Hugging Face Transformers library and supports the safetensors format for efficient model loading.
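The MTP objective can be sketched by how training targets are built: each position is paired with the next several tokens instead of just one. This is a simplified illustration; Nemotron-3's actual MTP head design is described in the papers cited above, not reproduced here:

```python
def mtp_targets(tokens, depth=2):
    """Build multi-token prediction targets: at each position the model
    is trained to predict the next `depth` tokens, not just the next one.
    Generic MTP sketch -- Nemotron-3's exact formulation may differ."""
    targets = []
    for t in range(len(tokens) - depth):
        # Pair the context token with its next-`depth` future tokens.
        targets.append((tokens[t], tuple(tokens[t + 1 : t + 1 + depth])))
    return targets

pairs = mtp_targets(["the", "cat", "sat", "on", "the", "mat"], depth=2)
# pairs[0] pairs "the" with its two following tokens ("cat", "sat").
```

Training against these richer targets gives the model a denser learning signal per sequence, and at inference time the extra predicted tokens can be used for speculative-style decoding to raise throughput.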
As of March 10, 2026, the model has received 70 likes and 22 downloads on Hugging Face. The release is governed by NVIDIA's custom license, and the model is listed as compatible with Hugging Face inference endpoints.
What This Means
NVIDIA's release of a 120B parameter model with latent MoE architecture signals continued focus on efficient large-model serving. The combination of MoE routing and multi-token prediction suggests the model is optimized for throughput and latency—critical factors for production deployments where serving dense 120B models can be computationally expensive. The multi-language support positions the model for international use cases, though without public benchmarks or performance comparisons, relative capability versus competing 120B models remains unclear.