model release

NVIDIA releases Nemotron-3-Super-120B, a 120B parameter model with latent MoE architecture

TL;DR

NVIDIA has released Nemotron-3-Super-120B-A12B-BF16, a 120 billion parameter model designed for text generation and conversational tasks. The model employs a latent mixture-of-experts (MoE) architecture and supports multiple languages including English, French, Spanish, Italian, German, Japanese, and Chinese.

1 min read

NVIDIA has released Nemotron-3-Super-120B-A12B-BF16, a 120 billion parameter text generation model now available on Hugging Face. The model represents NVIDIA's latest entry in the Nemotron-3 family and uses a latent mixture-of-experts architecture optimized for inference efficiency.

Model Specifications

The Nemotron-3-Super-120B-A12B variant is distributed in BF16 (bfloat16) precision. The model uses a latent MoE design, which routes each token to a small subset of specialized expert networks rather than running all parameters for every computation; by the usual naming convention, the A12B suffix indicates roughly 12 billion active parameters per token. This approach typically reduces inference compute compared to dense 120B models of equivalent capability.
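NVIDIA has not published details of its latent routing scheme, but the general top-k expert routing that MoE layers share can be sketched in a few lines of PyTorch. Everything below (layer sizes, expert count, `top_k`) is an illustrative placeholder, not Nemotron's actual configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Toy top-k mixture-of-experts feed-forward layer. A router scores
    each token against every expert, and only the top_k experts actually
    run on that token, so most parameters stay idle per computation."""

    def __init__(self, d_model: int = 64, d_ff: int = 256,
                 n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)          # (tokens, n_experts)
        weights, expert_idx = gate.topk(self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (expert_idx == e).nonzero(as_tuple=True)
            if token_ids.numel() > 0:  # only run experts with routed tokens
                out[token_ids] += (weights[token_ids, slot].unsqueeze(-1)
                                   * expert(x[token_ids]))
        return out

# 10 tokens flow through the layer; each uses only 2 of the 8 experts.
layer = TopKMoELayer()
print(layer(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```

Because only `top_k` of the expert MLPs run per token, compute per token scales with active rather than total parameters, which is the property the A12B naming reflects.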

The model supports text generation and conversational workloads across seven languages: English, French, Spanish, Italian, German, Japanese, and Chinese. NVIDIA trained the model using its Nemotron pre-training and post-training datasets, with technical details available in two research papers (arXiv:2512.20848 and arXiv:2512.20856).

Training and Architecture

Nemotron-3-Super-120B incorporates multi-token prediction (MTP), a training objective in which the model learns to predict several future tokens at each position rather than only the next one; the extra predictions can also be used to speed up generation through speculative-style decoding. The model is compatible with the Hugging Face Transformers library and ships in the safetensors format for efficient loading.
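The release notes do not describe how Nemotron-3 implements MTP, so the following is only a sketch of the common multi-head formulation, where auxiliary heads are trained to predict tokens several steps ahead; the `multi_token_prediction_loss` helper and its shapes are hypothetical illustrations, not NVIDIA's API:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def multi_token_prediction_loss(hidden: torch.Tensor,
                                heads: nn.ModuleList,
                                targets: torch.Tensor) -> torch.Tensor:
    """Toy MTP objective: prediction head i is trained to predict the
    token (i + 1) steps ahead of each position, so one forward pass
    supervises several future tokens instead of just the next one."""
    vocab = heads[0].out_features
    losses = []
    for i, head in enumerate(heads):
        shift = i + 1                 # head i looks shift tokens ahead
        logits = head(hidden)         # (batch, seq, vocab)
        losses.append(F.cross_entropy(
            logits[:, :-shift].reshape(-1, vocab),
            targets[:, shift:].reshape(-1),
        ))
    return torch.stack(losses).mean()

# Example: hidden states for 2 sequences of 16 tokens, with 3 heads
# predicting 1, 2, and 3 tokens ahead respectively.
d_model, vocab_size = 32, 100
heads = nn.ModuleList(nn.Linear(d_model, vocab_size) for _ in range(3))
hidden = torch.randn(2, 16, d_model)
targets = torch.randint(0, vocab_size, (2, 16))
print(multi_token_prediction_loss(hidden, heads, targets))
```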
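For practical use, loading should follow the standard Transformers pattern. The repository id below is an assumption inferred from the model name, and the custom license may require accepting terms and authenticating with `huggingface-cli login` first:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repository id, inferred from the model name; the actual
# id on Hugging Face may differ, and access may be gated by the license.
model_id = "nvidia/Nemotron-3-Super-120B-A12B-BF16"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # weights are distributed in BF16
    device_map="auto",           # shard the 120B weights across GPUs
)

messages = [{"role": "user", "content": "Summarize the Nemotron-3 release."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```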

As of March 10, 2026, the model has received 70 likes and 22 downloads on Hugging Face. The release is governed by NVIDIA's custom license, and the repository is tagged as compatible with hosted inference endpoint services.

What This Means

NVIDIA's release of a 120B parameter model with latent MoE architecture signals continued focus on efficient large-model serving. The combination of MoE routing and multi-token prediction suggests the model is optimized for throughput and latency—critical factors for production deployments where serving dense 120B models can be computationally expensive. The multi-language support positions the model for international use cases, though without public benchmarks or performance comparisons, relative capability versus competing 120B models remains unclear.

Related Articles

model release

Xiaomi Releases MiMo-V2.5-Pro: 1.02T Parameter MoE Model with 1M Context Window

Xiaomi has released MiMo-V2.5-Pro, an open-source Mixture-of-Experts model with 1.02 trillion total parameters and 42 billion active parameters. The model supports up to 1 million tokens context length and claims 99.6% on GSM8K and 86.2% on MATH benchmarks.

model release

Alibaba Releases Qwen3.6 Max Preview: 1 Trillion Parameter MoE Model With 262K Context Window

Alibaba Cloud has released Qwen3.6 Max Preview, a proprietary frontier model built on sparse mixture-of-experts architecture with approximately 1 trillion total parameters. The model supports a 262,144-token context window and features integrated thinking mode for multi-turn reasoning, priced at $1.30 per million input tokens and $7.80 per million output tokens.

model release

DeepSeek V4 Pro launches with 1.6 trillion parameters, 1M token context at $0.145 per million input tokens

Chinese AI lab DeepSeek has released preview versions of DeepSeek V4 Flash and V4 Pro, mixture-of-experts models with 1 million token context windows. The V4 Pro has 1.6 trillion total parameters (49 billion active), making it the largest open-weight model available, while both models significantly undercut frontier model pricing.

model release

DeepSeek V4 Pro launches with 1.6T parameters at $1.74/M tokens, undercutting Claude Sonnet 4.6 by 42%

DeepSeek released two preview models: V4 Pro (1.6T total parameters, 49B active) and V4 Flash (284B total, 13B active), both with 1 million token context windows. V4 Pro is priced at $1.74/M input tokens and $3.48/M output—42% cheaper than Claude Sonnet 4.6—while V4 Flash at $0.14/$0.28 per million tokens undercuts all small frontier models.
