NVIDIA releases Nemotron-3-Super-120B, a 120B parameter model with latent MoE architecture

TL;DR

NVIDIA has released Nemotron-3-Super-120B-A12B-BF16, a 120 billion parameter model designed for text generation and conversational tasks. The model employs a latent mixture-of-experts (MoE) architecture and supports multiple languages including English, French, Spanish, Italian, German, Japanese, and Chinese.

NVIDIA has released Nemotron-3-Super-120B-A12B-BF16, a 120 billion parameter text generation model now available on Hugging Face. The model represents NVIDIA's latest entry in the Nemotron-3 family and uses a latent mixture-of-experts architecture optimized for inference efficiency.

Model Specifications

The Nemotron-3-Super-120B-A12B variant is distributed in BF16 (bfloat16) precision. It uses a latent MoE design: a router sends each token to a small subset of specialized expert networks instead of running every parameter for every token. The "A12B" suffix in the model name suggests roughly 12 billion active parameters per token, though the listing summarized here does not spell that figure out. Activating only a fraction of the weights typically lowers the compute needed per token at inference time compared with a dense 120B model.
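To make the routing idea concrete, here is a minimal sketch of generic top-k expert routing in plain Python. It is not NVIDIA's latent MoE implementation, whose details are not described here; the expert count, the `top_k` value, the toy scaling "experts", and all function names are illustrative assumptions.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k):
    """Pick the top_k experts for one token and renormalize their weights.

    router_logits: one score per expert (a hypothetical router output).
    Returns (expert_index, weight) pairs whose weights sum to 1.
    """
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    mass = sum(probs[i] for i in chosen)
    return [(i, probs[i] / mass) for i in chosen]


def moe_layer(token, experts, router_logits, top_k=2):
    """Combine the outputs of only the selected experts for one token."""
    output = [0.0] * len(token)
    for idx, weight in route_token(router_logits, top_k):
        expert_out = experts[idx](token)
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output


# Toy setup: 8 experts, each a simple scaling function (illustrative only).
NUM_EXPERTS = 8
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(NUM_EXPERTS)]

rng = random.Random(0)
token = [0.5, -1.0, 2.0]
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

selected = route_token(logits, top_k=2)
mixed = moe_layer(token, experts, logits, top_k=2)
print("experts used:", [i for i, _ in selected], "of", NUM_EXPERTS)
```

In this toy layer only two of the eight experts run for the token, which is the source of the per-token compute savings; every expert's weights still have to be stored.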

The model supports text generation and conversational workloads across seven listed languages: English, French, Spanish, Italian, German, Japanese, and Chinese. NVIDIA trained the model using its Nemotron pre-training and post-training datasets, with technical details available in two research papers (arXiv:2512.20848 and arXiv:2512.20856).

Training and Architecture

Nemotron-3-Super-120B incorporates multi-token prediction (MTP), a technique in which the model learns to predict several upcoming tokens at each position rather than only the next one, which can improve efficiency by letting it propose more than one token per generation step. The model is compatible with the Hugging Face Transformers library and ships in the safetensors format for efficient loading.
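As a rough illustration of what multi-token prediction changes, the sketch below builds training targets for a short sequence. With a horizon of 1 it reduces to ordinary next-token targets; with a larger horizon each position is paired with several future tokens. The function name, the `horizon` parameter, and the example sentence are hypothetical, and this is not a description of how NVIDIA implements MTP.

```python
def multi_token_targets(tokens, horizon):
    """Pair each position with the next `horizon` tokens as its targets.

    Positions that do not have a full window of future tokens are dropped.
    """
    examples = []
    for t in range(len(tokens) - horizon):
        current = tokens[t]
        targets = tokens[t + 1 : t + 1 + horizon]
        examples.append((current, targets))
    return examples


sequence = ["The", "model", "routes", "each", "token"]

next_token_pairs = multi_token_targets(sequence, horizon=1)
mtp_pairs = multi_token_targets(sequence, horizon=2)

for current, targets in mtp_pairs:
    print(f"after {current!r} predict {targets}")
```

Each training position now carries two targets instead of one, so the model receives a learning signal about tokens further ahead from the same input.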

As of March 10, 2026, the model has received 70 likes and 22 downloads on Hugging Face. The release is governed by NVIDIA's custom license, and the model is listed as compatible with inference endpoint services.

What This Means

NVIDIA's release of a 120B parameter model with latent MoE architecture signals continued focus on efficient large-model serving. The combination of MoE routing and multi-token prediction suggests the model is optimized for throughput and latency, both critical in production deployments where serving a dense 120B model can be computationally expensive. The multi-language support positions the model for international use cases. Without public benchmarks or performance comparisons, however, its capability relative to competing 120B models remains unclear.
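A back-of-the-envelope calculation shows why sparse activation matters for serving cost. The sketch below assumes 12 billion active parameters (read off the "A12B" suffix, not a confirmed figure) and uses the common rule of thumb of about 2 FLOPs per active parameter per generated token. It ignores attention, KV cache, and activation memory, so the numbers are rough estimates only.

```python
BYTES_PER_BF16 = 2  # bfloat16 stores each value in 16 bits


def weight_bytes(num_params, bytes_per_param=BYTES_PER_BF16):
    """Memory needed to hold the weights alone."""
    return num_params * bytes_per_param


def forward_flops_per_token(active_params):
    """Rule-of-thumb estimate: about 2 FLOPs per active parameter per token."""
    return 2 * active_params


TOTAL_PARAMS = 120_000_000_000
ACTIVE_PARAMS = 12_000_000_000  # assumption inferred from the "A12B" suffix

weights_gb = weight_bytes(TOTAL_PARAMS) / 1e9
dense_flops = forward_flops_per_token(TOTAL_PARAMS)
moe_flops = forward_flops_per_token(ACTIVE_PARAMS)
ratio = dense_flops / moe_flops

print(f"BF16 weights: {weights_gb:.0f} GB")
print(f"dense 120B: {dense_flops / 1e9:.0f} GFLOPs per token")
print(f"12B active: {moe_flops / 1e9:.0f} GFLOPs per token ({ratio:.0f}x less)")
```

Under these assumptions the per-token compute drops by about a factor of ten, while the weight footprint stays at roughly 240 GB because every expert must remain loaded.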
