model release

NVIDIA releases Nemotron-3-Nano-4B, a 4B parameter model for edge AI with 262K context window

TL;DR

NVIDIA released Nemotron-3-Nano-4B-GGUF on March 16, 2026, a 4-billion parameter small language model (SLM) designed for edge deployment on devices like Jetson Thor and GeForce RTX. The model features a hybrid Mamba-2 and Transformer architecture with a 262K token context window and supports both reasoning and non-reasoning modes via system prompts.


NVIDIA released Nemotron-3-Nano-4B-GGUF on March 16, 2026, a quantized (Q4_K_M) version of its 4-billion parameter small language model designed specifically for edge deployment.

Model Specifications

The model contains 3.97 billion parameters and uses a hybrid architecture combining Mamba-2 and MLP layers with only four attention layers. It supports a context window of up to 262K tokens, enabling processing of lengthy documents on edge devices. The model was compressed from NVIDIA-Nemotron-Nano-9B-v2 using the Nemotron Elastic framework.

Nemotron-3-Nano-4B is designed as a unified model for both reasoning and non-reasoning tasks. Users can control reasoning capabilities through system prompts—disabling reasoning traces slightly reduces accuracy but lowers computational overhead, while enabling them improves solution quality on complex tasks.
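As a rough sketch of how such a toggle might look in an OpenAI-style chat request (the "/think" and "/no_think" control strings here are assumptions for illustration; NVIDIA's model card documents the canonical system-prompt syntax):

```python
# Hedged sketch: toggling reasoning mode via the system prompt.
# The "/think" and "/no_think" strings are assumptions, not confirmed
# syntax; consult the official model card before relying on them.

def build_messages(user_prompt: str, reasoning: bool) -> list[dict]:
    """Build an OpenAI-style message list with a reasoning toggle."""
    system = "/think" if reasoning else "/no_think"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_prompt},
    ]

# Reasoning on: the model emits a trace before its final answer.
msgs_on = build_messages("Prove that 17 is prime.", reasoning=True)
# Reasoning off: lower latency at a small accuracy cost.
msgs_off = build_messages("What is the capital of France?", reasoning=False)
```

The same message lists can be sent to any OpenAI-compatible endpoint serving the model.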

Training Data

The model was trained on more than 10 trillion tokens with a data cutoff of September 2024. Training data spans multiple domains including code, legal, math, science, and finance, sourced from webpages, dialogue, articles, and other written materials in English and multiple other languages (German, Spanish, French, Italian, Korean, Portuguese, Russian, Japanese, Chinese).

NVIDIA incorporated synthetic reasoning traces from several sources including DeepSeek R1, Qwen3-235B, and Nemotron 4 340B to improve reasoning capabilities. The post-training corpus combines automated, human, and synthetic labeling methods.

Benchmark Performance

In reasoning-off mode, the Q4_K_M quantized version achieved:

  • IFBench-Prompt: 46.9
  • IFBench-Instruction: 49.6
  • IFEval-Prompt: 81.5
  • IFEval-Instruction: 83.9
  • HaluEval: 62.4
  • RULER (128K context): 91.2

Quantization to Q4_K_M showed mixed results compared to the FP8 version, with improvements on IFBench tasks but slight decreases on IFEval-Instruction and Orak benchmarks.

Deployment and Use Cases

The model targets edge platforms including NVIDIA Jetson Thor, GeForce RTX, and DGX Spark. Intended applications include AI gaming NPCs (teammates and companions), local voice assistants for devices and apps, and IoT automation. NVIDIA optimized the model to run on NVIDIA GPU-accelerated systems using CUDA libraries and NeMo 25.07 runtime.

The model is ready for commercial use under the NVIDIA Nemotron Open Model License. It supports inference via llama.cpp, including an OpenAI-compatible API server.
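A minimal sketch of querying such a local server follows; the endpoint path matches llama.cpp's llama-server convention, while the port, sampling parameters, and GGUF filename mentioned in the comments are assumptions, not NVIDIA-specific instructions:

```python
# Hedged sketch: calling a local llama.cpp server over its
# OpenAI-compatible chat API using only the standard library.
import json
import urllib.request

def chat_request(prompt: str,
                 base_url: str = "http://localhost:8080") -> urllib.request.Request:
    """Build a POST request for llama-server's /v1/chat/completions endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,  # assumed value; tune per task
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = chat_request("Summarize this document.")
# To send: urllib.request.urlopen(req) -- requires a running server,
# e.g.: llama-server -m Nemotron-3-Nano-4B-Q4_K_M.gguf --port 8080
# (the .gguf filename above is assumed; check the Hugging Face repo)
```

Because llama-server speaks the OpenAI wire format, any OpenAI-compatible client library can be pointed at the same endpoint instead.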

What This Means

Nemotron-3-Nano-4B represents NVIDIA's push toward practical edge AI, addressing the gap between massive frontier models and resource-constrained devices. The 262K context window on a 4B parameter model is notable for edge deployment, though benchmark scores suggest performance trade-offs compared to larger models. The reasoning mode toggle offers developers flexibility between accuracy and speed—critical for edge inference. By leveraging synthetic data from leading reasoning models (DeepSeek R1, Qwen3) and open-sourcing the model, NVIDIA positions itself in the competitive small language model space dominated by alternatives like Mistral and Meta's Llama variants, but with explicit optimization for gaming and IoT use cases.

Related Articles

funding

Nvidia to spend $26B on open-weight AI models, filing reveals

Nvidia will invest $26 billion over the next five years to build open-weight AI models, according to a 2025 financial filing confirmed by executives. The move signals a strategic shift from chipmaker to AI frontier lab, with the company releasing Nemotron 3 Super (128B parameters) and claiming it outperforms GPT-OSS on multiple benchmarks.

model release

Nvidia releases Nemotron 3 Super: 120B MoE model with 1M token context

Nvidia has released Nemotron 3 Super, a 120-billion parameter hybrid Mamba-Transformer Mixture-of-Experts model that activates only 12 billion parameters during inference. The open-weight model features a 1-million token context window, multi-token prediction capabilities, and pricing at $0.10 per million input tokens and $0.50 per million output tokens.

product update

NVIDIA Nemotron 3 Super now available on Amazon Bedrock with 256K context window

NVIDIA Nemotron 3 Super, a hybrid Mixture of Experts model with 120B parameters and 12B active parameters, is now available as a fully managed model on Amazon Bedrock. The model supports up to 256K token context length and claims 5x higher throughput efficiency over the previous Nemotron Super and 2x higher accuracy on reasoning tasks.

model release

NVIDIA releases Nemotron-3-Super-120B, a 120B parameter model with latent MoE architecture

NVIDIA has released Nemotron-3-Super-120B-A12B-NVFP4, a 120-billion parameter text generation model featuring a latent Mixture-of-Experts (MoE) architecture. The model supports 8 languages including English, French, Spanish, Italian, German, Japanese, and Chinese, and is available on Hugging Face with 8-bit quantization support through NVIDIA's ModelOpt toolkit.
