
DeepSeek Releases V4-Pro: 1.6T Parameter MoE Model with 1M Token Context

TL;DR

DeepSeek released two new Mixture-of-Experts models: DeepSeek-V4-Pro with 1.6 trillion parameters (49B activated) and DeepSeek-V4-Flash with 284B parameters (13B activated), both supporting a one-million-token context length. At 1M context, DeepSeek-V4-Pro needs only 27% of the single-token inference FLOPs and 10% of the KV cache of DeepSeek-V3.2, thanks to a hybrid attention architecture combining Compressed Sparse Attention and Heavily Compressed Attention.

April 24, 2026 · 3:21 AM · 2 min read

DeepSeek V4 Pro — Quick Specs

Context window: 1000K tokens

Input: $0.0036/1M tokens

Output: $0.87/1M tokens
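
To make the listed prices concrete, here is a minimal sketch that estimates the cost of a single request. It assumes the per-million-token prices quoted above apply linearly with no caching discounts or minimum charges; the function name and the example token counts are illustrative, not part of any DeepSeek API.

```python
# Prices as listed in the Quick Specs above (USD per 1M tokens).
INPUT_USD_PER_M = 0.0036
OUTPUT_USD_PER_M = 0.87


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate request cost, assuming simple linear per-token pricing."""
    total = input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical request: a full 1M-token prompt and a 2,000-token reply.
    cost = request_cost_usd(1_000_000, 2_000)
    print(f"Estimated cost: ${cost:.5f}")
```

At these rates the output tokens account for a large share of the bill even when the prompt fills the whole context window.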



DeepSeek released two new Mixture-of-Experts language models with one million token context windows: DeepSeek-V4-Pro (1.6 trillion total parameters, 49 billion activated) and DeepSeek-V4-Flash (284 billion total parameters, 13 billion activated).
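
The gap between total and activated parameters is what defines a Mixture-of-Experts model: only a small share of the weights is used for each token. The short sketch below computes that share from the figures reported above; the dictionary layout and function name are illustrative choices.

```python
# Total and activated parameter counts as reported in the article.
MODELS = {
    "DeepSeek-V4-Pro": {"total": 1.6e12, "activated": 49e9},
    "DeepSeek-V4-Flash": {"total": 284e9, "activated": 13e9},
}


def activated_fraction(total: float, activated: float) -> float:
    """Share of a model's parameters that is used for each token."""
    return activated / total


if __name__ == "__main__":
    for name, params in MODELS.items():
        frac = activated_fraction(params["total"], params["activated"])
        print(f"{name}: {frac:.2%} of parameters activated per token")
```

By this measure V4-Pro activates roughly 3% of its parameters per token and V4-Flash roughly 4.6%.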

Technical Architecture

The V4 series introduces three key changes to architecture and training:

Hybrid Attention: The models use a combination of Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). At 1M token context, DeepSeek-V4-Pro requires only 27% of single-token inference FLOPs and 10% of KV cache compared to DeepSeek-V3.2.

Manifold-Constrained Hyper-Connections (mHC): This enhancement to residual connections improves signal propagation stability across layers while maintaining model expressivity.

Muon Optimizer: The training process employs the Muon optimizer for faster convergence and improved stability.
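
The efficiency figures for hybrid attention are stated as a share of the DeepSeek-V3.2 baseline, which can be restated as a reduction factor or a percentage saved. The sketch below does only that arithmetic on the two reported ratios; it says nothing about how CSA or HCA work internally.

```python
def reduction_factor(relative_cost: float) -> float:
    """How many times smaller the new cost is than the baseline."""
    return 1.0 / relative_cost


def percent_saved(relative_cost: float) -> float:
    """Percentage of the baseline cost that is eliminated."""
    return (1.0 - relative_cost) * 100.0


if __name__ == "__main__":
    # Ratios reported for DeepSeek-V4-Pro vs. DeepSeek-V3.2 at 1M context.
    reported = {"single-token inference FLOPs": 0.27, "KV cache": 0.10}
    for metric, ratio in reported.items():
        print(
            f"{metric}: {reduction_factor(ratio):.1f}x smaller, "
            f"{percent_saved(ratio):.0f}% saved"
        )
```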

Both models were pre-trained on more than 32 trillion tokens. Post-training used a two-stage approach: independent domain-specific expert cultivation through supervised fine-tuning and reinforcement learning with GRPO, followed by on-policy distillation to consolidate capabilities.
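
The article does not describe how GRPO was configured for V4. As background, GRPO as commonly described scores each sampled response relative to the other responses drawn for the same prompt, rather than using a learned value model. The sketch below shows that group-relative advantage step only; the reward values, the use of population standard deviation, and the epsilon term are assumptions for illustration.

```python
import statistics


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against the mean and spread of its own group.

    Assumption: population standard deviation plus a small epsilon to avoid
    division by zero; real implementations may differ on both points.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical rewards for four responses sampled from one prompt.
    rewards = [1.0, 0.0, 0.0, 1.0]
    for reward, adv in zip(rewards, group_relative_advantages(rewards)):
        print(f"reward={reward:.1f} advantage={adv:+.3f}")
```

Responses that beat their group's average get a positive advantage and the rest get a negative one, so the signal depends on relative rather than absolute reward.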

Benchmark Performance

DeepSeek-V4-Pro-Base scores 90.1 on MMLU (5-shot), 90.8 on MMLU-Redux, 73.5 on MMLU-Pro, and 76.8 on HumanEval (0-shot). On long-context tasks, it achieves 51.5 on LongBench-V2.

The instruct version, DeepSeek-V4-Pro-Max (maximum reasoning mode), achieves 87.5 on MMLU-Pro, 93.5 on LiveCodeBench, and a 3206 rating on Codeforces. According to DeepSeek, it matches or exceeds Claude Opus 4.6 Max and GPT-5.4 xHigh on most coding benchmarks while trailing on some agentic tasks.

DeepSeek-V4-Flash-Max, despite its smaller parameter count, approaches the Pro version's reasoning performance when given extended thinking time, scoring 3052 on Codeforces and 88.4 on IMOAnswerBench.

Reasoning Modes

The instruct models support three reasoning effort modes:

Non-think: Fast responses without explicit reasoning chains
Think: Outputs reasoning within <think> tags before providing answers
Think Max: Extended reasoning with special system prompts for maximum capability
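
Since Think mode wraps its reasoning in <think> tags, client code usually needs to separate the reasoning from the final answer. The sketch below is one simple way to do that, assuming a single well-formed <think>...</think> block at the start of the output; the function name is hypothetical and this is not DeepSeek's own parsing code.

```python
import re

# Matches the first <think>...</think> block, including newlines inside it.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty if no tags are found."""
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer


if __name__ == "__main__":
    sample = "<think>2 + 2 equals 4.</think>The answer is 4."
    reasoning, answer = split_reasoning(sample)
    print("Reasoning:", reasoning)
    print("Answer:", answer)
```

Non-think output passes through unchanged, so the same function can handle all three modes.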

Performance scales significantly with reasoning budget: V4-Pro improves from 7.7 to 37.7 on the HLE benchmark when moving from Non-think to Think Max mode.

Availability

All models are available on HuggingFace and ModelScope. The release uses mixed precision: FP4 for MoE expert parameters and FP8 for most other parameters. DeepSeek provides custom encoding scripts instead of standard Jinja chat templates, with examples in the model repository.
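
The mixed-precision format gives a rough sense of checkpoint size. The article does not say how parameters split between FP4 experts and FP8 layers, so the sketch below only computes a range: every parameter at 4 bits on the low end, every parameter at 8 bits on the high end. It ignores quantization scales, metadata, and any parameters kept at higher precision, so treat the numbers as an estimate rather than a measured file size.

```python
BITS_PER_BYTE = 8


def weight_bytes(num_params: float, bits_per_param: int) -> float:
    """Raw storage for the weights alone at a uniform bit width."""
    return num_params * bits_per_param / BITS_PER_BYTE


def storage_range_gb(num_params: float) -> tuple[float, float]:
    """Rough (all-FP4, all-FP8) storage range in gigabytes (1 GB = 1e9 bytes)."""
    low = weight_bytes(num_params, 4) / 1e9
    high = weight_bytes(num_params, 8) / 1e9
    return low, high


if __name__ == "__main__":
    # Total parameter counts as reported in the article.
    for name, total in {"DeepSeek-V4-Pro": 1.6e12, "DeepSeek-V4-Flash": 284e9}.items():
        low, high = storage_range_gb(total)
        print(f"{name}: roughly {low:.0f} to {high:.0f} GB of weights")
```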

What This Means

DeepSeek-V4-Pro represents a significant efficiency gain for long-context processing: at 1M tokens it cuts single-token inference FLOPs by 73% and KV cache by 90% relative to DeepSeek-V3.2. The 3206 Codeforces rating places it among the strongest coding models available, though its performance on complex agentic workflows still trails leading closed-source models. The dual-model release strategy, which pairs a large Pro version with a smaller Flash version of similar reasoning capability, provides deployment flexibility based on latency and resource constraints.

Source: huggingface.co


