DeepSeek Releases V4-Pro: 1.6T Parameter MoE Model with 1M Token Context
DeepSeek released two new Mixture-of-Experts language models with one million token context windows: DeepSeek-V4-Pro (1.6 trillion total parameters, 49 billion activated) and DeepSeek-V4-Flash (284 billion total parameters, 13 billion activated).
Technical Architecture
The V4 series introduces three key architectural changes:
Hybrid Attention: The models use a combination of Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). At 1M token context, DeepSeek-V4-Pro requires only 27% of single-token inference FLOPs and 10% of KV cache compared to DeepSeek-V3.2.
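To make those ratios concrete, here is a back-of-envelope sketch that applies the reported 27% FLOPs and 10% KV-cache figures to a measured V3.2 baseline. Only the two ratios come from the release; the baseline numbers in the example are illustrative placeholders, not published measurements.

```python
def v4_cost_estimate(v32_flops_per_token, v32_kv_cache_bytes):
    """Apply the reported V4-Pro efficiency ratios at 1M-token context.

    Only the 0.27 (FLOPs) and 0.10 (KV cache) ratios come from the
    release; the baselines are whatever you measure for DeepSeek-V3.2.
    """
    return {
        "flops_per_token": 0.27 * v32_flops_per_token,
        "kv_cache_bytes": 0.10 * v32_kv_cache_bytes,
    }

# Illustrative baseline only: suppose a V3.2 deployment measures
# 2e12 FLOPs per decoded token and 60 GB of KV cache at 1M context.
est = v4_cost_estimate(2e12, 60 * 1024**3)
```

Under those assumed baselines, the same context window would need roughly a quarter of the per-token compute and a tenth of the cache memory.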
Manifold-Constrained Hyper-Connections (mHC): This enhancement to residual connections improves signal propagation stability across layers while maintaining model expressivity.
Muon Optimizer: The training process employs the Muon optimizer for faster convergence and improved stability.
Both models were pre-trained on more than 32 trillion tokens. Post-training used a two-stage approach: independent domain-specific expert cultivation through supervised fine-tuning and reinforcement learning with GRPO, followed by on-policy distillation to consolidate capabilities.
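GRPO's central trick is replacing a learned value network with group-relative advantages: several responses are sampled per prompt, and each is scored against the statistics of its own group. A minimal sketch of that advantage computation (not DeepSeek's implementation, and omitting the policy-gradient and KL terms):

```python
from statistics import mean, pstdev

def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantages as in GRPO: normalize each sampled
    response's reward by the mean/std of its own group, so no critic
    (value network) is needed."""
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]

# Four sampled answers to one prompt, scored by a reward model:
advs = grpo_advantages([1.0, 0.0, 0.5, 0.5])  # best answer gets the
                                              # largest positive advantage
```

The normalization means advantages within a group always sum to (approximately) zero, so the update pushes probability mass from below-average responses toward above-average ones.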
Benchmark Performance
DeepSeek-V4-Pro-Base scores 90.1 on MMLU (5-shot), 90.8 on MMLU-Redux, 73.5 on MMLU-Pro, and 76.8 on HumanEval (0-shot). On long-context tasks, it achieves 51.5 on LongBench-V2.
The instruct version, DeepSeek-V4-Pro-Max (maximum reasoning mode), achieves 87.5 on MMLU-Pro, 93.5 on LiveCodeBench, and a 3206 rating on Codeforces. According to DeepSeek, it matches or exceeds Claude Opus 4.6 Max and GPT-5.4 xHigh on most coding benchmarks while trailing on some agentic tasks.
DeepSeek-V4-Flash-Max, despite its smaller parameter count, achieves reasoning performance comparable to the Pro version when given extended thinking time, scoring 3052 on Codeforces and 88.4 on IMOAnswerBench.
Reasoning Modes
The instruct models support three reasoning effort modes:
- Non-think: Fast responses without explicit reasoning chains
- Think: Outputs reasoning within <think> tags before providing answers
- Think Max: Extended reasoning with special system prompts for maximum capability
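Downstream code consuming Think-mode output typically has to separate the reasoning trace from the final answer. A minimal parser, assuming the completion wraps reasoning in <think>...</think> as described above (the exact output layout may differ from this sketch):

```python
import re

def split_think(output: str):
    """Split a Think-mode completion into (reasoning, answer).

    Assumes reasoning is wrapped in <think>...</think> at the start of
    the output; a completion without the tag (e.g. Non-think mode) is
    returned with an empty reasoning trace.
    """
    m = re.match(r"\s*<think>(.*?)</think>\s*(.*)", output, re.DOTALL)
    if m is None:
        return "", output.strip()
    return m.group(1).strip(), m.group(2).strip()

reasoning, answer = split_think("<think>2+2 is 4</think>The answer is 4.")
```

The non-greedy `(.*?)` stops at the first closing tag, and `re.DOTALL` lets multi-line reasoning traces match.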
Performance scales significantly with reasoning budget: V4-Pro improves from 7.7 to 37.7 on the HLE benchmark when moving from Non-think to Think Max mode.
Availability
All models are available on HuggingFace and ModelScope. The release uses mixed precision: FP4 for MoE expert parameters and FP8 for most other parameters. DeepSeek provides custom encoding scripts instead of standard Jinja chat templates, with examples in the model repository.
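The mixed-precision format matters mostly for checkpoint and serving footprint. A rough size estimate under the stated FP4/FP8 split; note that the fraction of parameters living in MoE experts is an assumption here, since the release does not publish that breakdown:

```python
def checkpoint_size_gb(total_params, expert_fraction):
    """Rough checkpoint size for the stated mixed-precision release:
    FP4 (0.5 bytes/param) for MoE expert weights, FP8 (1 byte/param)
    for everything else. The expert_fraction split is an assumption,
    not a published figure."""
    expert_bytes = total_params * expert_fraction * 0.5        # FP4
    other_bytes = total_params * (1 - expert_fraction) * 1.0   # FP8
    return (expert_bytes + other_bytes) / 1024**3

# If ~95% of V4-Pro's 1.6T parameters sit in MoE experts (illustrative):
size = checkpoint_size_gb(1.6e12, 0.95)  # on the order of 800 GB
```

Even with 49B activated parameters per token, the full expert set must be resident somewhere, so total checkpoint size, not activated size, drives storage and multi-node memory planning.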
What This Means
DeepSeek-V4-Pro represents a significant efficiency gain for long-context processing, cutting single-token inference FLOPs by 73% at 1M-token context while expanding the context window to 1M tokens. The 3206 Codeforces rating places it among the strongest coding models available, though its performance on complex agentic workflows still trails leading closed-source models. The dual-model release strategy—offering both a large Pro version and a smaller Flash version with similar reasoning capabilities—provides deployment flexibility based on latency and resource constraints.