DeepSeek Releases V4-Flash: 284B-Parameter MoE Model With 1M Token Context at 27% Inference Cost
DeepSeek released two Mixture-of-Experts language models: DeepSeek-V4-Flash with 284B total parameters (13B activated) and DeepSeek-V4-Pro with 1.6T total parameters (49B activated). Both models support one million token context windows and use a hybrid attention architecture that requires only 27% of the inference FLOPs and 10% of the KV cache compared to DeepSeek-V3.2 in 1M-token context settings.
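The headline ratios follow directly from the figures above. The short sketch below only restates that arithmetic: the share of parameters activated per token, and the reductions implied by "27% of the inference FLOPs" and "10% of the KV cache". The function and variable names are illustrative and do not come from DeepSeek.

```python
# Figures quoted in the article (parameter counts in billions).
MODELS = {
    "DeepSeek-V4-Flash": {"total_b": 284, "active_b": 13},
    "DeepSeek-V4-Pro": {"total_b": 1600, "active_b": 49},
}

# Cost relative to DeepSeek-V3.2 at 1M-token context, in percent.
RELATIVE_COST_PCT = {"inference FLOPs": 27, "KV cache": 10}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the total parameters that is activated per token."""
    return active_b / total_b


def reduction_pct(relative_cost_pct: float) -> float:
    """Turn 'needs X% of the baseline' into 'reduced by (100 - X)%'."""
    return 100 - relative_cost_pct


if __name__ == "__main__":
    for name, params in MODELS.items():
        share = active_fraction(**params)
        print(f"{name}: {share:.1%} of parameters active per token")
    for resource, pct in RELATIVE_COST_PCT.items():
        print(f"{resource}: {reduction_pct(pct)}% lower than DeepSeek-V3.2")
```

Both models activate under 5% of their weights for any given token, which is why the activated count, not the total, drives per-token compute.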
Technical Architecture
The V4 series introduces three key architectural changes:
- Hybrid Attention: Combines Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) to improve long-context efficiency. This is what reduces compute and KV cache requirements at million-token context lengths.
- Manifold-Constrained Hyper-Connections (mHC): Strengthens conventional residual connections to make signal propagation across layers more stable while preserving model expressivity.
- Muon Optimizer: Used during pre-training for faster convergence and greater training stability.
Both models were pre-trained on more than 32 trillion tokens and use mixed precision: FP4 for MoE expert parameters and FP8 for most other parameters in the post-trained versions.
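To get a feel for what the mixed-precision layout means for weight storage, here is a rough back-of-the-envelope sketch. It assumes 0.5 bytes per FP4 parameter and 1 byte per FP8 parameter, and it treats every non-expert parameter as FP8 even though the article says only "most". The article does not state how many parameters sit in MoE experts, so `expert_fraction` is a hypothetical input, and the estimate ignores quantization scale factors, activations, and the KV cache.

```python
# Storage cost per parameter for each format, in bytes.
BYTES_PER_PARAM = {"fp4": 0.5, "fp8": 1.0}


def weight_gb(total_params_b: float, expert_fraction: float) -> float:
    """Approximate weight storage in GB (1e9 bytes).

    total_params_b  -- total parameter count in billions (from the article)
    expert_fraction -- ASSUMED share of parameters held in MoE experts;
                       not disclosed in the article
    """
    if not 0.0 <= expert_fraction <= 1.0:
        raise ValueError("expert_fraction must be between 0 and 1")
    expert_gb = total_params_b * expert_fraction * BYTES_PER_PARAM["fp4"]
    other_gb = total_params_b * (1.0 - expert_fraction) * BYTES_PER_PARAM["fp8"]
    return expert_gb + other_gb


if __name__ == "__main__":
    # 0.95 is a placeholder guess, not a published figure.
    for name, total_b in [("V4-Flash", 284), ("V4-Pro", 1600)]:
        print(f"{name}: ~{weight_gb(total_b, 0.95):.0f} GB of weights")
```

The two extremes bound the estimate: with all parameters in FP4 experts, V4-Flash would need about 142 GB of weights, and with none it would need about 284 GB.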
Benchmark Performance
According to DeepSeek, V4-Flash-Max achieves competitive scores against frontier models:
- MMLU-Pro: 86.2% (compared to GPT-4o's 87.5% and Gemini 2.0 Pro's 91.0%)
- LiveCodeBench: 91.6% pass@1 (versus Gemini 2.0 Pro's 91.7%)
- Codeforces Rating: 3052 in Max mode (GPT-4o achieves 3168)
- GPQA Diamond: 88.1% pass@1
- SWE-bench Verified: 79.0% resolved
V4-Flash-Base scores 88.7% on MMLU (5-shot) and 69.5% on HumanEval (0-shot), compared to V4-Pro-Base's 90.1% and 76.8% respectively.
Reasoning Modes
Both models support three reasoning effort modes:
- Non-think: Fast, intuitive responses for routine tasks
- Think: Conscious logical analysis with visible reasoning process
- Think Max: Maximum reasoning effort with special system prompts
In Max mode, V4-Flash achieves 88.4% on IMOAnswerBench versus 89.8% for V4-Pro. The gap between the two models narrows significantly on complex reasoning tasks when they are given larger thinking budgets.
Availability
All four model variants (V4-Flash-Base, V4-Flash, V4-Pro-Base, V4-Pro) are available on Hugging Face and ModelScope. DeepSeek has not disclosed pricing per million tokens. The models use a custom chat template encoding system rather than a Jinja template, and the repository provides Python scripts for it.
What This Means
DeepSeek-V4-Flash demonstrates that a model with fewer activated parameters (13B versus 49B) can reach near-parity with a larger one on reasoning tasks when its thinking mode gives it a sufficient compute budget. The 73% reduction in inference FLOPs at million-token context, relative to DeepSeek-V3.2, is a significant efficiency improvement for long-context applications. The performance gap with closed-source frontier models remains substantial on knowledge-intensive benchmarks (SimpleQA-Verified: 34.1% versus Gemini's 75.6%), but narrows considerably on coding and mathematical reasoning tasks.