DeepSeek Releases V4-Flash: 284B-Parameter MoE Model With 1M Token Context at 27% of V3.2's Inference FLOPs

TL;DR

DeepSeek released two Mixture-of-Experts models: V4-Flash with 284B total parameters (13B activated) and V4-Pro with 1.6T parameters (49B activated). Both models support one million token context windows and use a hybrid attention architecture that requires only 27% of the inference FLOPs compared to DeepSeek-V3.2 at 1M token context.

DeepSeek released two Mixture-of-Experts language models: DeepSeek-V4-Flash with 284B total parameters (13B activated) and DeepSeek-V4-Pro with 1.6T total parameters (49B activated). Both models support one million token context windows and use a hybrid attention architecture that requires only 27% of the inference FLOPs and 10% of the KV cache compared to DeepSeek-V3.2 in 1M-token context settings.
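The headline ratios can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above; the function and variable names are illustrative and do not come from DeepSeek.

```python
# Figures quoted in the article (parameter counts in billions).
MODELS = {
    "DeepSeek-V4-Flash": {"total_b": 284, "active_b": 13},
    "DeepSeek-V4-Pro": {"total_b": 1600, "active_b": 49},
}

# Reported cost relative to DeepSeek-V3.2 at 1M-token context.
RELATIVE_FLOPS = 0.27
RELATIVE_KV_CACHE = 0.10


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of all parameters that are activated per token."""
    return active_b / total_b


def reduction(relative_cost: float) -> float:
    """Convert 'X of baseline' into 'reduced by 1 - X'."""
    return 1.0 - relative_cost


for name, params in MODELS.items():
    share = active_fraction(**params)
    print(f"{name}: {share:.1%} of parameters active per token")

print(f"Inference FLOPs reduction vs V3.2: {reduction(RELATIVE_FLOPS):.0%}")
print(f"KV cache reduction vs V3.2: {reduction(RELATIVE_KV_CACHE):.0%}")
```

On these numbers, each token activates roughly 4.6% of V4-Flash's parameters and roughly 3.1% of V4-Pro's, and "27% of the FLOPs" is the same claim as a 73% reduction.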

Technical Architecture

The V4 series introduces three key architectural changes:

Hybrid Attention: Combines Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) to improve long-context efficiency. DeepSeek credits this design for the reported drop to 27% of the inference FLOPs and 10% of the KV cache of DeepSeek-V3.2 at million-token context lengths.

Manifold-Constrained Hyper-Connections (mHC): Strengthens conventional residual connections to enhance signal propagation stability across layers while preserving model expressivity.

Muon Optimizer: Employed for faster convergence and greater training stability during pre-training.

Both models were pre-trained on more than 32 trillion tokens. The post-trained versions use mixed precision: FP4 for MoE expert parameters and FP8 for most other parameters.
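To get a feel for what FP4 and FP8 storage implies, the sketch below estimates a rough range for raw weight size. It assumes 0.5 bytes per FP4 weight and 1 byte per FP8 weight, and it ignores quantization scale factors, any parameters kept at higher precision, activations, and the KV cache. The article does not state how parameters split between experts and the rest of the network, so `expert_share` is a hypothetical knob and only its two extremes are shown.

```python
# Storage cost per weight: 4 bits for FP4, 8 bits for FP8.
BYTES_PER_PARAM = {"FP4": 0.5, "FP8": 1.0}


def mixed_weight_gb(total_params_b: float, expert_share: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes).

    total_params_b: total parameters, in billions.
    expert_share: hypothetical fraction of parameters stored as FP4
                  (MoE experts); the remainder is counted as FP8.
    """
    if not 0.0 <= expert_share <= 1.0:
        raise ValueError("expert_share must be between 0 and 1")
    bytes_per_param = (
        expert_share * BYTES_PER_PARAM["FP4"]
        + (1.0 - expert_share) * BYTES_PER_PARAM["FP8"]
    )
    # Billions of parameters times bytes per parameter gives GB directly.
    return total_params_b * bytes_per_param


for name, total_b in [("V4-Flash", 284), ("V4-Pro", 1600)]:
    low = mixed_weight_gb(total_b, expert_share=1.0)   # everything FP4
    high = mixed_weight_gb(total_b, expert_share=0.0)  # everything FP8
    print(f"{name}: roughly {low:.0f} to {high:.0f} GB of weights")
```

Under these assumptions the weights fall somewhere between about 142 and 284 GB for V4-Flash and between about 800 and 1,600 GB for V4-Pro. Because MoE experts usually hold most of the parameters, the real figure is likely closer to the lower end, but that is an inference rather than a published number.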

Benchmark Performance

According to DeepSeek, V4-Flash-Max achieves competitive scores against frontier models:

  • MMLU-Pro: 86.2% (compared to GPT-4o's 87.5% and Gemini 2.0 Pro's 91.0%)
  • LiveCodeBench: 91.6% pass@1 (versus Gemini 2.0 Pro's 91.7%)
  • Codeforces Rating: 3052 in Max mode (GPT-4o achieves 3168)
  • GPQA Diamond: 88.1% pass@1
  • SWE Verified: 79.0% resolved

V4-Flash-Base scores 88.7% on MMLU (5-shot) and 69.5% on HumanEval (0-shot), compared to V4-Pro-Base's 90.1% and 76.8% respectively.

Reasoning Modes

Both models support three reasoning effort modes:

  • Non-think: Fast, intuitive responses for routine tasks
  • Think: Conscious logical analysis with visible reasoning process
  • Think Max: Maximum reasoning effort with special system prompts

In Max mode, V4-Flash scores 88.4% on IMOAnswerBench versus 89.8% for V4-Pro. The gap between the two models narrows significantly on complex reasoning tasks when they are given larger thinking budgets.
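The Flash-to-Pro gaps quoted in this article can be put side by side with a short script. This is only a tabulation of the numbers reported here; the dictionary layout and the `gap` helper are illustrative, and the base-model and Max-mode results come from different benchmarks, so the comparison is indicative rather than controlled.

```python
# (V4-Flash score, V4-Pro score) in percent, as quoted in the article.
SCORES = {
    "MMLU, 5-shot (Base)": (88.7, 90.1),
    "HumanEval, 0-shot (Base)": (69.5, 76.8),
    "IMOAnswerBench (Max mode)": (88.4, 89.8),
}


def gap(flash: float, pro: float) -> float:
    """V4-Pro's lead over V4-Flash in percentage points."""
    return round(pro - flash, 1)


for benchmark, (flash, pro) in SCORES.items():
    print(f"{benchmark}: Pro leads by {gap(flash, pro)} points")
```

On these figures V4-Pro leads by 1.4 points on MMLU, 7.3 points on HumanEval, and 1.4 points on IMOAnswerBench in Max mode.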

Availability

All four model variants (V4-Flash-Base, V4-Flash, V4-Pro-Base, V4-Pro) are available on Hugging Face and ModelScope. DeepSeek has not disclosed pricing per million tokens. The models use a custom chat-template encoding scheme rather than a Jinja-format template, and Python scripts for it are provided in the repository.

What This Means

DeepSeek-V4-Flash suggests that a model with fewer activated parameters (13B versus V4-Pro's 49B) can approach near-parity with a larger model on reasoning tasks when given sufficient compute budget through thinking modes. The 73% reduction in inference FLOPs at million-token context, relative to DeepSeek-V3.2, is a significant efficiency improvement for long-context applications. The performance gap with closed-source frontier models remains substantial on knowledge-intensive benchmarks (SimpleQA-Verified: 34.1% versus Gemini's 75.6%), but narrows considerably on coding and mathematical reasoning tasks.
