model release

DeepSeek Releases V4-Flash: 284B-Parameter MoE Model With 1M Token Context at 27% Inference Cost

TL;DR

DeepSeek released two Mixture-of-Experts models: V4-Flash with 284B total parameters (13B activated) and V4-Pro with 1.6T total parameters (49B activated). Both support one-million-token context windows and use a hybrid attention architecture that requires only 27% of DeepSeek-V3.2's inference FLOPs at 1M-token context.

2 min read

DeepSeek V4 Flash — Quick Specs

Context window: 1,048,576 tokens
Input: $0.14 / 1M tokens
Output: $0.28 / 1M tokens


DeepSeek released two Mixture-of-Experts language models: DeepSeek-V4-Flash with 284B total parameters (13B activated) and DeepSeek-V4-Pro with 1.6T total parameters (49B activated). Both models support one million token context windows and use a hybrid attention architecture that requires only 27% of the inference FLOPs and 10% of the KV cache compared to DeepSeek-V3.2 in 1M-token context settings.
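The gap between total and activated parameters (284B vs. 13B for V4-Flash) comes from top-k expert routing: for each token, a router scores all experts but only the top few actually run. DeepSeek has not published the V4 router, so the sketch below is purely illustrative (the function name is hypothetical, and scalar expert outputs stand in for full feed-forward layers):

```python
import math

def route_top_k(router_logits, expert_outputs, k=2):
    """Toy top-k MoE gating: softmax the router logits, keep the k
    highest-scoring experts, renormalize their gate weights, and mix
    only those experts' outputs. The remaining experts never run,
    which is why activated parameters are a small fraction of total."""
    m = max(router_logits)                       # shift for numerical stability
    exps = [math.exp(x - m) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # indices of the k highest-probability experts
    top = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    z = sum(probs[i] for i in top)               # renormalize over selected experts
    return sum(probs[i] / z * expert_outputs[i] for i in top)
```

With k = 2 out of, say, 64 experts per layer, compute per token scales with the two selected experts rather than the full expert pool.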

Technical Architecture

The V4 series introduces three key architectural changes:

Hybrid Attention: Combines Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) to improve long-context efficiency. This enables the dramatic reduction in computational requirements at million-token context lengths.

Manifold-Constrained Hyper-Connections (mHC): Strengthens conventional residual connections to enhance signal propagation stability across layers while preserving model expressivity.

Muon Optimizer: Employed for faster convergence and greater training stability during pre-training.

Both models were pre-trained on more than 32 trillion tokens and use mixed precision: FP4 for MoE expert parameters and FP8 for most other parameters in the post-trained versions.
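The mixed-precision choice has a direct effect on weight memory. A back-of-the-envelope estimate for V4-Flash, assuming FP4 (0.5 bytes/parameter) for expert weights and FP8 (1 byte/parameter) for the rest; note the 95%/5% expert-vs-shared split is a guess, not a published figure:

```python
# Rough weight-memory estimate for V4-Flash's 284B parameters.
total_params = 284e9
expert_frac = 0.95  # assumed expert share of parameters; not published

weight_bytes = (total_params * expert_frac * 0.5        # FP4 experts
                + total_params * (1 - expert_frac) * 1.0)  # FP8 everything else
print(f"~{weight_bytes / 2**30:.0f} GiB of weights")
```

Under these assumptions the weights fit in roughly 140 GiB, versus ~265 GiB if everything were stored in FP8.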

Benchmark Performance

According to DeepSeek, V4-Flash in Max mode achieves competitive scores against frontier models:

  • MMLU-Pro: 86.2% (compared to GPT-4o's 87.5% and Gemini 2.0 Pro's 91.0%)
  • LiveCodeBench: 91.6% pass@1 (versus Gemini 2.0 Pro's 91.7%)
  • Codeforces Rating: 3052 in Max mode (GPT-4o achieves 3168)
  • GPQA Diamond: 88.1% pass@1
  • SWE Verified: 79.0% resolved

V4-Flash-Base scores 88.7% on MMLU (5-shot) and 69.5% on HumanEval (0-shot), compared to V4-Pro-Base's 90.1% and 76.8% respectively.

Reasoning Modes

Both models support three reasoning effort modes:

  • Non-think: Fast, intuitive responses for routine tasks
  • Think: Conscious logical analysis with visible reasoning process
  • Think Max: Maximum reasoning effort with special system prompts

In Max mode, V4-Flash achieves 88.4% on IMOAnswerBench versus 89.8% for V4-Pro. The gap narrows significantly on complex reasoning tasks when given larger thinking budgets.

Availability

All four model variants (V4-Flash-Base, V4-Flash, V4-Pro-Base, V4-Pro) are available on Hugging Face and ModelScope. API pricing is $0.14 per million input tokens and $0.28 per million output tokens for V4-Flash, and $1.74/$3.48 for V4-Pro. The models use a custom chat-template encoding system rather than the Jinja format, with Python scripts provided in the repository.

What This Means

DeepSeek-V4-Flash demonstrates that smaller activated parameter counts (13B versus 49B) can achieve near-parity with larger models on reasoning tasks when given sufficient compute budget through thinking modes. The 73% reduction in inference FLOPs at million-token context represents a significant efficiency improvement for long-context applications. The performance gap with closed-source frontier models remains substantial on knowledge-intensive benchmarks (SimpleQA-Verified: 34.1% versus Gemini's 75.6%), but narrows considerably on coding and mathematical reasoning tasks.
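The efficiency and pricing figures above are easy to sanity-check. The 73% reduction follows directly from the 27% FLOPs fraction, and the listed V4-Flash rates put a price on a full-context request (the 4K-token response length below is an arbitrary assumption for illustration):

```python
# Back-of-the-envelope numbers from the figures in this article.
flops_fraction = 0.27              # V4 FLOPs relative to V3.2 at 1M context
reduction = 1 - flops_fraction     # the quoted "73% reduction"

# Cost of one full-context V4-Flash request at the listed rates,
# assuming a 1,048,576-token prompt and a 4,096-token response.
input_tokens, output_tokens = 1_048_576, 4_096
cost = input_tokens / 1e6 * 0.14 + output_tokens / 1e6 * 0.28
print(f"{reduction:.0%} FLOPs reduction; ${cost:.4f} per request")
```

At these rates, a maximal-context request costs under 15 cents, which is the practical upshot of the efficiency claims for long-context workloads.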

Related Articles

model release

DeepSeek Releases V4-Pro: 1.6T Parameter MoE Model with 1M Token Context

DeepSeek released two new Mixture-of-Experts models: DeepSeek-V4-Pro with 1.6 trillion parameters (49B activated) and DeepSeek-V4-Flash with 284B parameters (13B activated), both supporting one million token context length. The models achieve 27% of inference FLOPs and 10% of KV cache compared to DeepSeek-V3.2 at 1M context through a hybrid attention architecture combining Compressed Sparse Attention and Heavily Compressed Attention.

model release

DeepSeek V4 Flash Released: 284B Parameter MoE Model with 1M Context Window at $0.14 per Million Tokens

DeepSeek has released V4 Flash, a Mixture-of-Experts model with 284B total parameters and 13B activated parameters per request. The model supports a 1,048,576-token context window and is priced at $0.14 per million input tokens and $0.28 per million output tokens.

model release

DeepSeek Releases V4 Pro: 1.6T Parameter MoE Model with 1M Token Context at $1.74/M Input Tokens

DeepSeek has released V4 Pro, a Mixture-of-Experts model with 1.6 trillion total parameters and 49 billion activated parameters. The model supports a 1-million-token context window and costs $1.74 per million input tokens and $3.48 per million output tokens.

model release

Tencent Releases Hy3-Preview: 295B-Parameter MoE Model with 21B Active Parameters

Tencent has released Hy3-preview, a 295-billion-parameter Mixture-of-Experts model with 21 billion active parameters and a 256K context window. The model scores 76.28% on MATH and 34.86% on LiveCodeBench-v6, with particularly strong performance on coding agent tasks.
