DeepSeek Releases V4 Models: 1M Context Window, 90% Less KV Cache Than V3.2
DeepSeek has released two new Mixture-of-Experts (MoE) language models: DeepSeek-V4-Pro with 1.6T parameters (49B activated) and DeepSeek-V4-Flash with 284B parameters (13B activated). Both models support a one-million-token context window and use a hybrid attention architecture. According to DeepSeek, at 1M-token context V4-Pro requires only 27% of the single-token inference FLOPs and 10% of the KV cache of DeepSeek-V3.2.
Architectural Improvements
The V4 series introduces a hybrid attention mechanism combining Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). According to DeepSeek, in the 1M-token context setting, DeepSeek-V4-Pro requires only 27% of single-token inference FLOPs and 10% of KV cache compared with DeepSeek-V3.2.
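To make the reported ratios concrete, the short Python sketch below converts "X% of baseline" figures into reduction factors. The 27% and 10% values come from DeepSeek's claims above; the 100 GB baseline and the helper names `reduction_factor` and `scaled_cost` are hypothetical, chosen only for illustration.

```python
# Ratios reported by DeepSeek for V4-Pro vs. V3.2 at 1M-token context.
FLOPS_FRACTION = 0.27  # single-token inference FLOPs
KV_FRACTION = 0.10     # KV cache size


def reduction_factor(fraction: float) -> float:
    """Convert 'X% of baseline' into an 'N times smaller' factor."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return 1.0 / fraction


def scaled_cost(baseline: float, fraction: float) -> float:
    """Scale a baseline cost (in any unit) by the reported fraction."""
    return baseline * fraction


if __name__ == "__main__":
    # HYPOTHETICAL baseline: 100 GB of KV cache in a V3.2-style deployment.
    baseline_kv_gb = 100.0
    kv_gb = scaled_cost(baseline_kv_gb, KV_FRACTION)
    print(f"KV cache: {kv_gb:.1f} GB ({reduction_factor(KV_FRACTION):.1f}x smaller)")
    print(f"FLOPs per token: {reduction_factor(FLOPS_FRACTION):.2f}x fewer")
```

Under these figures, a 10% KV cache corresponds to a tenfold reduction, while 27% of the FLOPs is closer to a 3.7x reduction, so the memory saving is the larger of the two.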
The models incorporate Manifold-Constrained Hyper-Connections (mHC) to strengthen residual connections and were trained using the Muon optimizer. Both models were pre-trained on more than 32 trillion diverse tokens.
Benchmark Performance
DeepSeek-V4-Pro-Max achieves 90.1% on MMLU-Pro, 90.1% on GPQA Diamond, and 93.5% pass@1 on LiveCodeBench. The model reaches a Codeforces rating of 3206 and 95.2% pass@1 on the February 2026 HMMT.
On knowledge benchmarks, the Pro model scores 90.8% on MMLU-Redux (5-shot), 93.1% on C-Eval (5-shot), and 55.2% on SimpleQA Verified (25-shot). For long-context tasks, it achieves 51.5% on LongBench-V2 in base form and 83.5 MMR on MRCR 1M in instruct form.
DeepSeek-V4-Flash-Base, despite having fewer parameters, achieves 88.7% on MMLU (5-shot), 68.3% on MMLU-Pro (5-shot), and 69.5% pass@1 on HumanEval (0-shot).
Three Reasoning Modes
Both V4 models support three reasoning effort modes:
- Non-think: Fast responses for routine tasks
- Think: Conscious logical analysis with visible reasoning tokens
- Think Max: Extended reasoning for complex problems
The performance difference is substantial. On HLE (Humanity's Last Exam), DeepSeek-V4-Pro-Max achieves 37.7% pass@1, while non-think mode scores only 7.7%. On GPQA Diamond, Max mode reaches 90.1% compared with 72.9% for non-think.
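As a quick check on how large these gaps are, the sketch below computes the absolute and relative gains from the four scores quoted above. The `SCORES` table and the `mode_gap` helper are illustrative names, not part of any DeepSeek tooling.

```python
# Scores quoted in the article for DeepSeek-V4-Pro (percent).
SCORES = {
    "HLE": {"non_think": 7.7, "think_max": 37.7},
    "GPQA Diamond": {"non_think": 72.9, "think_max": 90.1},
}


def mode_gap(non_think: float, think_max: float) -> tuple[float, float]:
    """Return (absolute gain in points, relative gain as a multiple)."""
    return think_max - non_think, think_max / non_think


if __name__ == "__main__":
    for benchmark, s in SCORES.items():
        points, ratio = mode_gap(s["non_think"], s["think_max"])
        print(f"{benchmark}: +{points:.1f} points, {ratio:.2f}x")
```

The relative gain is far larger on HLE than on GPQA Diamond, which already sits at a high non-think baseline.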
Model Availability
All models are available on Hugging Face with mixed precision formats. The standard versions use FP4 for MoE expert parameters and FP8 for most other parameters. Base models use FP8 mixed precision.
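The article does not say what share of each model's parameters sits in MoE experts, so an exact checkpoint size cannot be derived from it. The sketch below instead brackets raw weight storage between two extremes: every parameter in FP4 (0.5 bytes) and every parameter in FP8 (1 byte). The function names are hypothetical, any `expert_fraction` is a caller-supplied assumption, and quantization scale factors and other metadata are ignored.

```python
# Storage cost per parameter for the two formats named in the article.
BYTES_PER_PARAM = {"fp4": 0.5, "fp8": 1.0}

# Total parameter counts quoted in the article.
MODELS = {"DeepSeek-V4-Pro": 1.6e12, "DeepSeek-V4-Flash": 284e9}


def weight_bytes_bounds(total_params: float) -> tuple[float, float]:
    """Lower bound (all FP4) and upper bound (all FP8) on weight bytes."""
    return (total_params * BYTES_PER_PARAM["fp4"],
            total_params * BYTES_PER_PARAM["fp8"])


def mixed_weight_bytes(total_params: float, expert_fraction: float) -> float:
    """Weight bytes if `expert_fraction` of parameters are FP4 experts.

    ASSUMPTION: expert_fraction is supplied by the caller; the article
    does not report the real split.
    """
    if not 0.0 <= expert_fraction <= 1.0:
        raise ValueError("expert_fraction must be in [0, 1]")
    fp4 = total_params * expert_fraction * BYTES_PER_PARAM["fp4"]
    fp8 = total_params * (1.0 - expert_fraction) * BYTES_PER_PARAM["fp8"]
    return fp4 + fp8


if __name__ == "__main__":
    for name, params in MODELS.items():
        low, high = weight_bytes_bounds(params)
        print(f"{name}: between {low / 1e9:.0f} GB and {high / 1e9:.0f} GB")
```

Because MoE models typically hold most of their parameters in experts, the real figure is likely nearer the lower bound, but that is an inference rather than something the release states.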
DeepSeek-V4-Flash-DSpark, also available on Hugging Face, is not a new model but the same checkpoint with an additional speculative decoding module attached for faster inference.
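For readers unfamiliar with speculative decoding, here is a minimal sketch of the generic greedy draft-and-verify loop. It is not DeepSeek's DSpark module, whose design the article does not describe. The two "models" are toy functions mapping a token sequence to the next token, and all names are hypothetical.

```python
from typing import Callable, List, Sequence

NextToken = Callable[[Sequence[int]], int]


def greedy_decode(target: NextToken, prompt: Sequence[int], max_new_tokens: int) -> List[int]:
    """Reference decoder: one target-model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(target: NextToken, draft: NextToken, prompt: Sequence[int],
                       max_new_tokens: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: the draft proposes, the target verifies."""
    tokens = list(prompt)
    goal = len(tokens) + max_new_tokens
    while len(tokens) < goal:
        # 1. The cheap draft model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks each proposed token. A real system does this
        #    in one batched forward pass; here it is a plain loop.
        for i, tok in enumerate(proposal):
            expected = target(tokens + proposal[:i])
            if expected != tok:
                # First mismatch: keep the agreed prefix plus the target's
                # own token, and discard the rest of the proposal.
                tokens.extend(proposal[:i])
                tokens.append(expected)
                break
        else:
            # Every proposed token matched the target's greedy choice.
            tokens.extend(proposal)
    return tokens
```

The key property is that the output matches plain greedy decoding with the target model, so the draft affects speed but not the result. This sketch omits sampling-based acceptance and the extra "bonus" token that production implementations usually take after a fully accepted proposal.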
What This Means
DeepSeek's reported 90% reduction in KV cache relative to V3.2 targets one of the primary bottlenecks in long-context inference, since KV cache memory grows with context length. The hybrid attention mechanism behind that reduction could influence future model designs. The three reasoning modes give explicit control over the trade-off between inference cost and output quality, and the large score gaps between modes suggest that extended chain-of-thought reasoning remains essential for complex tasks despite the model's scale.