
DeepSeek Releases V4 Models: 1M Context Window, 90% Less KV Cache Than V3.2

TL;DR

DeepSeek has released two new MoE models: DeepSeek-V4-Pro with 1.6T parameters (49B activated) and DeepSeek-V4-Flash with 284B parameters (13B activated). Both models support a one-million-token context window and use a hybrid attention architecture. According to DeepSeek, at 1M-token context V4-Pro requires only 27% of the single-token inference FLOPs and 10% of the KV cache of DeepSeek-V3.2.

DeepSeek has released two new Mixture-of-Experts language models: DeepSeek-V4-Pro with 1.6 trillion parameters (49B activated) and DeepSeek-V4-Flash with 284B parameters (13B activated). Both models support a context length of one million tokens.
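As a rough illustration of what "activated" means for these Mixture-of-Experts models, the sketch below computes the share of parameters used per token from the figures quoted above. The dictionary layout and the `active_fraction` helper are illustrative names, not part of any DeepSeek tooling.

```python
# Parameter counts as reported in the article, in billions.
MODELS = {
    "DeepSeek-V4-Pro": {"total_b": 1600.0, "active_b": 49.0},
    "DeepSeek-V4-Flash": {"total_b": 284.0, "active_b": 13.0},
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Return the share of parameters activated per token (0..1)."""
    return active_b / total_b


for name, params in MODELS.items():
    frac = active_fraction(params["total_b"], params["active_b"])
    print(f"{name}: {frac:.1%} of parameters active per token")
```

By this arithmetic, V4-Pro activates about 3.1% of its parameters per token and V4-Flash about 4.6%.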

Architectural Improvements

The V4 series introduces a hybrid attention mechanism combining Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). According to DeepSeek, in the 1M-token context setting, DeepSeek-V4-Pro requires only 27% of single-token inference FLOPs and 10% of KV cache compared with DeepSeek-V3.2.
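To make the reported ratios concrete, here is a minimal sketch that applies them to a baseline. The 100 GB baseline is a made-up placeholder, not a published V3.2 figure; only the 0.10 and 0.27 ratios come from DeepSeek's claims for the 1M-token setting.

```python
# Ratios reported by DeepSeek for V4-Pro vs. V3.2 at 1M-token context.
KV_CACHE_RATIO = 0.10
FLOPS_RATIO = 0.27


def scaled(baseline: float, ratio: float) -> float:
    """Scale a baseline quantity by a reported V4/V3.2 ratio."""
    return baseline * ratio


def reduction_percent(ratio: float) -> float:
    """Convert a 'requires X of baseline' ratio into a percent reduction."""
    return (1.0 - ratio) * 100.0


# Hypothetical baseline: assume a V3.2 KV cache of 100 GB for one sequence.
assumed_v32_kv_gb = 100.0
print(f"V4-Pro KV cache: {scaled(assumed_v32_kv_gb, KV_CACHE_RATIO):.1f} GB")
print(f"KV cache reduction: {reduction_percent(KV_CACHE_RATIO):.0f}%")
print(f"FLOPs reduction: {reduction_percent(FLOPS_RATIO):.0f}%")
print(f"Sequences per fixed KV budget: {1.0 / KV_CACHE_RATIO:.0f}x")
```

Requiring 10% of the KV cache is the same as a 90% reduction, which means roughly ten times as many 1M-token sequences fit in the same cache budget, all else being equal.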

The models incorporate Manifold-Constrained Hyper-Connections (mHC) to strengthen residual connections and were trained using the Muon optimizer. Both models were pre-trained on more than 32 trillion diverse tokens.

Benchmark Performance

DeepSeek-V4-Pro-Max achieves 90.1% on MMLU-Pro, 90.1% on GPQA Diamond, and 93.5% pass@1 on LiveCodeBench. The model also reaches a Codeforces rating of 3206 and 95.2% pass@1 on HMMT 2026 February.

On knowledge benchmarks, the Pro model scores 90.8% on MMLU-Redux (5-shot), 93.1% on C-Eval (5-shot), and 55.2% on SimpleQA Verified (25-shot). On long-context tasks, the base model achieves 51.5% on LongBench-V2 and the instruct model achieves 83.5 MMR on MRCR 1M.

DeepSeek-V4-Flash-Base, despite having fewer parameters, achieves 88.7% on MMLU (5-shot), 68.3% on MMLU-Pro (5-shot), and 69.5% pass@1 on HumanEval (0-shot).

Three Reasoning Modes

Both V4 models support three reasoning effort modes:

  • Non-think: Fast responses for routine tasks
  • Think: Conscious logical analysis with visible reasoning tokens
  • Think Max: Extended reasoning for complex problems

The performance difference is substantial. DeepSeek-V4-Pro-Max achieves 37.7% pass@1 on HLE (Humanity's Last Exam), while the non-think mode scores only 7.7%. On GPQA Diamond, the Max mode reaches 90.1% compared to 72.9% for non-think.
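The gap between modes can be expressed in percentage points with a few lines of arithmetic. The sketch below uses only the scores quoted above; the `SCORES` table and `mode_gap` helper are illustrative names.

```python
# Scores quoted in the article for DeepSeek-V4-Pro, in percent.
SCORES = {
    "HLE": {"max": 37.7, "non_think": 7.7},
    "GPQA Diamond": {"max": 90.1, "non_think": 72.9},
}


def mode_gap(max_score: float, non_think_score: float) -> float:
    """Return the Max-minus-non-think gap in percentage points."""
    return round(max_score - non_think_score, 1)


for benchmark, s in SCORES.items():
    gap = mode_gap(s["max"], s["non_think"])
    print(f"{benchmark}: +{gap} points with Think Max")
```

That works out to a 30.0-point gap on HLE and a 17.2-point gap on GPQA Diamond.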

Model Availability

All models are available on Hugging Face with mixed precision formats. The standard versions use FP4 for MoE expert parameters and FP8 for most other parameters. Base models use FP8 mixed precision.
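For a sense of what the mixed-precision format implies for weight storage, the sketch below bounds the checkpoint size from the bit widths alone: 4 bits (0.5 bytes) per FP4 parameter and 8 bits (1 byte) per FP8 parameter. The share of parameters that are MoE expert weights is not stated in the article, so `expert_fraction` is a free assumption, and the estimate ignores quantization scale factors and any tensors kept at higher precision.

```python
BYTES_PER_PARAM = {"fp4": 0.5, "fp8": 1.0}


def weight_gb(total_params_b: float, expert_fraction: float) -> float:
    """Estimate weight storage in decimal GB.

    total_params_b: total parameters in billions (from the article).
    expert_fraction: ASSUMED share of parameters stored as FP4 experts.
    """
    if not 0.0 <= expert_fraction <= 1.0:
        raise ValueError("expert_fraction must be between 0 and 1")
    expert_gb = total_params_b * expert_fraction * BYTES_PER_PARAM["fp4"]
    other_gb = total_params_b * (1.0 - expert_fraction) * BYTES_PER_PARAM["fp8"]
    return expert_gb + other_gb


for name, total_b in [("DeepSeek-V4-Pro", 1600.0), ("DeepSeek-V4-Flash", 284.0)]:
    low = weight_gb(total_b, 1.0)   # everything in FP4 (lower bound)
    high = weight_gb(total_b, 0.0)  # everything in FP8 (upper bound)
    print(f"{name}: between {low:.0f} GB and {high:.0f} GB of weights")
```

Under these assumptions, the standard V4-Pro weights fall between roughly 800 GB and 1,600 GB, and V4-Flash between roughly 142 GB and 284 GB. The true figure depends on how many parameters sit in the experts.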

DeepSeek-V4-Flash-DSpark, also available on Hugging Face, is not a new model but the same checkpoint with an additional speculative decoding module attached for faster inference.

What This Means

The reported 90% reduction in KV cache relative to DeepSeek-V3.2 at 1M-token context addresses one of the primary bottlenecks in long-context inference. The architectural changes behind this efficiency, particularly the hybrid attention mechanism, could influence future model designs. The three reasoning modes give explicit control over the trade-off between inference cost and output quality, and the large gaps between modes (for example, 37.7% versus 7.7% on HLE) suggest that extended reasoning remains essential for complex tasks despite the model's scale.
