DeepSeek V4 Pro launches with 1.6T parameters at $1.74/M tokens, undercutting Claude Sonnet 4.6 by 42%
DeepSeek released two preview models: V4 Pro (1.6T total parameters, 49B active) and V4 Flash (284B total, 13B active), both with 1 million token context windows. V4 Pro is priced at $1.74/M input tokens and $3.48/M output—42% cheaper than Claude Sonnet 4.6—while V4 Flash at $0.14/$0.28 per million tokens undercuts all small frontier models.
DeepSeek V4 Pro — Quick Specs
DeepSeek V4 Pro launches with 1.6T parameters at $1.74/M tokens, undercutting Claude Sonnet 4.6 by 42%
DeepSeek released two preview models in its V4 series: DeepSeek-V4-Pro and DeepSeek-V4-Flash. Both are Mixture of Experts models with 1 million token context windows, released under the MIT license.
Model specifications
DeepSeek-V4-Pro has 1.6 trillion total parameters with 49 billion active parameters. DeepSeek-V4-Flash has 284 billion total parameters with 13 billion active. According to DeepSeek, this makes V4 Pro the largest open weights model available, exceeding Kimi K2.6 (1.1T) and GLM-5.1 (754B), and more than double the size of DeepSeek V3.2 (685B).
The Pro model weighs 865GB on Hugging Face. Flash weighs 160GB.
Pricing comparison
DeepSeek's pricing significantly undercuts existing frontier models:
DeepSeek V4 Flash: $0.14/M input tokens, $0.28/M output tokens DeepSeek V4 Pro: $1.74/M input tokens, $3.48/M output tokens
For comparison:
- GPT-5.4: $2.50/$15 per million tokens
- Claude Sonnet 4.6: $3/$15 per million tokens
- Gemini 3.1 Pro: $2/$12 per million tokens
- Claude Haiku 4.5: $1/$5 per million tokens
- GPT-5.4 Nano: $0.20/$1.25 per million tokens
V4 Flash is the cheapest small model available. V4 Pro costs 42% less than Claude Sonnet 4.6 for input tokens and 77% less for output tokens.
Efficiency improvements
DeepSeek attributes the low pricing to substantial efficiency gains. According to their technical paper, in 1M-token context scenarios, V4 Pro achieves only 27% of the single-token FLOPs and 10% of the KV cache size compared to DeepSeek V3.2. V4 Flash achieves 10% of the FLOPs and 7% of the KV cache size.
Performance benchmarks
DeepSeek claims V4 Pro is competitive with frontier models, with one caveat. According to the company's paper: "Through the expansion of reasoning tokens, DeepSeek-V4-Pro-Max demonstrates superior performance relative to GPT-5.2 and Gemini-3.0-Pro on standard reasoning benchmarks. Nevertheless, its performance falls marginally short of GPT-5.4 and Gemini-3.1-Pro, suggesting a developmental trajectory that trails state-of-the-art frontier models by approximately 3 to 6 months."
No independent benchmarks are yet available.
Availability
Both models are available via OpenRouter and downloadable from Hugging Face. The models can be accessed through the standard llm-openrouter plugin.
What this means
DeepSeek's aggressive pricing creates immediate pressure on OpenAI, Anthropic, and Google to justify their premium pricing—or match it. At $1.74 per million input tokens, V4 Pro costs less than half of Claude Sonnet 4.6 while claiming near-frontier performance. If the performance claims hold under independent testing, this represents a fundamental shift in the economics of frontier model deployment. The 90% reduction in KV cache size at 1M tokens also suggests meaningful architectural innovations beyond simple scaling, particularly for long-context applications where memory constraints have been a limiting factor.
Related Articles
NVIDIA Releases Nemotron-3-Ultra: 550B Parameter Model with 1M Token Context and Configurable Reasoning
NVIDIA released Nemotron-3-Ultra-550B-A55B-NVFP4, a 550B parameter model with 55B active parameters, featuring a 1M token context window and configurable reasoning mode. The model uses a hybrid LatentMoE architecture combining Mamba-2, Mixture-of-Experts, and Attention layers with Multi-Token Prediction, trained with NVIDIA's NVFP4 quantization-aware approach.
NVIDIA releases Nemotron-3-Ultra: 550B parameter model with 1M token context and configurable reasoning
NVIDIA released Nemotron-3-Ultra-550B, a frontier-scale model with 550B total parameters (55B active) and up to 1M token context window. The model uses a hybrid LatentMoE architecture combining Mamba-2, MoE, and attention layers with Multi-Token Prediction, trained with NVFP4 quantization-aware methods from December 2025 to April 2026.
Nvidia Releases Nemotron 3 Ultra: 550B Parameter MoE Model with 1M Token Context Window
Nvidia has released Nemotron 3 Ultra, a 550B parameter mixture-of-experts model with 55B active parameters and a 1M token context window. The model uses a hybrid Transformer-Mamba architecture and is available for free through OpenRouter, targeting agentic workflows and multi-step reasoning tasks.
Google DeepMind Releases Gemma 4: Encoder-Free Multimodal Models from 2.3B to 30.7B Parameters
Google DeepMind released Gemma 4, a family of open-weight multimodal models ranging from 2.3B to 30.7B parameters. The flagship 12B Unified model eliminates separate encoders, processing text, images, audio, and video directly through a single decoder-only transformer with up to 256K token context window.
Comments
Loading...