model releaseDeepSeek

DeepSeek V4 Pro launches with 1.6T parameters at $1.74/M tokens, undercutting Claude Sonnet 4.6 by 42%

TL;DR

DeepSeek released two preview models: V4 Pro (1.6T total parameters, 49B active) and V4 Flash (284B total, 13B active), both with 1 million token context windows. V4 Pro is priced at $1.74/M input tokens and $3.48/M output—42% cheaper than Claude Sonnet 4.6—while V4 Flash at $0.14/$0.28 per million tokens undercuts all small frontier models.

April 24, 2026 · 6:21 AM2 min read

DeepSeek V4 Pro — Quick Specs

Context window1000K tokens

Input$0.0036/1M tokens

Output$0.87/1M tokens

Compare DeepSeek V4 Pro with other models →

DeepSeek V4 Pro launches with 1.6T parameters at $1.74/M tokens, undercutting Claude Sonnet 4.6 by 42%

DeepSeek released two preview models in its V4 series: DeepSeek-V4-Pro and DeepSeek-V4-Flash. Both are Mixture of Experts models with 1 million token context windows, released under the MIT license.

Model specifications

DeepSeek-V4-Pro has 1.6 trillion total parameters with 49 billion active parameters. DeepSeek-V4-Flash has 284 billion total parameters with 13 billion active. According to DeepSeek, this makes V4 Pro the largest open weights model available, exceeding Kimi K2.6 (1.1T) and GLM-5.1 (754B), and more than double the size of DeepSeek V3.2 (685B).

The Pro model weighs 865GB on Hugging Face. Flash weighs 160GB.

Pricing comparison

DeepSeek's pricing significantly undercuts existing frontier models:

DeepSeek V4 Flash: $0.14/M input tokens, $0.28/M output tokens DeepSeek V4 Pro: $1.74/M input tokens, $3.48/M output tokens

For comparison:

GPT-5.4: $2.50/$15 per million tokens
Claude Sonnet 4.6: $3/$15 per million tokens
Gemini 3.1 Pro: $2/$12 per million tokens
Claude Haiku 4.5: $1/$5 per million tokens
GPT-5.4 Nano: $0.20/$1.25 per million tokens

V4 Flash is the cheapest small model available. V4 Pro costs 42% less than Claude Sonnet 4.6 for input tokens and 77% less for output tokens.

Efficiency improvements

DeepSeek attributes the low pricing to substantial efficiency gains. According to their technical paper, in 1M-token context scenarios, V4 Pro achieves only 27% of the single-token FLOPs and 10% of the KV cache size compared to DeepSeek V3.2. V4 Flash achieves 10% of the FLOPs and 7% of the KV cache size.

Performance benchmarks

DeepSeek claims V4 Pro is competitive with frontier models, with one caveat. According to the company's paper: "Through the expansion of reasoning tokens, DeepSeek-V4-Pro-Max demonstrates superior performance relative to GPT-5.2 and Gemini-3.0-Pro on standard reasoning benchmarks. Nevertheless, its performance falls marginally short of GPT-5.4 and Gemini-3.1-Pro, suggesting a developmental trajectory that trails state-of-the-art frontier models by approximately 3 to 6 months."

No independent benchmarks are yet available.

Availability

Both models are available via OpenRouter and downloadable from Hugging Face. The models can be accessed through the standard llm-openrouter plugin.

What this means

DeepSeek's aggressive pricing creates immediate pressure on OpenAI, Anthropic, and Google to justify their premium pricing—or match it. At $1.74 per million input tokens, V4 Pro costs less than half of Claude Sonnet 4.6 while claiming near-frontier performance. If the performance claims hold under independent testing, this represents a fundamental shift in the economics of frontier model deployment. The 90% reduction in KV cache size at 1M tokens also suggests meaningful architectural innovations beyond simple scaling, particularly for long-context applications where memory constraints have been a limiting factor.

Source: simonwillison.net ↗

deepseek model-release pricing mixture-of-experts open-weights long-context china-ai

model releaseJune 5, 2026

NVIDIA Releases Nemotron-3-Ultra: 550B Parameter Model with 1M Token Context and Configurable Reasoning

NVIDIA released Nemotron-3-Ultra-550B-A55B-NVFP4, a 550B parameter model with 55B active parameters, featuring a 1M token context window and configurable reasoning mode. The model uses a hybrid LatentMoE architecture combining Mamba-2, Mixture-of-Experts, and Attention layers with Multi-Token Prediction, trained with NVIDIA's NVFP4 quantization-aware approach.

model releaseJune 5, 2026

NVIDIA releases Nemotron-3-Ultra: 550B parameter model with 1M token context and configurable reasoning

NVIDIA released Nemotron-3-Ultra-550B, a frontier-scale model with 550B total parameters (55B active) and up to 1M token context window. The model uses a hybrid LatentMoE architecture combining Mamba-2, MoE, and attention layers with Multi-Token Prediction, trained with NVFP4 quantization-aware methods from December 2025 to April 2026.

model releaseJune 4, 2026

Nvidia Releases Nemotron 3 Ultra: 550B Parameter MoE Model with 1M Token Context Window

Nvidia has released Nemotron 3 Ultra, a 550B parameter mixture-of-experts model with 55B active parameters and a 1M token context window. The model uses a hybrid Transformer-Mamba architecture and is available for free through OpenRouter, targeting agentic workflows and multi-step reasoning tasks.

model releaseJune 3, 2026

Google DeepMind Releases Gemma 4: Encoder-Free Multimodal Models from 2.3B to 30.7B Parameters

Google DeepMind released Gemma 4, a family of open-weight multimodal models ranging from 2.3B to 30.7B parameters. The flagship 12B Unified model eliminates separate encoders, processing text, images, audio, and video directly through a single decoder-only transformer with up to 256K token context window.

DeepSeek V4 Pro launches with 1.6T parameters at $1.74/M tokens, undercutting Claude Sonnet 4.6 by 42%

DeepSeek V4 Pro — Quick Specs

DeepSeek V4 Pro launches with 1.6T parameters at $1.74/M tokens, undercutting Claude Sonnet 4.6 by 42%

Model specifications

Pricing comparison

Efficiency improvements

Performance benchmarks

Availability

What this means

Related Articles

NVIDIA Releases Nemotron-3-Ultra: 550B Parameter Model with 1M Token Context and Configurable Reasoning

NVIDIA releases Nemotron-3-Ultra: 550B parameter model with 1M token context and configurable reasoning

Nvidia Releases Nemotron 3 Ultra: 550B Parameter MoE Model with 1M Token Context Window

Google DeepMind Releases Gemma 4: Encoder-Free Multimodal Models from 2.3B to 30.7B Parameters

Comments