
DeepSeek V4 cuts inference costs with 1.6T parameter model using 13.7x less memory than V3

TL;DR

DeepSeek released V4 in two versions: a 284 billion parameter Flash model and a 1.6 trillion parameter Pro model with 49 billion active parameters. According to DeepSeek, the models use 9.5x-13.7x less memory than V3 through compressed attention mechanisms and FP4/FP8 mixed precision, while supporting a 1 million token context window.



DeepSeek released V4 on April 24, 2026, offering two open-weight models: a 284 billion parameter Flash mixture-of-experts model with 13 billion active parameters, and a 1.6 trillion parameter Pro model with 49 billion active parameters. The company claims the models rival proprietary Western LLMs while dramatically reducing inference costs.

Architecture and efficiency gains

The V4-Pro model was trained on 33 trillion tokens. According to DeepSeek's benchmarks, it outperforms all open-weight LLMs and matches leading proprietary models. However, the company's benchmark claims have not been independently verified.

The key technical advancement is a hybrid attention mechanism combining Compressed Sparse Attention and Heavy Compressed Attention. These techniques reduce both compute requirements during inference and memory needed for key-value caches. DeepSeek claims this results in 9.5x-13.7x less memory usage compared to V3.2 while supporting a 1 million token context window.
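To see why KV-cache memory dominates long-context serving costs, a back-of-envelope calculation helps. The figures below (layer count, KV heads, head dimension, and the compression ratio) are illustrative assumptions for a large MoE model, not DeepSeek's published architecture:

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    # Per token, each layer stores one key and one value vector per KV head.
    return seq_len * n_layers * n_kv_heads * head_dim * 2 * bytes_per_elem

# Illustrative numbers only -- not DeepSeek's disclosed configuration.
full = kv_cache_bytes(seq_len=1_000_000, n_layers=60, n_kv_heads=8,
                      head_dim=128, bytes_per_elem=2)  # BF16 baseline
compressed = full / 12  # e.g. a ~12x reduction from compressed attention
print(f"baseline:   {full / 2**30:.1f} GiB per sequence")
print(f"compressed: {compressed / 2**30:.1f} GiB per sequence")
```

Even with these rough numbers, an uncompressed 1 million token cache runs to hundreds of gibibytes per sequence, which is why an order-of-magnitude reduction changes the serving economics.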

Both models use mixed FP8 and FP4 precision. The mixture-of-experts weights specifically use FP4 through quantization-aware training—halving memory requirements compared to FP8 at the cost of reduced precision. DeepSeek V3 was among the first open models trained at FP8; V4 pushes further into lower precision territory.
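The memory impact of the precision split is simple arithmetic. The sketch below assumes a hypothetical division of the 1.6T parameters between expert and dense weights; the split is illustrative, not a figure DeepSeek has disclosed:

```python
def weight_memory_gib(n_params, bits):
    # Weight storage in GiB for a given parameter count and bit width.
    return n_params * bits / 8 / 2**30

# Assumed split: most MoE parameters live in the expert layers.
expert_params, dense_params = 1.55e12, 0.05e12

fp8_total = weight_memory_gib(expert_params, 8) + weight_memory_gib(dense_params, 8)
fp4_total = weight_memory_gib(expert_params, 4) + weight_memory_gib(dense_params, 8)

print(f"all weights FP8:          {fp8_total:,.0f} GiB")
print(f"FP4 experts, FP8 dense:   {fp4_total:,.0f} GiB")
```

Under these assumptions, quantizing only the expert weights to FP4 still cuts total weight memory by nearly half, since the experts dominate the parameter count.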

For training, DeepSeek adopted the Muon optimizer, which is designed to speed convergence and improve training stability.
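Muon, as publicly described, replaces Adam-style elementwise updates with a momentum matrix that is approximately orthogonalized via a Newton-Schulz iteration before being applied. A minimal NumPy sketch follows; the quintic coefficients and hyperparameters are the commonly cited reference values, not figures taken from DeepSeek's paper:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    # Approximately orthogonalize G with a quintic Newton-Schulz iteration.
    # Coefficients are the widely circulated tuned values (an assumption here).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # scale so singular values are <= 1
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T  # work with the wide orientation so X @ X.T is small
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def muon_step(weight, grad, momentum, lr=0.02, beta=0.95):
    # Muon update: accumulate momentum, orthogonalize the (Nesterov-style)
    # search direction, then take a plain SGD step along it.
    momentum = beta * momentum + grad
    update = newton_schulz_orthogonalize(beta * momentum + grad)
    weight -= lr * update
    return weight, momentum
```

The orthogonalization equalizes the singular values of the update, which is the property usually credited with Muon's faster convergence on matrix-shaped weights.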

Hardware support and deployment

DeepSeek validated V4 to run on both Nvidia GPUs and Huawei Ascend NPUs. The technical paper confirms the company tested its "fine-grained EP [Expert Parallel] scheme on both Nvidia GPUs and Ascend NPU platforms."

The paper does not specify whether Huawei hardware was used for pre-training or only for inference and post-training reinforcement learning. DeepSeek may have used Nvidia GPUs for initial training and Huawei accelerators for the inference-adjacent reinforcement learning phase.

The models are available on Hugging Face, through DeepSeek's API, and via the company's web service. Pricing has not been disclosed.

What this means

DeepSeek's focus on inference efficiency addresses the primary cost barrier in large language model deployment. Reducing KV cache memory by an order of magnitude while maintaining a 1 million token context window would significantly lower serving costs for providers. The smaller 284B Flash model offers a middle ground between capability and cost.

Validation on Huawei Ascend NPUs is notable given U.S. export restrictions on Nvidia chips to China. If DeepSeek can achieve comparable performance on domestic hardware, it reduces dependency on American semiconductors—though the extent of Huawei chip usage in training versus inference remains unclear. The FP4 quantization strategy suggests DeepSeek is optimizing for hardware with limited precision support or memory constraints.
