DeepSeek V4 cuts inference costs with 1.6T parameter model using 13.7x less memory than V3

TL;DR

DeepSeek released V4 in two versions: a 284 billion parameter Flash model with 13 billion active parameters and a 1.6 trillion parameter Pro model with 49 billion active parameters. According to DeepSeek, the models use 9.5x-13.7x less memory than V3.2 through compressed attention mechanisms and mixed FP4/FP8 precision, while supporting a 1 million token context window.

April 24, 2026 · 9:36 PM · 2 min read

DeepSeek V4 Pro — Quick Specs

Context window: 1000K tokens

Input: $0.0036/1M tokens

Output: $0.87/1M tokens

DeepSeek released V4 on April 24, 2026, offering two open-weight models: a 284 billion parameter Flash mixture-of-experts model with 13 billion active parameters, and a 1.6 trillion parameter Pro model with 49 billion active parameters. The company claims the models rival proprietary Western LLMs while dramatically reducing inference costs.

Architecture and efficiency gains

The V4-Pro model was trained on 33 trillion tokens. According to DeepSeek's benchmarks, it outperforms all open-weight LLMs and matches leading proprietary models, though these claims have not been independently verified.

The key technical advancement is a hybrid attention mechanism combining Compressed Sparse Attention and Heavy Compressed Attention. These techniques reduce both compute requirements during inference and memory needed for key-value caches. DeepSeek claims this results in 9.5x-13.7x less memory usage compared to V3.2 while supporting a 1 million token context window.
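To put the memory claim in perspective, here is a rough back-of-the-envelope sketch in Python. The layer count, KV-head count, head dimension, and one-byte cache entries are hypothetical placeholders, not the published configuration of V4 or V3.2. Only the 1 million token context and the 9.5x and 13.7x factors come from DeepSeek's claims, and the baseline is a plain uncompressed cache used purely as a stand-in for the reference model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value):
    """Size of a plain, uncompressed key-value cache for a single sequence."""
    # Factor of 2: one key vector and one value vector per token, head, and layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


def apply_reduction(baseline_bytes, factor):
    """Cache size after a claimed 'factor'-times reduction."""
    return baseline_bytes / factor


# Hypothetical model shape (assumption, not from the article or the paper).
baseline = kv_cache_bytes(
    num_layers=60,
    num_kv_heads=8,
    head_dim=128,
    seq_len=1_000_000,   # 1 million token context, as stated in the article
    bytes_per_value=1,   # assumes 8-bit cache entries
)

# Claimed reduction range from the article: 9.5x to 13.7x.
low_saving = apply_reduction(baseline, 9.5)
high_saving = apply_reduction(baseline, 13.7)

print(f"baseline:  {baseline / 1e9:.2f} GB")
print(f"at 9.5x:   {low_saving / 1e9:.2f} GB")
print(f"at 13.7x:  {high_saving / 1e9:.2f} GB")
```

With these placeholder numbers, the baseline cache is about 123 GB per sequence, and the claimed reductions would bring it down to roughly 13 GB and 9 GB.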

Both models use mixed FP8 and FP4 precision. The mixture-of-experts weights specifically use FP4 through quantization-aware training, which halves their memory footprint compared to FP8 at the cost of reduced precision. DeepSeek V3 was among the first open models trained at FP8; V4 pushes further into lower-precision territory.
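The trade-off between FP8 and FP4 can be illustrated with a minimal sketch. It assumes the common E2M1 flavor of FP4 (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) and a single scale per block of weights; the article does not say which FP4 format or scaling scheme DeepSeek uses, so the function names and the blocking here are illustrative only. Quantization-aware training typically runs this kind of "fake quantization" in the forward pass so the model learns to tolerate the rounding.

```python
# Representable magnitudes of the E2M1 FP4 format (assumed format, see lead-in).
FP4_E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def fake_quantize_fp4(block):
    """Round each weight in a block to the nearest scaled FP4 value."""
    max_abs = max(abs(w) for w in block)
    if max_abs == 0.0:
        return list(block)
    # One shared scale per block maps the largest weight onto 6.0, the FP4 maximum.
    scale = max_abs / 6.0
    quantized = []
    for w in block:
        nearest = min(FP4_E2M1_MAGNITUDES, key=lambda m: abs(abs(w) / scale - m))
        quantized.append((nearest if w >= 0 else -nearest) * scale)
    return quantized


def weight_bytes(num_params, bits_per_param):
    """Raw storage for the weights, ignoring the per-block scale overhead."""
    return num_params * bits_per_param / 8


block = [6.0, 3.1, -0.4, 0.0]
print(fake_quantize_fp4(block))                          # rounding error is the precision cost
print(weight_bytes(1_000_000, 8), weight_bytes(1_000_000, 4))
```

The storage figures ignore the scale factors, which add a small overhead on top of the 4 bits per weight.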

For training, DeepSeek used the Muon optimizer, which is designed to speed convergence and improve training stability.
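The core idea behind Muon, as described in public write-ups of the optimizer, is to orthogonalize the momentum matrix of each weight matrix before applying it as an update. The sketch below illustrates that idea in plain Python on nested lists. It swaps in the classic cubic Newton-Schulz iteration for the tuned quintic polynomial used in public Muon implementations, and the hyperparameters are placeholders. Nothing here is taken from DeepSeek's training code.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def orthogonalize(matrix, steps=10):
    """Approximate the nearest orthogonal matrix with cubic Newton-Schulz steps."""
    norm = math.sqrt(sum(v * v for row in matrix for v in row))
    if norm == 0.0:
        return [list(row) for row in matrix]
    # Frobenius normalization keeps every singular value in (0, 1],
    # which is inside the convergence region of the iteration.
    x = [[v / norm for v in row] for row in matrix]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        # X <- 1.5 X - 0.5 (X X^T) X pushes each singular value toward 1.
        x = [[1.5 * v - 0.5 * w for v, w in zip(rx, rw)] for rx, rw in zip(x, xxt_x)]
    return x


def muon_step(weight, grad, momentum, lr=0.02, beta=0.95):
    """One simplified Muon-style update for a single 2D weight matrix."""
    new_momentum = [[beta * m + g for m, g in zip(mr, gr)]
                    for mr, gr in zip(momentum, grad)]
    update = orthogonalize(new_momentum)
    new_weight = [[w - lr * u for w, u in zip(wr, ur)]
                  for wr, ur in zip(weight, update)]
    return new_weight, new_momentum
```

Ten cubic steps are enough for small, well-conditioned examples; matrices with very small singular values would need more iterations or the tuned polynomial.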

Hardware support and deployment

DeepSeek validated V4 to run on both Nvidia GPUs and Huawei Ascend NPUs. The technical paper confirms the company tested its "fine-grained EP [Expert Parallel] scheme on both Nvidia GPUs and Ascend NPU platforms."
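The article does not describe how the "fine-grained EP" scheme works, so the following is only a generic illustration of expert parallelism: experts are sharded across devices, each token is routed to its top-k experts, and the tokens are grouped by the device that hosts each chosen expert. The function names, the contiguous expert-to-device mapping, and the toy router scores are all hypothetical.

```python
from collections import defaultdict


def expert_to_device(expert_id, num_experts, num_devices):
    """Map an expert to the device hosting it (contiguous, equal-sized shards)."""
    # Assumes num_experts is divisible by num_devices.
    experts_per_device = num_experts // num_devices
    return expert_id // experts_per_device


def top_k_experts(scores, k):
    """Indices of the k highest router scores for one token."""
    return sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]


def dispatch(token_scores, num_devices, k):
    """Group (token, expert) pairs by the device that must process them."""
    num_experts = len(token_scores[0])
    plan = defaultdict(list)
    for token_id, scores in enumerate(token_scores):
        for expert_id in top_k_experts(scores, k):
            device = expert_to_device(expert_id, num_experts, num_devices)
            plan[device].append((token_id, expert_id))
    return dict(plan)


# Two tokens, four experts, two devices, top-2 routing (toy router scores).
scores = [
    [0.10, 0.70, 0.15, 0.05],
    [0.05, 0.05, 0.30, 0.60],
]
print(dispatch(scores, num_devices=2, k=2))
```

In a real system the resulting plan drives an all-to-all exchange between devices, which is where much of the hardware-specific tuning tends to go.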

The paper does not specify whether Huawei hardware was used for pre-training or only for inference and post-training reinforcement learning. DeepSeek may have used Nvidia GPUs for initial training and Huawei accelerators for the inference-adjacent reinforcement learning phase.

The models are available on Hugging Face, through DeepSeek's API, and via the company's web service. Pricing has not been disclosed.

What this means

DeepSeek's focus on inference efficiency addresses the primary cost barrier in large language model deployment. Reducing KV cache memory by an order of magnitude while maintaining a 1 million token context window would significantly lower serving costs for providers. The smaller 284B Flash model offers a middle ground between capability and cost.

Validation on Huawei Ascend NPUs is notable given U.S. export restrictions on Nvidia chips to China. If DeepSeek can achieve comparable performance on domestic hardware, it reduces dependency on American semiconductors—though the extent of Huawei chip usage in training versus inference remains unclear. The FP4 quantization strategy suggests DeepSeek is optimizing for hardware with limited precision support or memory constraints.

Source: go.theregister.com

deepseek llm mixture-of-experts inference-optimization huawei ascend-npu fp4-quantization open-weights
