DeepSeek V4 Flash Released: 284B Parameter MoE Model with 1M Context Window at $0.14 per Million Tokens
DeepSeek has released V4 Flash, a Mixture-of-Experts model with 284B total parameters, of which 13B are activated per token. The model supports a 1,048,576-token context window and is priced at $0.14 per million input tokens and $0.28 per million output tokens.
Model Architecture and Capabilities
DeepSeek V4 Flash uses a sparse Mixture-of-Experts architecture that activates only 13B of its 284B total parameters for each token it processes. According to DeepSeek, the model includes hybrid attention mechanisms designed for efficient long-context processing.
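To illustrate what sparse activation means in general, here is a toy sketch of top-k expert routing, the common mechanism behind Mixture-of-Experts layers. DeepSeek's actual router is not described in this article, so the function name `top_k_route`, the expert count, the gate scores, and the choice of k are all hypothetical.

```python
import math

def top_k_route(gate_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    # Rank experts by gate score and keep only the top k.
    ranked = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    chosen = ranked[:k]
    # Softmax over the chosen scores only, so the weights sum to 1.
    peak = max(gate_logits[i] for i in chosen)
    exps = [math.exp(gate_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

# Hypothetical gate scores for 8 experts on a single token (illustrative values).
logits = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.9]
routes = top_k_route(logits, k=2)
print(routes)  # only experts 1 and 4 run for this token; the other six are skipped
```

Because only the selected experts run, compute per token scales with the activated parameters rather than the total, while every expert's weights still have to be stored.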
The model supports configurable reasoning modes, allowing it to show step-by-step thinking processes. DeepSeek claims the model maintains strong performance on reasoning and coding tasks despite its efficiency optimizations.
Pricing and Availability
The model is available through OpenRouter at:
- Input: $0.14 per million tokens
- Output: $0.28 per million tokens
These prices position V4 Flash as a cost-effective option for high-throughput workloads compared to larger models with similar context windows.
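As a rough guide to what these rates mean in practice, the sketch below estimates the cost of a single request from the listed prices. The function name `estimate_cost_usd` and the example token counts are illustrative assumptions; actual billing may differ (for example, through caching discounts or provider fees not covered here).

```python
# Listed OpenRouter prices, in USD per million tokens.
INPUT_USD_PER_MTOK = 0.14
OUTPUT_USD_PER_MTOK = 0.28

def estimate_cost_usd(input_tokens, output_tokens):
    """Estimate the cost of one request from its input and output token counts."""
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000

# Hypothetical request: a prompt filling the full 1,048,576-token window
# plus a 4,000-token response.
cost = estimate_cost_usd(1_048_576, 4_000)
print(f"${cost:.4f}")  # about $0.15 for the whole request
```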
Target Use Cases
DeepSeek designed V4 Flash for applications requiring fast inference and high throughput, including:
- Coding assistants
- Chat systems
- Agent workflows
The model's sparse activation pattern (13B of 284B parameters, or about 4.6% of the total) is intended to speed up inference while preserving model quality.
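The activation percentage follows directly from the two published parameter counts. A quick check, with variable names chosen here for illustration:

```python
TOTAL_PARAMS_B = 284   # total parameters, in billions
ACTIVE_PARAMS_B = 13   # activated parameters, in billions

# Share of the model's parameters that take part in each forward pass.
active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
print(f"{active_fraction:.1%}")  # 4.6%
```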
Technical Details
- Release date: April 24, 2026 (as listed on OpenRouter)
- Context window: 1,048,576 tokens
- Architecture: Sparse Mixture-of-Experts
- Reasoning support: Configurable reasoning modes with exposed thinking processes
What This Means
DeepSeek V4 Flash continues the trend of using sparse MoE architectures to deliver capable models at lower inference costs. Activating only 13B parameters per token allows faster processing than dense models of similar capability, while the 1M-token context window matches the extended-context offerings from competitors such as Anthropic and Google.
The $0.14/$0.28 per million token pricing undercuts many competing models with similar context lengths, which could make it attractive for high-volume production deployments where cost per token matters more than absolute peak performance.