DeepSeek Releases V4 Flash: 284B-Parameter MoE Model with 1M Context Window, Free via OpenRouter
DeepSeek has released V4 Flash, a Mixture-of-Experts model with 284B total parameters and 13B activated parameters per forward pass. The model supports a 1M-token context window and is available free through OpenRouter, targeting high-throughput coding and chat applications.
Technical Specifications
DeepSeek V4 Flash employs a sparse MoE architecture that activates only 13B of its 284B total parameters during each forward pass, designed to reduce inference costs while maintaining performance. According to DeepSeek, the model uses hybrid attention mechanisms for efficient long-context processing.
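The sparse-activation idea can be illustrated with a toy top-k routing layer. The sketch below is not DeepSeek's architecture: the expert count, top-k value, and hidden size are placeholder assumptions, and only the 284B-total / 13B-active figures come from the announcement.

```python
# Illustrative top-k MoE routing sketch, not DeepSeek's actual design.
# Expert count, top-k, and hidden size are placeholder assumptions.
import numpy as np

NUM_EXPERTS = 64   # assumed expert count (not disclosed for V4 Flash)
TOP_K = 4          # assumed number of experts routed per token
HIDDEN = 1024      # toy hidden size for the sketch

rng = np.random.default_rng(0)
expert_weights = rng.standard_normal((NUM_EXPERTS, HIDDEN, HIDDEN)) * 0.02
router_weights = rng.standard_normal((HIDDEN, NUM_EXPERTS)) * 0.02

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route each token to its top-k experts and mix their outputs."""
    logits = x @ router_weights                        # (tokens, experts)
    top_idx = np.argsort(logits, axis=-1)[:, -TOP_K:]  # top-k expert ids per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        token_logits = logits[t, top_idx[t]]
        gates = np.exp(token_logits - token_logits.max())
        gates /= gates.sum()                           # softmax over selected experts
        for gate, e in zip(gates, top_idx[t]):
            out[t] += gate * (x[t] @ expert_weights[e])
    return out

tokens = rng.standard_normal((8, HIDDEN))
y = moe_layer(tokens)
# Only TOP_K / NUM_EXPERTS of the expert parameters touch each token.
print(y.shape, f"{TOP_K / NUM_EXPERTS:.1%} of experts active per token")
```

In this toy layer, 4 of 64 experts fire per token (about 6% of expert parameters), the same order of sparsity as V4 Flash's roughly 4.6% activation ratio (13B of 284B parameters).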
The model supports reasoning modes with "high" and "xhigh" effort levels, where xhigh maps to maximum reasoning capability. OpenRouter's implementation allows access to the model's step-by-step reasoning process through a reasoning_details array in API responses.
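As a rough sketch of how this looks in practice, the request below uses OpenRouter's chat completions endpoint with a reasoning effort setting and reads back any reasoning_details returned. The model slug and the exact effort values accepted are assumptions to verify against OpenRouter's model page.

```python
# Minimal sketch of an OpenRouter request with a reasoning effort setting.
# The model slug and the "xhigh" effort level are assumptions based on the
# announcement; check OpenRouter's model page for the exact identifiers.
import os
import requests

resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
    json={
        "model": "deepseek/deepseek-v4-flash:free",  # assumed slug for the free tier
        "messages": [{"role": "user", "content": "Refactor this function for clarity: ..."}],
        "reasoning": {"effort": "high"},             # article cites "high" and "xhigh" levels
    },
    timeout=120,
)
resp.raise_for_status()
message = resp.json()["choices"][0]["message"]

print(message["content"])
# Step-by-step reasoning, when the provider returns it, arrives alongside the answer.
for detail in message.get("reasoning_details", []):
    print(detail)
```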
Context and Availability
The 1M-token context window places V4 Flash among large-context offerings such as Anthropic's Claude 3.5 Sonnet (200K tokens) and Google's Gemini 1.5 Pro (2M tokens). DeepSeek lists the model's release date on OpenRouter as April 24, 2026, which is likely a documentation error.
OpenRouter reports serving 1.27 trillion tokens weekly for the model across its provider network. The free tier has a 256K-token context limit, reduced from the full 1M capacity.
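For developers working near that limit, a crude pre-flight check like the one below can help decide whether a prompt fits the free tier. The ~4 characters-per-token heuristic is an assumption for illustration, not the model's actual tokenizer, and the file name is hypothetical.

```python
# Rough sketch for checking a prompt against the 256K-token free-tier limit.
# The ~4 characters-per-token heuristic is a crude approximation; the model's
# real tokenizer will count differently, so leave generous headroom.
FREE_TIER_CONTEXT = 256_000   # tokens available on the free tier
FULL_CONTEXT = 1_000_000      # full context window cited for the model

def fits_free_tier(prompt: str, reserved_for_output: int = 8_000) -> bool:
    estimated_tokens = len(prompt) // 4
    return estimated_tokens + reserved_for_output <= FREE_TIER_CONTEXT

with open("large_codebase_dump.txt") as f:  # hypothetical input file
    prompt = f.read()
print("fits free tier:", fits_free_tier(prompt))
```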
Target Applications
DeepSeek positions V4 Flash for:
- Coding assistants requiring fast response times
- High-throughput chat systems
- Agent workflows with multiple API calls
- Applications where cost efficiency outweighs maximum capability
The MoE architecture aims to deliver faster inference than dense models of similar capability by activating fewer parameters per request.
What This Means
DeepSeek V4 Flash represents China-based DeepSeek's continued push into efficiency-optimized large language models, following its earlier V2 and V3 releases. The free availability through OpenRouter lowers the barrier for developers testing long-context applications, though the 256K context limit on the free tier may push production workloads toward paid alternatives. The 284B-total / 13B-active MoE configuration suggests DeepSeek is prioritizing inference cost over raw capability, betting that most applications do not need full dense-model computation to reach acceptable performance.
Related Articles
Allen Institute releases EMO, 14B parameter MoE model with selective 12.5% expert use
Allen Institute for AI released EMO, a 1B-active, 14B-total-parameter mixture-of-experts model trained on 1 trillion tokens. The model uses 8 active experts per token from a pool of 128 total experts, and can maintain near full-model performance while using just 12.5% of its experts for specific tasks.
InclusionAI Releases Ring-2.6-1T: 1 Trillion Parameter Thinking Model with 63B Active Parameters
InclusionAI has released Ring-2.6-1T, a 1 trillion parameter-scale model with 63 billion active parameters and a 262,144-token context window. The model features adaptive reasoning modes and is designed for coding agents, tool use, and long-horizon task execution.
Zyphra Releases ZAYA1-8B: 8.4B Parameter MoE Model with 760M Active Parameters Matches 80B+ Models on Math Benchmarks
Zyphra has released ZAYA1-8B, a mixture-of-experts language model with 760M active parameters and 8.4B total parameters. The model scores 89.1% on AIME 2026, competitive with models exceeding 100B parameters, while maintaining efficiency for on-device deployment.
Google DeepMind releases Gemma 4 with 31B dense model, 256K context window, and speculative decoding drafters
Google DeepMind has released Gemma 4, a family of open-weight multimodal models including a 31B dense model with 256K context window and four size variants ranging from 2.3B to 30.7B effective parameters. The release includes Multi-Token Prediction (MTP) draft models that achieve up to 2x decoding speedup through speculative decoding while maintaining identical output quality.