MiniMax Releases M3: 428B-Parameter Multimodal Model with 1M Context Window and 15× Decode Speedup
MiniMax has released M3, a multimodal model with approximately 428 billion total parameters and 23 billion activated parameters. The model supports a 1 million token context window and uses MiniMax Sparse Attention (MSA), which the company says delivers 9× prefill and 15× decode speedups over its predecessor M2 and cuts per-token compute to 1/20 of M2's.
Technical Specifications
M3 uses native multimodal training from the first step, processing text, image, and video inputs through mixed-modality training rather than adapting a text-only model. The model employs MiniMax Sparse Attention, which the company claims dramatically reduces attention compute and memory footprint compared to Grouped Query Attention (GQA) while preserving model quality.
The model features two operating modes: a "thinking" mode for complex reasoning and agentic tasks, and a "non-thinking" mode for latency-sensitive scenarios like chat and code completion. According to MiniMax, M3 achieves frontier-level performance across long-horizon agentic benchmarks.
Pricing details have not been disclosed. The model is available through the MiniMax API and for local deployment via Hugging Face.
Deployment Options
M3 can be deployed locally using three inference frameworks: SGLang, vLLM, and Transformers. MiniMax recommends specific inference parameters: temperature=1.0, top_p=0.95, and top_k=40.
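To illustrate what the recommended settings control, here is a minimal pure-Python sketch of temperature, top-k, and top-p filtering applied to a list of logits. It is not MiniMax's or any framework's implementation: the function names `filter_distribution` and `sample` are hypothetical, and real inference engines may apply the filters in a different order or renormalise at different points.

```python
import math
import random

# Sampling values recommended by MiniMax, as reported in this article.
TEMPERATURE = 1.0
TOP_P = 0.95
TOP_K = 40


def filter_distribution(logits, temperature=TEMPERATURE, top_k=TOP_K, top_p=TOP_P):
    """Return {token_index: probability} after temperature, top-k, and top-p filtering.

    Illustrative only (assumed ordering: temperature, then top-k, then top-p).
    """
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-k: keep only the k most probable token indices.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # Top-p: keep the shortest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalise so the surviving probabilities sum to 1.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample(logits, rng=random):
    """Draw one token index from the filtered distribution."""
    dist = filter_distribution(logits)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]
```

With temperature=1.0 the logits are left unscaled, so in this sketch the two truncation steps do all the work: top_k=40 caps the candidate set, and top_p=0.95 trims the low-probability tail within it.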
The model supports API access through MiniMax's own API service, with Novita listed as an additional inference provider on Hugging Face. The technical details are available in a research paper on arXiv (arXiv:2606.13392).
What This Means
M3's sparse attention architecture addresses a critical bottleneck in long-context models: compute cost at scale. The claimed 15× decode speedup at 1M tokens, if validated in independent benchmarks, would make M3 significantly more practical for production use cases requiring extended context.
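To put the headline figures in perspective, the short sketch below works through the arithmetic. The parameter counts and speedup factors are the ones MiniMax reports; the 60-second baseline is a made-up number used purely for illustration, not a measurement of M2, and the helper names are hypothetical.

```python
# Figures reported by MiniMax in this article.
TOTAL_PARAMS_B = 428    # total parameters, in billions
ACTIVE_PARAMS_B = 23    # activated parameters, in billions
PREFILL_SPEEDUP = 9     # claimed prefill speedup over M2
DECODE_SPEEDUP = 15     # claimed decode speedup over M2


def active_fraction(total_b, active_b):
    """Share of the model's parameters that are activated."""
    return active_b / total_b


def scaled_latency(baseline_seconds, speedup):
    """Latency after applying a claimed speedup to a baseline latency."""
    return baseline_seconds / speedup


# Roughly 5.4% of the parameters are activated.
print(f"active fraction: {active_fraction(TOTAL_PARAMS_B, ACTIVE_PARAMS_B):.1%}")

# Hypothetical baseline: a decode phase that took 60 s would take 4 s at 15x.
print(f"decode: {scaled_latency(60.0, DECODE_SPEEDUP):.1f} s")
```

These are straight divisions of vendor-supplied numbers, so they inherit the same caveat as the claims themselves: they hold only if the speedups are confirmed independently.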
The native multimodal training approach contrasts with the common industry practice of adapting text-only models to handle visual inputs. This choice suggests MiniMax is betting on deeper semantic integration across modalities, though real-world comparisons with models like GPT-4o or Gemini 1.5 Pro will determine whether it delivers measurable advantages. The emphasis on agentic capabilities and coding performance positions M3 as a competitor in the autonomous agent and development tools market.