Xiaomi releases MiMo-V2.5: 310B parameter omnimodal model with 1M token context window
Xiaomi released MiMo-V2.5, a 310B total parameter sparse mixture-of-experts model that activates 15B parameters per token. The omnimodal model supports text, image, video, and audio understanding with a 1M token context window and was trained on 48T tokens using FP8 mixed precision.
Architecture and specifications
MiMo-V2.5 uses a sparse MoE architecture with 256 routed experts, activating 8 experts per token. The model consists of 48 layers total: 1 dense layer and 47 MoE layers, with 39 using sliding window attention (SWA) and 9 using full attention.
Key specifications:
- Total parameters: 310B (15B activated per forward pass)
- Context window: Up to 1M tokens
- Hidden size: 4096
- Attention heads: 64 query heads, with 8 KV heads in full-attention layers and 4 KV heads in sliding-window-attention layers
- Sliding window size: 128 tokens
- Training data: ~48T tokens using FP8 mixed precision
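To make the sparsity concrete, here is a minimal top-k routing sketch using the published figures (256 routed experts, 8 activated per token, hidden size 4096). It illustrates standard MoE gating, not Xiaomi's implementation; the scoring function, normalization, shared experts, and load-balancing details are all assumptions.

```python
import torch
import torch.nn.functional as F

# Figures taken from the release; everything else is illustrative.
NUM_EXPERTS = 256   # routed experts per MoE layer
TOP_K = 8           # experts activated per token
HIDDEN_SIZE = 4096  # model hidden size

def route_tokens(hidden: torch.Tensor, gate_weight: torch.Tensor):
    """Generic top-k MoE routing: score every expert, keep the top 8,
    and renormalize their weights.
    hidden: [tokens, hidden], gate_weight: [num_experts, hidden]."""
    logits = hidden @ gate_weight.t()                # [tokens, num_experts]
    probs = F.softmax(logits, dim=-1)
    weights, expert_ids = probs.topk(TOP_K, dim=-1)  # pick 8 of 256
    weights = weights / weights.sum(dim=-1, keepdim=True)
    return expert_ids, weights                       # which experts, with what weight

# Example: route a batch of 4 tokens through a randomly initialized gate.
gate = torch.randn(NUM_EXPERTS, HIDDEN_SIZE) * 0.02
ids, w = route_tokens(torch.randn(4, HIDDEN_SIZE), gate)
print(ids.shape, w.shape)  # torch.Size([4, 8]) torch.Size([4, 8])
```

Only the selected experts' feed-forward blocks run for a given token, which is how a 310B-parameter model ends up activating roughly 15B parameters per forward pass.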
Multimodal encoders
The model includes dedicated encoders for vision and audio:
Vision encoder: 729M-parameter Vision Transformer (ViT) with 28 layers, 24 using sliding window attention and 4 using full attention. The window-attention layers use a hybrid pattern that alternates between 1-D row windows and 1-D column windows, each with a 64-token sliding window.
Audio encoder: 261M-parameter Audio Transformer with 24 layers, split evenly between 12 sliding window attention layers and 12 full attention layers. The encoder was initialized from MiMo-Audio-Tokenizer weights and fine-tuned for audio understanding.
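The "1-D row and column windows" description suggests axial sliding-window attention over the flattened patch grid. The sketch below is one possible reading, not the released code: it builds a boolean mask for a 64-token window along either image rows or image columns, and the alternation between the two axes across layers is an assumption.

```python
import numpy as np

def axial_sw_mask(grid_h: int, grid_w: int, axis: str = "row", window: int = 64):
    """Boolean attention mask for 1-D sliding-window attention along one image
    axis. Patches are flattened row-major; token i may attend to token j only
    if both lie on the same row (axis="row") or column (axis="col") and they
    are within `window` positions of each other along that axis."""
    n = grid_h * grid_w
    rows, cols = np.divmod(np.arange(n), grid_w)
    line, pos = (rows, cols) if axis == "row" else (cols, rows)
    same_line = line[:, None] == line[None, :]
    near = np.abs(pos[:, None] - pos[None, :]) < window
    return same_line & near

# Assumed pattern: alternate row and column windows across the 24 SWA layers.
masks = [axial_sw_mask(48, 48, axis="row" if i % 2 == 0 else "col")
         for i in range(2)]
print(masks[0].shape)  # (2304, 2304) for a 48x48 patch grid
```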
Inference optimization
According to Xiaomi, the hybrid attention architecture reduces KV-cache storage by nearly 6× compared to full attention models while maintaining long-context performance through learnable attention sink bias. The model includes three multi-token prediction (MTP) modules totaling 329M parameters that enable speculative decoding for faster inference.
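A rough back-of-the-envelope check of that claim, using only the layer and KV-head counts listed above. It compares cached token slots rather than bytes (head dimension and dtype are not specified here), and it assumes an all-full-attention baseline with the same 8 KV heads per layer, so treat the result as an order-of-magnitude sanity check rather than a reproduction of Xiaomi's figure.

```python
CONTEXT = 1_000_000            # tokens in the window
SW_WINDOW = 128                # sliding window size
FULL_LAYERS, SWA_LAYERS = 9, 39
KV_HEADS_FULL, KV_HEADS_SWA = 8, 4

# Hypothetical baseline: every one of the 48 layers caches the whole context.
baseline = (FULL_LAYERS + SWA_LAYERS) * KV_HEADS_FULL * CONTEXT

# Hybrid layout: SWA layers only keep the most recent 128 tokens per KV head.
hybrid = (FULL_LAYERS * KV_HEADS_FULL * CONTEXT
          + SWA_LAYERS * KV_HEADS_SWA * SW_WINDOW)

print(f"KV-cache reduction ~ {baseline / hybrid:.1f}x")  # ~5.3x under these assumptions
```

That lands in the same ballpark as the stated "nearly 6×"; the exact factor depends on baseline assumptions Xiaomi has not spelled out in the release.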
Training methodology
Xiaomi trained MiMo-V2.5 in five stages: text pre-training, projector warmup, multimodal pre-training, supervised fine-tuning with agentic data, and reinforcement learning with Multi-Teacher On-Policy Distillation (MOPD). The context window was progressively extended from 32K to 256K to 1M tokens during post-training.
Benchmark performance
Xiaomi claims the model achieved 56.1 on SWE-Bench Pro and 65.8 on Terminal-Bench 2. The company provides additional benchmark results across multimodal, coding, agent, and long-context tasks on the model card, though specific scores for many benchmarks were not disclosed in the release.
Availability
The model is available on Hugging Face in two variants: MiMo-V2.5-Base with 256K context and MiMo-V2.5 with 1M context. Xiaomi recommends deploying with SGLang or vLLM inference engines using FP8 quantization. Pricing for API access was not disclosed.
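For local experimentation, a minimal vLLM sketch along the lines of that recommendation might look like the following. The Hugging Face repository ID is a placeholder, the parallelism and context settings depend on your hardware, and this assumes vLLM support for the architecture as Xiaomi recommends.

```python
from vllm import LLM, SamplingParams

# Placeholder repo ID; substitute the actual MiMo-V2.5 Hugging Face identifier.
llm = LLM(
    model="XiaomiMiMo/MiMo-V2.5",
    quantization="fp8",       # FP8 deployment as recommended by Xiaomi
    max_model_len=262_144,    # trim the context to fit available KV-cache memory
    tensor_parallel_size=8,   # adjust to your GPU count
    trust_remote_code=True,
)

outputs = llm.generate(
    ["Summarize the MiMo-V2.5 architecture in two sentences."],
    SamplingParams(temperature=0.6, max_tokens=256),
)
print(outputs[0].outputs[0].text)
```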
What this means
MiMo-V2.5 represents Xiaomi's push into large-scale multimodal AI, competing directly with models like GPT-4o and Claude 3.5 Sonnet in the omnimodal space. The 1M token context window and sparse MoE architecture position it for long-document and agentic workflows, though the lack of disclosed pricing makes it difficult to assess commercial viability. The hybrid attention mechanism's claimed 6× reduction in KV-cache could prove significant for deployment costs if validated by independent benchmarks.
Related Articles
Xiaomi Releases MiMo-V2.5-Pro: 1.02T Parameter MoE Model with 1M Context Window
Xiaomi has released MiMo-V2.5-Pro, an open-source Mixture-of-Experts model with 1.02 trillion total parameters and 42 billion active parameters. The model supports up to 1 million tokens context length and claims 99.6% on GSM8K and 86.2% on MATH benchmarks.
Alibaba's Qwen Team Releases Qwen3.6 27B With 262K Context Window and Video Processing
Alibaba's Qwen Team has released Qwen3.6 27B, a 27-billion parameter multimodal language model with a 262,144-token context window. The model accepts text, image, and video inputs and includes a built-in thinking mode for extended reasoning, with pricing at $0.195 per million input tokens and $1.56 per million output tokens.
DeepSeek Releases V4-Flash: 284B-Parameter MoE Model With 1M Token Context at 27% Inference Cost
DeepSeek released two Mixture-of-Experts models: V4-Flash with 284B total parameters (13B activated) and V4-Pro with 1.6T parameters (49B activated). Both models support one million token context windows and use a hybrid attention architecture that requires only 27% of the inference FLOPs compared to DeepSeek-V3.2 at 1M token context.
DeepSeek Releases V4-Pro: 1.6T Parameter MoE Model with 1M Token Context
DeepSeek released two new Mixture-of-Experts models: DeepSeek-V4-Pro with 1.6 trillion parameters (49B activated) and DeepSeek-V4-Flash with 284B parameters (13B activated), both supporting one million token context length. The models achieve 27% of inference FLOPs and 10% of KV cache compared to DeepSeek-V3.2 at 1M context through a hybrid attention architecture combining Compressed Sparse Attention and Heavily Compressed Attention.