Xiaomi releases MiMo-V2.5: 310B parameter omnimodal model with 1M token context window
Xiaomi released MiMo-V2.5, a sparse mixture-of-experts (MoE) model with 310B total parameters, of which 15B are activated per token. The omnimodal model supports text, image, video, and audio understanding with a 1M token context window and was trained on about 48T tokens using FP8 mixed precision.
Architecture and specifications
MiMo-V2.5 uses a sparse MoE architecture with 256 routed experts, 8 of which are activated per token. The model has 48 layers: 1 dense layer and 47 MoE layers. Across all 48 layers, 39 use sliding window attention (SWA) and 9 use full attention.
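To make the "8 of 256 experts per token" idea concrete, here is a minimal top-k routing sketch in plain Python. It is illustrative only: the article does not describe MiMo-V2.5's gating function, so the softmax over the selected logits, the random logits, and the name route_token are all assumptions.

```python
import math
import random

NUM_EXPERTS = 256  # routed experts per MoE layer (from the article)
TOP_K = 8          # experts activated per token (from the article)


def route_token(router_logits, top_k=TOP_K):
    """Pick top_k experts for one token and return (expert_id, weight) pairs.

    Assumption: weights are a softmax over the selected logits only. The
    real gating function used by the model is not given in the article.
    """
    # Indices of the top_k largest router logits, highest first.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:top_k]
    # Numerically stable softmax restricted to the selected experts.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Hypothetical router output for a single token.
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
selected = route_token(logits)
print(len(selected), round(sum(w for _, w in selected), 6))
```

Only the selected experts run for that token, which is why the activated parameter count (15B) is a small fraction of the total (310B).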
Key specifications:
- Total parameters: 310B (15B activated per token)
- Context window: Up to 1M tokens
- Hidden size: 4096
- Attention heads: 64 query heads, with 8 KV heads in global attention layers and 4 KV heads in sliding window attention layers
- Sliding window size: 128 tokens
- Training data: ~48T tokens using FP8 mixed precision
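As a rough illustration of what a 128-token sliding window means for a single query position, the sketch below lists which key positions stay visible. It assumes causal attention and that the window counts the query's own position; the article does not state either detail, and visible_positions is a hypothetical helper.

```python
WINDOW = 128  # sliding window size from the spec list


def visible_positions(query_pos, window=WINDOW):
    """Return the range of key positions a query at query_pos may attend to.

    Assumption: causal attention where the window includes the query's own
    position, so a query sees at most `window` keys.
    """
    start = max(0, query_pos - window + 1)
    return range(start, query_pos + 1)


for pos in (0, 127, 128, 1000):
    keys = visible_positions(pos)
    # Print: query position, first visible key, last visible key, key count.
    print(pos, keys.start, keys.stop - 1, len(keys))
```

Once a sequence is longer than the window, each sliding-window layer attends to a fixed number of keys regardless of sequence length, while the full attention layers still see the whole context.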
Multimodal encoders
The model includes dedicated encoders for vision and audio:
Vision encoder: 729M-parameter Vision Transformer (ViT) with 28 layers, 24 using sliding window attention and 4 using full attention. The encoder employs a hybrid window-attention pattern that alternates between 1-D row windows and 1-D column windows, with 64-token sliding windows.
Audio encoder: 261M-parameter Audio Transformer with 24 layers, split evenly between 12 sliding window attention layers and 12 full attention layers. The encoder was initialized from MiMo-Audio-Tokenizer weights and fine-tuned for audio understanding.
Inference optimization
According to Xiaomi, the hybrid attention architecture reduces KV-cache storage by nearly 6× compared to full attention models, while a learnable attention sink bias maintains long-context performance. The model also includes three multi-token prediction (MTP) modules, totaling 329M parameters, that enable speculative decoding for faster inference.
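A back-of-the-envelope sketch shows why mixing sliding window and full attention shrinks the KV cache. It counts cached (token, KV head) pairs using the layer and head counts reported in the article, and it assumes an equal head dimension and data type in every layer (so they cancel in the ratio) and a hypothetical baseline in which all 48 layers use full attention with 8 KV heads. Xiaomi's own accounting is not described, so this is not a reproduction of its figure.

```python
FULL_LAYERS = 9     # full attention layers (from the article)
SWA_LAYERS = 39     # sliding window layers (from the article)
FULL_KV_HEADS = 8   # KV heads in full attention layers (from the article)
SWA_KV_HEADS = 4    # KV heads in sliding window layers (from the article)
WINDOW = 128        # sliding window size in tokens (from the article)


def kv_entries_hybrid(seq_len):
    """Cached (token, head) pairs for one sequence in the hybrid layout."""
    full = FULL_LAYERS * FULL_KV_HEADS * seq_len
    # Sliding window layers never keep more than WINDOW tokens.
    swa = SWA_LAYERS * SWA_KV_HEADS * min(seq_len, WINDOW)
    return full + swa


def kv_entries_all_full(seq_len):
    """Hypothetical baseline: every layer is full attention with 8 KV heads."""
    return (FULL_LAYERS + SWA_LAYERS) * FULL_KV_HEADS * seq_len


for n in (4_096, 131_072, 1_000_000):
    ratio = kv_entries_all_full(n) / kv_entries_hybrid(n)
    print(f"{n:>9} tokens: {ratio:.2f}x smaller")
```

Under these assumptions the ratio tends toward 48/9, or about 5.3×, as sequences grow. That is in the same range as the "nearly 6×" figure Xiaomi reports, and a different baseline or accounting could explain the gap.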
Training methodology
Xiaomi trained MiMo-V2.5 in five stages: text pre-training, projector warmup, multimodal pre-training, supervised fine-tuning with agentic data, and reinforcement learning with Multi-Teacher On-Policy Distillation (MOPD). The context window was progressively extended from 32K to 256K to 1M tokens during post-training.
Benchmark performance
Xiaomi reports scores of 56.1 on SWE-Bench Pro and 65.8 on Terminal-Bench 2. The model card lists further results across multimodal, coding, agent, and long-context tasks, though the release did not disclose specific scores for many of those benchmarks.
Availability
The model is available on Hugging Face in two variants: MiMo-V2.5-Base with a 256K context window and MiMo-V2.5 with a 1M context window. Xiaomi recommends deploying with the SGLang or vLLM inference engines using FP8 quantization. Pricing for API access was not disclosed.
What this means
MiMo-V2.5 represents Xiaomi's push into large-scale multimodal AI, competing with models like GPT-4o and Claude 3.5 Sonnet in the multimodal space. The 1M token context window and sparse MoE architecture position it for long-document and agentic workflows, though the lack of disclosed pricing makes its commercial viability difficult to assess. The claimed reduction of nearly 6× in KV-cache storage from the hybrid attention mechanism could prove significant for deployment costs if independent benchmarks confirm it.