Xiaomi releases MiMo-V2.5: 310B parameter omnimodal model with 1M token context window
Xiaomi released MiMo-V2.5, a 310B total parameter sparse mixture-of-experts model that activates 15B parameters per token. The omnimodal model supports text, image, video, and audio understanding with a 1M token context window and was trained on 48T tokens using FP8 mixed precision.
Architecture and specifications
MiMo-V2.5 uses a sparse MoE architecture with 256 routed experts, activating 8 experts per token. The model consists of 48 layers total: 1 dense layer and 47 MoE layers, with 39 using sliding window attention (SWA) and 9 using full attention.
Key specifications:
- Total parameters: 310B (15B activated per forward pass)
- Context window: Up to 1M tokens
- Hidden size: 4096
- Attention heads: 64 query heads, with 8 KV heads in full-attention layers and 4 KV heads in sliding-window-attention layers
- Sliding window size: 128 tokens
- Training data: ~48T tokens using FP8 mixed precision
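To make the sparsity concrete, here is a minimal top-k routing sketch using the published figures (256 routed experts, 8 activated per token, hidden size 4096). It illustrates standard MoE gating, not Xiaomi's implementation; the scoring function, normalization, shared experts, and load-balancing details are all assumptions.

```python
import torch
import torch.nn.functional as F

# Figures taken from the release; everything else is illustrative.
NUM_EXPERTS = 256   # routed experts per MoE layer
TOP_K = 8           # experts activated per token
HIDDEN_SIZE = 4096  # model hidden size

def route_tokens(hidden: torch.Tensor, gate_weight: torch.Tensor):
    """Generic top-k MoE routing: score every expert, keep the top 8,
    and renormalize their weights.
    hidden: [tokens, hidden], gate_weight: [num_experts, hidden]."""
    logits = hidden @ gate_weight.t()                # [tokens, num_experts]
    probs = F.softmax(logits, dim=-1)
    weights, expert_ids = probs.topk(TOP_K, dim=-1)  # pick 8 of 256
    weights = weights / weights.sum(dim=-1, keepdim=True)
    return expert_ids, weights                       # which experts, with what weight

# Example: route a batch of 4 tokens through a randomly initialized gate.
gate = torch.randn(NUM_EXPERTS, HIDDEN_SIZE) * 0.02
ids, w = route_tokens(torch.randn(4, HIDDEN_SIZE), gate)
print(ids.shape, w.shape)  # torch.Size([4, 8]) torch.Size([4, 8])
```

Only the selected experts' feed-forward blocks run for a given token, which is how a 310B-parameter model ends up activating roughly 15B parameters per forward pass.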
Multimodal encoders
The model includes dedicated encoders for vision and audio:
Vision encoder: 729M-parameter Vision Transformer (ViT) with 28 layers, 24 using sliding window attention and 4 using full attention. The window-attention layers use a hybrid pattern that alternates between 1-D row windows and 1-D column windows, each with a 64-token sliding window.
Audio encoder: 261M-parameter Audio Transformer with 24 layers, split evenly between 12 sliding window attention layers and 12 full attention layers. The encoder was initialized from MiMo-Audio-Tokenizer weights and fine-tuned for audio understanding.
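The "1-D row and column windows" description suggests axial sliding-window attention over the flattened patch grid. The sketch below is one possible reading, not the released code: it builds a boolean mask for a 64-token window along either image rows or image columns, and the alternation between the two axes across layers is an assumption.

```python
import numpy as np

def axial_sw_mask(grid_h: int, grid_w: int, axis: str = "row", window: int = 64):
    """Boolean attention mask for 1-D sliding-window attention along one image
    axis. Patches are flattened row-major; token i may attend to token j only
    if both lie on the same row (axis="row") or column (axis="col") and they
    are within `window` positions of each other along that axis."""
    n = grid_h * grid_w
    rows, cols = np.divmod(np.arange(n), grid_w)
    line, pos = (rows, cols) if axis == "row" else (cols, rows)
    same_line = line[:, None] == line[None, :]
    near = np.abs(pos[:, None] - pos[None, :]) < window
    return same_line & near

# Assumed pattern: alternate row and column windows across the 24 SWA layers.
masks = [axial_sw_mask(48, 48, axis="row" if i % 2 == 0 else "col")
         for i in range(2)]
print(masks[0].shape)  # (2304, 2304) for a 48x48 patch grid
```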
Inference optimization
According to Xiaomi, the hybrid attention architecture reduces KV-cache storage by nearly 6× compared to full attention models while maintaining long-context performance through learnable attention sink bias. The model includes three multi-token prediction (MTP) modules totaling 329M parameters that enable speculative decoding for faster inference.
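A rough back-of-the-envelope check of that claim, using only the layer and KV-head counts listed above. It compares cached token slots rather than bytes (head dimension and dtype are not specified here), and it assumes an all-full-attention baseline with the same 8 KV heads per layer, so treat the result as an order-of-magnitude sanity check rather than a reproduction of Xiaomi's figure.

```python
CONTEXT = 1_000_000            # tokens in the window
SW_WINDOW = 128                # sliding window size
FULL_LAYERS, SWA_LAYERS = 9, 39
KV_HEADS_FULL, KV_HEADS_SWA = 8, 4

# Hypothetical baseline: every one of the 48 layers caches the whole context.
baseline = (FULL_LAYERS + SWA_LAYERS) * KV_HEADS_FULL * CONTEXT

# Hybrid layout: SWA layers only keep the most recent 128 tokens per KV head.
hybrid = (FULL_LAYERS * KV_HEADS_FULL * CONTEXT
          + SWA_LAYERS * KV_HEADS_SWA * SW_WINDOW)

print(f"KV-cache reduction ~ {baseline / hybrid:.1f}x")  # ~5.3x under these assumptions
```

That lands in the same ballpark as the stated "nearly 6×"; the exact factor depends on baseline assumptions Xiaomi has not spelled out in the release.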
Training methodology
Xiaomi trained MiMo-V2.5 in five stages: text pre-training, projector warmup, multimodal pre-training, supervised fine-tuning with agentic data, and reinforcement learning with Multi-Teacher On-Policy Distillation (MOPD). The context window was progressively extended from 32K to 256K to 1M tokens during post-training.
Benchmark performance
Xiaomi claims the model achieved 56.1 on SWE-Bench Pro and 65.8 on Terminal-Bench 2. The company provides additional benchmark results across multimodal, coding, agent, and long-context tasks on the model card, though specific scores for many benchmarks were not disclosed in the release.
Availability
The model is available on Hugging Face in two variants: MiMo-V2.5-Base with 256K context and MiMo-V2.5 with 1M context. Xiaomi recommends deploying with SGLang or vLLM inference engines using FP8 quantization. Pricing for API access was not disclosed.
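For local experimentation, a minimal vLLM sketch along the lines of that recommendation might look like the following. The Hugging Face repository ID is a placeholder, the parallelism and context settings depend on your hardware, and this assumes vLLM support for the architecture as Xiaomi recommends.

```python
from vllm import LLM, SamplingParams

# Placeholder repo ID; substitute the actual MiMo-V2.5 Hugging Face identifier.
llm = LLM(
    model="XiaomiMiMo/MiMo-V2.5",
    quantization="fp8",       # FP8 deployment as recommended by Xiaomi
    max_model_len=262_144,    # trim the context to fit available KV-cache memory
    tensor_parallel_size=8,   # adjust to your GPU count
    trust_remote_code=True,
)

outputs = llm.generate(
    ["Summarize the MiMo-V2.5 architecture in two sentences."],
    SamplingParams(temperature=0.6, max_tokens=256),
)
print(outputs[0].outputs[0].text)
```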
What this means
MiMo-V2.5 represents Xiaomi's push into large-scale multimodal AI, competing directly with models like GPT-4o and Claude 3.5 Sonnet in the omnimodal space. The 1M token context window and sparse MoE architecture position it for long-document and agentic workflows, though the lack of disclosed pricing makes it difficult to assess commercial viability. The hybrid attention mechanism's claimed 6× reduction in KV-cache could prove significant for deployment costs if validated by independent benchmarks.
Related Articles
Xiaomi Releases MiMo-V2.5-Pro: 1.02T Parameter MoE Model with 1M Context Window
Xiaomi has released MiMo-V2.5-Pro, an open-source Mixture-of-Experts model with 1.02 trillion total parameters and 42 billion active parameters. The model supports up to 1 million tokens context length and claims 99.6% on GSM8K and 86.2% on MATH benchmarks.
Alibaba's Qwen Team Releases Qwen3.6 27B With 262K Context Window and Video Processing
Alibaba's Qwen Team has released Qwen3.6 27B, a 27-billion parameter multimodal language model with a 262,144-token context window. The model accepts text, image, and video inputs and includes a built-in thinking mode for extended reasoning, with pricing at $0.195 per million input tokens and $1.56 per million output tokens.
DeepSeek Releases V4-Flash: 284B-Parameter MoE Model With 1M Token Context at 27% Inference Cost
DeepSeek released two Mixture-of-Experts models: V4-Flash with 284B total parameters (13B activated) and V4-Pro with 1.6T parameters (49B activated). Both models support one million token context windows and use a hybrid attention architecture that requires only 27% of the inference FLOPs compared to DeepSeek-V3.2 at 1M token context.
DeepSeek Releases V4-Pro: 1.6T Parameter MoE Model with 1M Token Context
DeepSeek released two new Mixture-of-Experts models: DeepSeek-V4-Pro with 1.6 trillion parameters (49B activated) and DeepSeek-V4-Flash with 284B parameters (13B activated), both supporting one million token context length. The models achieve 27% of inference FLOPs and 10% of KV cache compared to DeepSeek-V3.2 at 1M context through a hybrid attention architecture combining Compressed Sparse Attention and Heavily Compressed Attention.