model releaseXiaomi

Xiaomi releases MiMo-V2.5: 310B parameter omnimodal model with 1M token context window

TL;DR

Xiaomi released MiMo-V2.5, a 310B total parameter sparse mixture-of-experts model that activates 15B parameters per token. The omnimodal model supports text, image, video, and audio understanding with a 1M token context window and was trained on 48T tokens using FP8 mixed precision.

2 min read
0

Xiaomi releases MiMo-V2.5: 310B parameter omnimodal model with 1M token context window

Xiaomi released MiMo-V2.5, a 310B total parameter sparse mixture-of-experts (MoE) model that activates 15B parameters per token. The omnimodal model supports text, image, video, and audio understanding with a 1M token context window.

Architecture and specifications

MiMo-V2.5 uses a sparse MoE architecture with 256 routed experts, activating 8 experts per token. The model consists of 48 layers total: 1 dense layer and 47 MoE layers, with 39 using sliding window attention (SWA) and 9 using full attention.

Key specifications:

  • Total parameters: 310B (15B activated per forward pass)
  • Context window: Up to 1M tokens
  • Hidden size: 4096
  • Attention heads: 64 query heads, split between 8 KV heads for global attention and 4 for sliding window attention
  • Sliding window size: 128 tokens
  • Training data: ~48T tokens using FP8 mixed precision

Multimodal encoders

The model includes dedicated encoders for vision and audio:

Vision encoder: 729M-parameter Vision Transformer (ViT) with 28 layers—24 using sliding window attention and 4 using full attention. The encoder employs a hybrid window-attention pattern alternating between 1-D row and column windows with 64-token sliding windows.

Audio encoder: 261M-parameter Audio Transformer with 24 layers, split evenly between 12 sliding window attention layers and 12 full attention layers. The encoder was initialized from MiMo-Audio-Tokenizer weights and fine-tuned for audio understanding.

Inference optimization

According to Xiaomi, the hybrid attention architecture reduces KV-cache storage by nearly 6× compared to full attention models while maintaining long-context performance through learnable attention sink bias. The model includes three multi-token prediction (MTP) modules totaling 329M parameters that enable speculative decoding for faster inference.

Training methodology

Xiaomi trained MiMo-V2.5 in five stages: text pre-training, projector warmup, multimodal pre-training, supervised fine-tuning with agentic data, and reinforcement learning with Multi-Teacher On-Policy Distillation (MOPD). The context window was progressively extended from 32K to 256K to 1M tokens during post-training.

Benchmark performance

Xiaomi claims the model achieved 56.1 on SWE Bench Pro and 65.8 on Terminalbench 2. The company provides additional benchmark results across multimodal, coding, agent, and long-context tasks on the model card, though specific scores for many benchmarks were not disclosed in the release.

Availability

The model is available on Hugging Face in two variants: MiMo-V2.5-Base with 256K context and MiMo-V2.5 with 1M context. Xiaomi recommends deploying with SGLang or vLLM inference engines using FP8 quantization. Pricing for API access was not disclosed.

What this means

MiMo-V2.5 represents Xiaomi's push into large-scale multimodal AI, competing directly with models like GPT-4o and Claude 3.5 Sonnet in the omnimodal space. The 1M token context window and sparse MoE architecture position it for long-document and agentic workflows, though the lack of disclosed pricing makes it difficult to assess commercial viability. The hybrid attention mechanism's claimed 6× reduction in KV-cache could prove significant for deployment costs if validated by independent benchmarks.

Related Articles

model release

Google DeepMind releases Gemma 4 12B: encoder-free multimodal model runs on 16GB RAM

Google DeepMind has released Gemma 4 12B, a 12-billion parameter multimodal model that runs locally on laptops with 16GB of RAM. The model eliminates separate vision and audio encoders, processing raw inputs directly through its language model backbone under an Apache 2.0 license.

model release

Apple releases AFM 3 lineup: 20B-parameter on-device model and cloud AI running on Google's Nvidia infrastructure

Apple announced five third-generation foundation models at WWDC26, headlined by AFM 3 Core Advanced—a 20-billion-parameter sparse model that runs on-device by activating only 1-4 billion parameters at a time. For the first time, Apple extended Private Cloud Compute to third-party infrastructure, with AFM 3 Cloud Pro running on Nvidia GPUs in Google Cloud.

model release

Google DeepMind releases DiffusionGemma, a 26B parameter model generating 15-20 tokens per forward pass via discrete dif

Google DeepMind released DiffusionGemma, a 26B parameter mixture-of-experts model that generates text using discrete diffusion instead of autoregression. The model processes blocks of 256 tokens in parallel, achieving generation speeds exceeding 1100 tokens per second on H100 GPUs in low-batch settings.

model release

Anthropic releases Fable 5, bringing capabilities of restricted Mythos model to public with $10/$50 per 1M token pricing

Anthropic has released Fable 5, making capabilities from its previously restricted Mythos model available to the public. The company claims Fable 5 beats GPT-5.5, Gemini 3.1 Pro, and its own Opus 4.8 in internal testing, with pricing set at $10 per million input tokens and $50 per million output tokens after a free trial period ending June 22.

Comments

Loading...