Xiaomi Releases MiMo-V2.5-Pro: 1.02T Parameter MoE Model with 1M Context Window
Xiaomi has released MiMo-V2.5-Pro, an open-source Mixture-of-Experts model with 1.02 trillion total parameters and 42 billion active parameters. The model supports a context length of up to 1 million tokens, and Xiaomi reports scores of 99.6% on GSM8K and 86.2% on MATH.
Xiaomi has released MiMo-V2.5-Pro, an open-source Mixture-of-Experts (MoE) language model with 1.02 trillion total parameters and 42 billion active parameters. The model is available on Hugging Face with FP8 mixed precision and supports context windows up to 1 million tokens.
Architecture and Specifications
MiMo-V2.5-Pro uses a hybrid attention architecture that interleaves Sliding Window Attention (SWA) and Global Attention (GA) in a 6:1 ratio with a 128-token sliding window. According to Xiaomi, this approach reduces KV-cache storage by approximately 7x compared to traditional full attention.
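The 7x figure can be sanity-checked with simple arithmetic. The sketch below is not Xiaomi's calculation; it assumes every layer stores the same number of KV bytes per cached token, so that only the count of cached token positions matters. The function names are illustrative.

```python
def kv_cache_tokens(seq_len, swa_layers=6, ga_layers=1, window=128):
    """Cached token positions across one 6:1 block of SWA and GA layers."""
    swa = swa_layers * min(seq_len, window)  # SWA keeps at most `window` tokens
    ga = ga_layers * seq_len                 # global attention keeps every token
    return swa + ga


def reduction_vs_full_attention(seq_len, swa_layers=6, ga_layers=1, window=128):
    """How many times smaller the hybrid cache is than an all-global cache."""
    full = (swa_layers + ga_layers) * seq_len
    return full / kv_cache_tokens(seq_len, swa_layers, ga_layers, window)


if __name__ == "__main__":
    for n in (1_024, 32_768, 1_000_000):
        print(n, round(reduction_vs_full_attention(n), 2))
```

Under these assumptions the saving grows with sequence length and approaches 7x only when the context is much longer than the 128-token window. At short contexts it is considerably smaller.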
The model comprises 70 layers (1 dense layer plus 69 MoE layers), with 10 full attention layers and 60 SWA layers. It routes tokens across 384 experts, activating 8 experts per token. The architecture includes 128 attention heads with grouped-query attention using 8 KV heads.
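To put these numbers in proportion, the following sketch derives a few ratios from the figures quoted above. The constant names are illustrative, and the ratios are plain arithmetic rather than anything Xiaomi has published.

```python
# Figures as reported in the article.
TOTAL_PARAMS = 1.02e12
ACTIVE_PARAMS = 42e9
NUM_EXPERTS = 384
EXPERTS_PER_TOKEN = 8
QUERY_HEADS = 128
KV_HEADS = 8
SWA_LAYERS = 60
GA_LAYERS = 10

# Share of all parameters used for any single token.
active_param_share = ACTIVE_PARAMS / TOTAL_PARAMS

# Share of experts selected per token in each MoE layer.
expert_share = EXPERTS_PER_TOKEN / NUM_EXPERTS

# Query heads sharing each KV head under grouped-query attention.
queries_per_kv_head = QUERY_HEADS // KV_HEADS

# Layer counts should agree with the stated 6:1 SWA-to-GA interleaving.
swa_to_ga_ratio = SWA_LAYERS / GA_LAYERS

if __name__ == "__main__":
    print(f"active parameter share: {active_param_share:.1%}")
    print(f"experts used per token: {expert_share:.1%}")
    print(f"query heads per KV head: {queries_per_kv_head}")
    print(f"SWA:GA layer ratio: {swa_to_ga_ratio:.0f}:1")
```

The active parameter share (about 4%) is higher than the expert share (about 2%), which is expected in an MoE model: attention weights, embeddings, and the dense layer run for every token regardless of routing.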
The model integrates three Multi-Token Prediction (MTP) modules using dense feedforward networks. Xiaomi claims this configuration triples output speed during inference.
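A rough way to reason about the speed claim is the standard speculative-decoding estimate. The sketch below is an assumption, not Xiaomi's method: it supposes each of the three MTP modules drafts one extra token, that a drafted token is accepted with a fixed probability only if all earlier drafts were accepted, and that a decoding step costs the same with or without drafting. The function name and the acceptance probabilities are hypothetical.

```python
def expected_tokens_per_step(num_draft_tokens, accept_prob):
    """Expected tokens emitted per decoding step under chained acceptance.

    The step always yields one token; draft i (1-based) is kept only if
    drafts 1..i were all accepted, which happens with probability p**i.
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be in [0, 1]")
    return sum(accept_prob ** i for i in range(num_draft_tokens + 1))


if __name__ == "__main__":
    for p in (0.5, 0.8, 0.9, 1.0):
        print(p, round(expected_tokens_per_step(3, p), 3))
```

In this simplified model, three draft tokens cap the gain at 4x, and a 3x gain would need a per-token acceptance rate of roughly 0.8. Real throughput also depends on the overhead of the MTP modules and on batch size, which this estimate ignores.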
Training Details
Xiaomi trained the base model on 27 trillion tokens using FP8 mixed precision with a native sequence length of 32,768 tokens. Post-training involved supervised fine-tuning, large-scale agentic reinforcement learning, and Multi-Teacher On-Policy Distillation (MOPD).
Benchmark Performance
On standard benchmarks, MiMo-V2.5-Pro achieved:
- GSM8K: 99.6% (8-shot)
- MATH: 86.2% (4-shot)
- MMLU: 89.4% (5-shot)
- MMLU-Pro: 68.5% (5-shot)
- HumanEval+: 75.6% (1-shot)
- GPQA-Diamond: 66.7% (5-shot)
- BBH: 88.4% (3-shot)
On long-context tasks, Xiaomi evaluated the model using OpenAI's GraphWalks benchmark. At 512K tokens, MiMo-V2.5-Pro scored 0.56 on breadth-first search and 0.92 on parent listing. At 1M tokens, scores dropped to 0.37 and 0.62 respectively. The company reports that the previous MiMo-V2-Pro model collapsed to 0.00 at 1M tokens on both tasks.
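The size of the drop between 512K and 1M tokens is easier to compare as a relative change. The short sketch below uses only the four scores quoted above; the dictionary layout and function name are illustrative.

```python
# GraphWalks scores as reported in the article: (512K tokens, 1M tokens).
SCORES = {
    "breadth-first search": (0.56, 0.37),
    "parent listing": (0.92, 0.62),
}


def relative_drop(before, after):
    """Fraction of the earlier score lost at the longer context."""
    return (before - after) / before


if __name__ == "__main__":
    for task, (at_512k, at_1m) in SCORES.items():
        print(f"{task}: {relative_drop(at_512k, at_1m):.1%} lower at 1M tokens")
```

Both tasks lose about a third of their 512K score when the context doubles to 1M tokens.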
For agentic tasks, the model scored 35.7% on SWE-Bench (AgentLess) in 3-shot evaluation and 39.6% on LiveCodeBench v6 in 1-shot.
Availability
Pricing information has not been disclosed. The model is available for download on Hugging Face. Xiaomi recommends deploying with SGLang or vLLM for optimal performance, with official deployment cookbooks available from both inference engine communities.
A base version (MiMo-V2.5-Pro-Base) with a 256K context window is also available.
What This Means
Xiaomi's entry into the trillion-parameter MoE space puts it in direct competition with models like DeepSeek-V4 (1.6T total parameters) and Kimi-K2 (1.04T total parameters). The 99.6% GSM8K score is among the highest reported for open models, though the company's claim of a 7x KV-cache reduction requires independent verification. The 1M context window matches recent long-context models, but the drop in GraphWalks scores between 512K and 1M tokens shows that performance at extreme lengths remains a challenge, as it does across the industry. With only 42B of its 1.02T parameters active per token, inference costs should be substantially lower than for dense models of similar capability.