model release
Xiaomi

Xiaomi Releases MiMo-V2.5-Pro: 1.02T Parameter MoE Model with 1M Context Window

TL;DR

Xiaomi has released MiMo-V2.5-Pro, an open-source Mixture-of-Experts model with 1.02 trillion total parameters and 42 billion active parameters. The model supports up to 1 million tokens context length and claims 99.6% on GSM8K and 86.2% on MATH benchmarks.

2 min read


Xiaomi has released MiMo-V2.5-Pro, an open-source Mixture-of-Experts (MoE) language model with 1.02 trillion total parameters and 42 billion active parameters. The model is available on Hugging Face with FP8 mixed precision and supports context windows up to 1 million tokens.

Architecture and Specifications

MiMo-V2.5-Pro uses a hybrid attention architecture that interleaves Sliding Window Attention (SWA) and Global Attention (GA) in a 6:1 ratio with a 128-token sliding window. According to Xiaomi, this approach reduces KV-cache storage by approximately 7x compared to traditional full attention.

The model comprises 70 layers (1 dense layer plus 69 MoE layers), with 10 full attention layers and 60 SWA layers. It routes tokens across 384 experts, activating 8 experts per token. The architecture includes 128 attention heads with grouped-query attention using 8 KV heads.
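As a rough sanity check on the claimed ~7x KV-cache reduction, the sketch below estimates KV-cache storage for the hybrid layout versus full attention, using the layer counts above. The per-head dimension and the 1-byte (FP8) cache assumption are illustrative, not published figures; the ratio itself does not depend on them.

```python
# Back-of-the-envelope KV-cache comparison: hybrid SWA/GA vs. full attention.
# Layer counts come from the published specs; HEAD_DIM and the FP8 (1 byte)
# cache dtype are illustrative assumptions, not confirmed values.

KV_HEADS = 8          # grouped-query attention KV heads
HEAD_DIM = 128        # assumed per-head dimension
BYTES = 1             # assumed FP8 KV cache (1 byte per value)
GLOBAL_LAYERS = 10    # full-attention layers
SWA_LAYERS = 60       # sliding-window layers
WINDOW = 128          # sliding-window size in tokens

def kv_bytes(context_len: int) -> tuple[int, int]:
    """Return (full-attention, hybrid) KV-cache sizes in bytes for one sequence."""
    per_token = 2 * KV_HEADS * HEAD_DIM * BYTES        # keys + values, per layer
    full = (GLOBAL_LAYERS + SWA_LAYERS) * context_len * per_token
    hybrid = (GLOBAL_LAYERS * context_len
              + SWA_LAYERS * min(context_len, WINDOW)) * per_token
    return full, hybrid

for ctx in (32_768, 131_072, 1_000_000):
    full, hybrid = kv_bytes(ctx)
    print(f"{ctx:>9} tokens: full {full/2**30:6.1f} GiB, "
          f"hybrid {hybrid/2**30:6.1f} GiB, ratio {full/hybrid:.1f}x")
```

At long contexts the sliding-window layers contribute almost nothing, so the ratio approaches 70/10 = 7, which lines up with Xiaomi's figure.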

The model integrates three Multi-Token Prediction (MTP) modules using dense feedforward networks. Xiaomi claims this configuration triples output speed during inference.
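Xiaomi has not detailed how the MTP heads are used at inference time, but the usual pattern is speculative self-drafting: the MTP modules propose a few future tokens and the main model verifies them in a single forward pass. The toy simulation below shows how the effective speedup depends on how many drafted tokens are accepted; the acceptance probabilities are illustrative, not measured values.

```python
# Toy model of speculative decoding with k drafted tokens per verification step.
# Assumes each drafted token is accepted independently with probability p and
# acceptance stops at the first rejection; the verifying pass always yields
# at least one token. Illustrative only -- not Xiaomi's actual scheme.

def tokens_per_forward(k: int, p: float) -> float:
    """Expected tokens emitted per main-model forward pass."""
    # Drafted token i (1-indexed) survives only if all earlier drafts were
    # accepted, i.e. with probability p**i.
    return 1.0 + sum(p**i for i in range(1, k + 1))

for p in (0.5, 0.7, 0.9):
    print(f"acceptance {p:.1f}: ~{tokens_per_forward(3, p):.2f} tokens per step")
```

A roughly 3x figure implies high acceptance rates for the three drafted tokens, and the realized gain also depends on the extra cost of running the MTP heads themselves.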

Training Details

Xiaomi trained the base model on 27 trillion tokens using FP8 mixed precision with a native sequence length of 32,768 tokens. Post-training involved supervised fine-tuning, large-scale agentic reinforcement learning, and Multi-Teacher On-Policy Distillation (MOPD).

Benchmark Performance

On standard benchmarks, MiMo-V2.5-Pro achieved:

  • GSM8K: 99.6% (8-shot)
  • MATH: 86.2% (4-shot)
  • MMLU: 89.4% (5-shot)
  • MMLU-Pro: 68.5% (5-shot)
  • HumanEval+: 75.6% (1-shot)
  • GPQA-Diamond: 66.7% (5-shot)
  • BBH: 88.4% (3-shot)

On long-context tasks, Xiaomi evaluated the model using OpenAI's GraphWalks benchmark. At 512K tokens, MiMo-V2.5-Pro scored 0.56 on breadth-first search and 0.92 on parent listing. At 1M tokens, scores dropped to 0.37 and 0.62 respectively. The company reports that the previous MiMo-V2-Pro model collapsed to 0.00 at 1M tokens on both tasks.

For agentic tasks, the model scored 35.7% on SWE-Bench (AgentLess) in 3-shot evaluation and 39.6% on LiveCodeBench v6 in 1-shot.

Availability

The model is available for download on Hugging Face; pricing information has not been disclosed. Xiaomi recommends deploying with SGLang or vLLM for optimal performance, with official deployment cookbooks available from both inference engine communities.
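Both SGLang and vLLM expose OpenAI-compatible HTTP servers, so once the model is served locally a client call can look like the sketch below. The endpoint URL, port, and the repo id `XiaomiMiMo/MiMo-V2.5-Pro` are assumptions; check the official cookbooks for the exact launch flags and model identifier.

```python
# Minimal client sketch against a locally served OpenAI-compatible endpoint
# (e.g. started with SGLang or vLLM). The base_url, port, and model id are
# assumptions; consult the official deployment cookbooks for exact values.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="XiaomiMiMo/MiMo-V2.5-Pro",  # hypothetical repo id
    messages=[{"role": "user", "content": "Summarize the MoE routing in two sentences."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```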

A base version (MiMo-V2.5-Pro-Base) with a 256K context window is also available.

What This Means

Xiaomi's entry into the trillion-parameter MoE space puts it in direct competition with models like DeepSeek-V4 (1.6T total parameters) and Kimi-K2 (1.04T total parameters). The 99.6% GSM8K score is among the highest reported for open models, though the company's claims about 7x KV-cache reduction require independent verification. The 1M context window matches recent long-context models, but the GraphWalks performance degradation at extreme lengths remains a challenge across the industry. At 42B active parameters, inference costs should be substantially lower than full dense models of similar capability.

Related Articles

model release

DeepSeek Releases V4-Flash: 284B-Parameter MoE Model With 1M Token Context at 27% Inference Cost

DeepSeek released two Mixture-of-Experts models: V4-Flash with 284B total parameters (13B activated) and V4-Pro with 1.6T parameters (49B activated). Both models support one million token context windows and use a hybrid attention architecture that requires only 27% of the inference FLOPs compared to DeepSeek-V3.2 at 1M token context.

model release

DeepSeek Releases V4-Pro: 1.6T Parameter MoE Model with 1M Token Context

DeepSeek released two new Mixture-of-Experts models: DeepSeek-V4-Pro with 1.6 trillion parameters (49B activated) and DeepSeek-V4-Flash with 284B parameters (13B activated), both supporting one million token context length. The models achieve 27% of inference FLOPs and 10% of KV cache compared to DeepSeek-V3.2 at 1M context through a hybrid attention architecture combining Compressed Sparse Attention and Heavily Compressed Attention.

model release

Alibaba Releases Qwen3.6 Max Preview: 1 Trillion Parameter MoE Model With 262K Context Window

Alibaba Cloud has released Qwen3.6 Max Preview, a proprietary frontier model built on sparse mixture-of-experts architecture with approximately 1 trillion total parameters. The model supports a 262,144-token context window and features integrated thinking mode for multi-turn reasoning, priced at $1.30 per million input tokens and $7.80 per million output tokens.

model release

Tencent Releases Hy3-Preview: 295B-Parameter MoE Model with 21B Active Parameters

Tencent has released Hy3-preview, a 295-billion-parameter Mixture-of-Experts model with 21 billion active parameters and a 256K context window. The model scores 76.28% on MATH and 34.86% on LiveCodeBench-v6, with particularly strong performance on coding agent tasks.
