model release

Xiaomi

Xiaomi launches MiMo-V2-Pro with 1T parameters, matches Claude Opus on coding at 80% lower cost

TL;DR

Xiaomi shipped three AI models simultaneously designed to form a complete agent platform. MiMo-V2-Pro, a 1-trillion-parameter Mixture-of-Experts model with 42 billion active parameters per request, scores 78% on SWE-bench Verified and 81 points on ClawEval—nearly matching Claude Opus 4.6 while costing $1 per million input tokens versus $5 for Opus.


MiMo-V2-Pro — Quick Specs

Context window: 1M tokens
Input: $1 / 1M tokens
Output: $3 / 1M tokens

Xiaomi has simultaneously released three AI models designed to power autonomous agents, robots, and voice applications, marking the company's push into full-stack agent infrastructure.

MiMo-V2-Pro: Trillion-Parameter LLM

The flagship MiMo-V2-Pro runs a Mixture-of-Experts architecture with over 1 trillion total parameters, of which 42 billion are active per request—roughly 3x the scale of its December 2025 predecessor, MiMo-V2-Flash. Despite the increased scale, a hybrid attention mechanism enables efficient processing with context windows up to 1 million tokens. The model generates multiple tokens simultaneously rather than predicting one at a time, delivering a speed advantage.
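The sparsity described above can be put in numbers. A quick sketch using the article's figures (total and active parameter counts are as stated; the predecessor's size is inferred from the "roughly 3x" claim):

```python
# Figures from the article; the Flash estimate is a back-of-envelope inference.
total_params = 1.0e12    # >1 trillion total parameters
active_params = 42e9     # 42 billion active per request
flash_total = total_params / 3  # MiMo-V2-Flash, if Pro is "roughly 3x" its scale

active_fraction = active_params / total_params
print(f"active fraction: {active_fraction:.1%}")       # only ~4% of weights used per request
print(f"estimated Flash size: {flash_total / 1e9:.0f}B total parameters")
```

The small active fraction is what lets a trillion-parameter model price like a much smaller one: per-request compute scales with the 42B active parameters, not the full expert pool.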

Performance Benchmarks:

  • SWE-bench Verified: 78% (Claude Opus 4.6: 80.8%, Claude Sonnet 4.6: 79.6%)
  • ClawEval (agent benchmark): 81 points, 3rd globally (Claude Opus 4.6: 81.5, GPT-5.2: 77)
  • PinchBench: 3rd globally, behind Claude Opus 4.6
  • Artificial Analysis Intelligence Index: 7th worldwide; third among Chinese models, behind GLM-5 and MiniMax-M2.7

Pricing: $1 per million input tokens, $3 per million output tokens (up to 256,000 context). Cache writing costs currently waived. For comparison, Claude Sonnet 4.6 costs $3/$15 and Claude Opus 4.6 costs $5/$25—making MiMo-V2-Pro 80% cheaper than Opus on input tokens.
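The price gap compounds on real workloads. A quick sketch using the article's per-million-token prices (the 10M-input / 2M-output job size is an arbitrary illustration):

```python
# (input, output) price per million tokens in USD, from the article.
PRICES = {
    "MiMo-V2-Pro":       (1.0, 3.0),
    "Claude Sonnet 4.6": (3.0, 15.0),
    "Claude Opus 4.6":   (5.0, 25.0),
}

def job_cost(model, input_mtok, output_mtok):
    """Cost in USD for a job measured in millions of tokens."""
    inp, outp = PRICES[model]
    return input_mtok * inp + output_mtok * outp

# Hypothetical agent run: 10M input tokens, 2M output tokens.
mimo = job_cost("MiMo-V2-Pro", 10, 2)      # 10*$1 + 2*$3  = $16
opus = job_cost("Claude Opus 4.6", 10, 2)  # 10*$5 + 2*$25 = $100
print(f"${mimo:.0f} vs ${opus:.0f} -> {1 - mimo / opus:.0%} cheaper")
```

On this input-heavy mix the total-job saving (84%) is even larger than the headline 80% input-token discount, since the output rate is discounted too.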

Before its official launch, MiMo-V2-Pro operated anonymously on OpenRouter under the codename "Hunter Alpha," where it topped daily rankings for several days and processed over 1 trillion tokens; many users initially assumed it was a new DeepSeek model.

MiMo-V2-Omni: Multimodal Agent Model

MiMo-V2-Omni integrates image, video, and audio encoders into a shared backbone. The model natively supports structured tool calls, function execution, and autonomous UI navigation.

Benchmark Performance:

  • MMMU-Pro (image): 76.8 (beats Claude Opus 4.6's 73.9)
  • Audio benchmarks: Beats Gemini 3 Pro; handles continuous recordings of 10+ hours
  • Video: Falls short of Gemini 3 Pro
  • ClawEval (agent tasks): 54.8 (Claude Opus 4.6: 66.3, GPT-5.2: 59.6)
  • MM-BrowserComp: Outperforms both Gemini 3 Pro and GPT-5.2 on web navigation

Demos showed the model autonomously browsing dashcam footage to flag hazards, navigating e-commerce sites (Xiaohongshu, JD.com), haggling with customer service, completing purchases, and creating and debugging multimedia content, all without human intervention. OpenClaw handles the actual clicks and file operations.

MiMo-V2-TTS: Emotional Speech Synthesis

MiMo-V2-TTS, trained on over 100 million hours of speech data, uses parallel discrete units for finer control over sound, rhythm, and emotion. Users describe desired voice characteristics in natural language ("sleepy, just woken up, slightly hoarse") rather than selecting from presets.

Key capabilities include natively synthesizing both speech and singing in one model, generating paralinguistic sounds (coughs, sighs, laughter) as part of output, and interpreting typographic cues (capitals, repeated characters) as emphasis signals.
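To make the typographic-cue idea concrete, here is a toy detector for the two signals the article names (all-caps words and stretched characters). This is purely illustrative preprocessing logic, not Xiaomi's actual model or API:

```python
import re

def emphasis_cues(text):
    """Toy detector for typographic emphasis cues: all-caps words
    and words with 3+ repeated characters ("sooooo"). Illustrative
    only -- not MiMo-V2-TTS's real text front end."""
    cues = []
    for word in re.findall(r"[A-Za-z]+", text):
        if word.isupper() and len(word) > 1:
            cues.append((word, "caps-emphasis"))
        elif re.search(r"(.)\1{2,}", word):
            cues.append((word, "stretched"))
    return cues

print(emphasis_cues("NO way, that is sooooo good"))
# [('NO', 'caps-emphasis'), ('sooooo', 'stretched')]
```

In the actual model these cues presumably condition prosody end to end rather than passing through an explicit tagging step like this.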

Strategic Positioning

Xiaomi has partnered with five agent frameworks (OpenClaw, OpenCode, KiloCode, Blackbox, Cline) and is offering free API access for one week. The company explicitly targets AI agent and robotics workloads; the MiMo team says its next priorities are long-horizon planning across hours and days, real-time streaming, multi-agent coordination, and robotics.

What this means

Xiaomi is pricing aggressively to dislodge Anthropic from its dominance of the agent and coding segment, with MiMo-V2-Pro demonstrating near-parity on benchmarks at significantly lower cost. However, meaningful gaps remain on general agent tasks (MiMo-V2-Omni's ClawEval score lags Claude Opus) and multimodal reasoning (video performance trails Gemini 3 Pro). The three-model release signals Xiaomi's commitment to an integrated agent platform built from specialized models, in contrast to Anthropic's strategy of concentrating multimodal capability in a single flagship model.

Related Articles

model release

Xiaomi releases MiMo-V2-Pro with 1M context window and 1T+ parameters

Xiaomi released MiMo-V2-Pro on March 18, 2026, a flagship foundation model with over 1 trillion total parameters and a 1,048,576 token context window. The model is priced at $1 per million input tokens and $3 per million output tokens, positioning it as an agent-focused system comparable to top-tier models.

model release

OpenAI releases GPT-4o mini with 128K context at $0.15/$0.60 per 1M tokens

OpenAI released GPT-4o mini on July 18, 2024, a compact multimodal model with 128,000 token context window priced at $0.15 per million input tokens and $0.60 per million output tokens. The model achieves 82% on MMLU and claims to rank higher than GPT-4 on chat preference leaderboards while costing 60% less than GPT-3.5 Turbo.

model release

Rakuten releases RakutenAI-3.0, 671B-parameter Japanese-optimized mixture-of-experts model

Rakuten Group has released RakutenAI-3.0, a 671 billion parameter mixture-of-experts (MoE) model designed specifically for Japanese language tasks. The model activates 37 billion parameters per token and supports a 128K context window. It is available under the Apache License 2.0 on Hugging Face.

model release

Nvidia releases Nemotron 3 Super: 120B MoE model with 1M token context

Nvidia has released Nemotron 3 Super, a 120-billion parameter hybrid Mamba-Transformer Mixture-of-Experts model that activates only 12 billion parameters during inference. The open-weight model features a 1-million token context window, multi-token prediction capabilities, and pricing at $0.10 per million input tokens and $0.50 per million output tokens.