model release

Xiaomi

Xiaomi launches MiMo-V2-Pro with 1T parameters, matches Claude Opus on coding at 80% lower cost

TL;DR

Xiaomi shipped three AI models simultaneously designed to form a complete agent platform. MiMo-V2-Pro, a 1-trillion-parameter Mixture-of-Experts model with 42 billion active parameters per request, scores 78% on SWE-bench Verified and 81 points on ClawEval—nearly matching Claude Opus 4.6 while costing $1 per million input tokens versus $5 for Opus.


MiMo-V2-Pro — Quick Specs

Context window: 1M tokens
Input: $1 / 1M tokens
Output: $3 / 1M tokens

Xiaomi has simultaneously released three AI models designed to power autonomous agents, robots, and voice applications, marking the company's push into full-stack agent infrastructure.

MiMo-V2-Pro: Trillion-Parameter LLM

The flagship MiMo-V2-Pro runs a Mixture-of-Experts architecture with over 1 trillion total parameters, of which 42 billion are active per request—roughly 3x the scale of its December 2025 predecessor, MiMo-V2-Flash. Despite the increased scale, a hybrid attention mechanism enables efficient processing with context windows up to 1 million tokens. The model generates multiple tokens simultaneously rather than predicting one at a time, delivering a speed advantage.
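The sparsity described above can be put in numbers. A quick sketch using the article's figures (total and active parameter counts are as stated; the predecessor's size is inferred from the "roughly 3x" claim):

```python
# Figures from the article; the Flash estimate is a back-of-envelope inference.
total_params = 1.0e12    # >1 trillion total parameters
active_params = 42e9     # 42 billion active per request
flash_total = total_params / 3  # MiMo-V2-Flash, if Pro is "roughly 3x" its scale

active_fraction = active_params / total_params
print(f"active fraction: {active_fraction:.1%}")       # only ~4% of weights used per request
print(f"estimated Flash size: {flash_total / 1e9:.0f}B total parameters")
```

The small active fraction is what lets a trillion-parameter model price like a much smaller one: per-request compute scales with the 42B active parameters, not the full expert pool.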

Performance Benchmarks:

  • SWE-bench Verified: 78% (Claude Opus 4.6: 80.8%, Claude Sonnet 4.6: 79.6%)
  • ClawEval (agent benchmark): 81 points, 3rd globally (Claude Opus 4.6: 81.5, GPT-5.2: 77)
  • PinchBench: 3rd globally, behind Claude Opus 4.6
  • Artificial Analysis Intelligence Index: 7th worldwide; third among Chinese models, behind GLM-5 and MiniMax-M2.7

Pricing: $1 per million input tokens, $3 per million output tokens (up to 256,000 context). Cache writing costs currently waived. For comparison, Claude Sonnet 4.6 costs $3/$15 and Claude Opus 4.6 costs $5/$25—making MiMo-V2-Pro 80% cheaper than Opus on input tokens.
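The price gap compounds on real workloads. A quick sketch using the article's per-million-token prices (the 10M-input / 2M-output job size is an arbitrary illustration):

```python
# (input, output) price per million tokens in USD, from the article.
PRICES = {
    "MiMo-V2-Pro":       (1.0, 3.0),
    "Claude Sonnet 4.6": (3.0, 15.0),
    "Claude Opus 4.6":   (5.0, 25.0),
}

def job_cost(model, input_mtok, output_mtok):
    """Cost in USD for a job measured in millions of tokens."""
    inp, outp = PRICES[model]
    return input_mtok * inp + output_mtok * outp

# Hypothetical agent run: 10M input tokens, 2M output tokens.
mimo = job_cost("MiMo-V2-Pro", 10, 2)      # 10*$1 + 2*$3  = $16
opus = job_cost("Claude Opus 4.6", 10, 2)  # 10*$5 + 2*$25 = $100
print(f"${mimo:.0f} vs ${opus:.0f} -> {1 - mimo / opus:.0%} cheaper")
```

On this input-heavy mix the total-job saving (84%) is even larger than the headline 80% input-token discount, since the output rate is discounted too.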

Before its official launch, MiMo-V2-Pro operated anonymously on OpenRouter under the codename "Hunter Alpha," where it topped daily rankings for several days and processed over 1 trillion tokens; many users initially assumed it was a new DeepSeek model.

MiMo-V2-Omni: Multimodal Agent Model

MiMo-V2-Omni integrates image, video, and audio encoders into a shared backbone. The model natively supports structured tool calls, function execution, and autonomous UI navigation.

Benchmark Performance:

  • MMMU-Pro (image): 76.8 (beats Claude Opus 4.6's 73.9)
  • Audio benchmarks: Beats Gemini 3 Pro; handles continuous recordings of 10+ hours
  • Video: Falls short of Gemini 3 Pro
  • ClawEval (agent tasks): 54.8 (Claude Opus 4.6: 66.3, GPT-5.2: 59.6)
  • MM-BrowserComp: Outperforms both Gemini 3 Pro and GPT-5.2 on web navigation

Demos showed the model autonomously browsing dashcam footage to flag hazards, navigating e-commerce sites (Xiaohongshu, JD.com), haggling with customer service, completing purchases, and creating and debugging multimedia content, all without human intervention. OpenClaw handles the actual clicks and file operations.

MiMo-V2-TTS: Emotional Speech Synthesis

MiMo-V2-TTS, trained on over 100 million hours of speech data, uses parallel discrete units for finer control over sound, rhythm, and emotion. Users describe desired voice characteristics in natural language ("sleepy, just woken up, slightly hoarse") rather than selecting from presets.

Key capabilities include natively synthesizing both speech and singing in one model, generating paralinguistic sounds (coughs, sighs, laughter) as part of output, and interpreting typographic cues (capitals, repeated characters) as emphasis signals.
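To make the typographic-cue idea concrete, here is a toy detector for the two signals the article names (all-caps words and stretched characters). This is purely illustrative preprocessing logic, not Xiaomi's actual model or API:

```python
import re

def emphasis_cues(text):
    """Toy detector for typographic emphasis cues: all-caps words
    and words with 3+ repeated characters ("sooooo"). Illustrative
    only -- not MiMo-V2-TTS's real text front end."""
    cues = []
    for word in re.findall(r"[A-Za-z]+", text):
        if word.isupper() and len(word) > 1:
            cues.append((word, "caps-emphasis"))
        elif re.search(r"(.)\1{2,}", word):
            cues.append((word, "stretched"))
    return cues

print(emphasis_cues("NO way, that is sooooo good"))
# [('NO', 'caps-emphasis'), ('sooooo', 'stretched')]
```

In the actual model these cues presumably condition prosody end to end rather than passing through an explicit tagging step like this.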

Strategic Positioning

Xiaomi has partnered with five agent frameworks (OpenClaw, OpenCode, KiloCode, Blackbox, Cline) and is offering free API access for one week. The company explicitly targets AI agent and robotics workloads; the MiMo team says its next priorities are long-horizon planning across hours and days, real-time streaming, multi-agent coordination, and robotics.

What this means

Xiaomi is pricing aggressively to dislodge Anthropic from its dominance of the agent and coding segment, with MiMo-V2-Pro demonstrating near-parity on benchmarks at significantly lower cost. However, meaningful gaps remain on general agent tasks (MiMo-V2-Omni's ClawEval score lags Claude Opus) and multimodal reasoning (video performance trails Gemini 3 Pro). The three-model release signals Xiaomi's commitment to an integrated agent platform built from specialized models, in contrast to Anthropic's strategy of concentrating multimodal capability in a single flagship model.

Related Articles

model release

Xiaomi releases MiMo-V2-Pro with 1M context window and 1T+ parameters

Xiaomi released MiMo-V2-Pro on March 18, 2026, a flagship foundation model with over 1 trillion total parameters and a 1,048,576 token context window. The model is priced at $1 per million input tokens and $3 per million output tokens, positioning it as an agent-focused system comparable to top-tier models.

model release

OpenAI releases GPT-4o mini with 128K context at $0.15/$0.60 per 1M tokens

OpenAI released GPT-4o mini on July 18, 2024, a compact multimodal model with 128,000 token context window priced at $0.15 per million input tokens and $0.60 per million output tokens. The model achieves 82% on MMLU and claims to rank higher than GPT-4 on chat preference leaderboards while costing 60% less than GPT-3.5 Turbo.

model release

Rakuten releases RakutenAI-3.0, 671B-parameter Japanese-optimized mixture-of-experts model

Rakuten Group has released RakutenAI-3.0, a 671 billion parameter mixture-of-experts (MoE) model designed specifically for Japanese language tasks. The model activates 37 billion parameters per token and supports a 128K context window. It is available under the Apache License 2.0 on Hugging Face.

model release

Nvidia releases Nemotron 3 Super: 120B MoE model with 1M token context

Nvidia has released Nemotron 3 Super, a 120-billion parameter hybrid Mamba-Transformer Mixture-of-Experts model that activates only 12 billion parameters during inference. The open-weight model features a 1-million token context window, multi-token prediction capabilities, and pricing at $0.10 per million input tokens and $0.50 per million output tokens.