MiMo-V2-Omni

Status: active

Xiaomi's frontier omni-modal model that natively processes image, video, and audio inputs within a unified architecture. It combines strong multimodal perception with agentic capabilities, including visual grounding, multi-step planning, tool use, and code execution.
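As a sketch, a multimodal request to a model like this is typically sent through an OpenAI-compatible chat API; the message shape below (text plus `image_url` content parts) follows that common convention, but the exact endpoint and API surface are assumptions, not details confirmed by this listing:

```python
# Hypothetical request payload for an OpenAI-compatible chat API.
# The model id matches the version on this page; the message shape
# (mixed text + image_url parts) is the common multimodal convention
# and is an assumption here, not a documented API for this model.
def build_request(prompt: str, image_url: str) -> dict:
    return {
        "model": "mimo-v2-omni-2026-03-18",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }

payload = build_request("Describe this image.", "https://example.com/cat.png")
print(payload["model"])
```

Audio or video inputs would follow the same pattern with their own content-part types, which vary by provider.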

Context window: 262K tokens
Input / 1M tokens: $0.40
Output / 1M tokens: $2.00
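At these rates, the cost of a request is linear in token counts. A quick sketch (rates taken from this page; the example token counts are hypothetical):

```python
# Per-million-token rates from this listing.
INPUT_PER_M = 0.40   # $ per 1M input tokens
OUTPUT_PER_M = 2.00  # $ per 1M output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the listed rates."""
    return (input_tokens * INPUT_PER_M + output_tokens * OUTPUT_PER_M) / 1_000_000

# e.g. 200K input tokens + 2K output tokens
print(round(request_cost(200_000, 2_000), 4))  # → 0.084
```

Note that a single request can use at most the 262K-token context window, so per-request input cost is bounded at roughly $0.10.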

Version History

mimo-v2-omni-2026-03-18 (major)

Xiaomi debuts MiMo-V2-Omni, a frontier omni-modal model that natively processes image, video, and audio. 262K context window with strong agentic capabilities, including visual grounding and code execution.