MiMo-V2-Omni
Xiaomi's frontier omni-modal model, natively processing image, video, and audio inputs within a unified architecture. It combines strong multimodal perception with agentic capabilities, including visual grounding, multi-step planning, tool use, and code execution.
Context window: 262K tokens
Input: $0.40 / 1M tokens
Output: $2.00 / 1M tokens
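Given the listed per-token rates, a request's cost is simple arithmetic. The sketch below is a hypothetical helper (not part of any official Xiaomi SDK) that estimates USD cost from input and output token counts using the rates above:

```python
def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate a MiMo-V2-Omni request's cost in USD from the listed rates."""
    INPUT_RATE = 0.40 / 1_000_000   # $0.40 per 1M input tokens
    OUTPUT_RATE = 2.00 / 1_000_000  # $2.00 per 1M output tokens
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# e.g. a 100K-token multimodal prompt with a 2K-token reply
print(f"${estimate_cost(100_000, 2_000):.4f}")
```

Note that output tokens cost 5x input tokens, so for long-generation workloads the output side usually dominates the bill.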
Version History
mimo-v2-omni-2026-03-18 (major)
Xiaomi debuts MiMo-V2-Omni, a frontier omni-modal model that processes image, video, and audio natively. It offers a 262K-token context window and strong agentic capabilities, including visual grounding and code execution.