Xiaomi

MiMo-V2-Omni

MiMo-V2-Omni is a frontier omni-modal model that natively processes image, video, and audio inputs within a unified architecture. It combines strong multimodal perception with agentic capabilities (visual grounding, multi-step planning, tool use, and code execution), making it well suited to complex real-world tasks that span modalities. It has a 256K-token context window.

Provider: Xiaomi
Context window: 262K tokens
Input / 1M tokens: $0.400
Cached input / 1M tokens: $0.080
Output / 1M tokens: $2.00
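
As a rough illustration of how the per-million-token prices above translate into the cost of a single request, here is a minimal Python sketch. The three rates are copied from this listing. Everything else is an assumption made for illustration: the function names are hypothetical, cached tokens are assumed to be the portion of the input billed at the cached rate instead of the standard input rate, "256K" is assumed to mean 256 * 1024 = 262,144 tokens, and input and output are assumed to share that window. Actual billing rules may differ.

```python
# Prices in USD per 1M tokens, copied from the listing above.
INPUT_PER_M = 0.400
CACHED_INPUT_PER_M = 0.080
OUTPUT_PER_M = 2.00

# Assumption: "256K" means 256 * 1024 tokens, which rounds to the listed 262K.
CONTEXT_WINDOW = 256 * 1024


def estimate_cost(input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    """Estimate the USD cost of one request (hypothetical helper).

    Assumption: cached_tokens is the part of input_tokens served from cache,
    billed at the cached rate instead of the standard input rate.
    """
    if output_tokens < 0:
        raise ValueError("output_tokens must be non-negative")
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    fresh_tokens = input_tokens - cached_tokens
    total = (
        fresh_tokens * INPUT_PER_M
        + cached_tokens * CACHED_INPUT_PER_M
        + output_tokens * OUTPUT_PER_M
    )
    return total / 1_000_000


def fits_context(input_tokens: int, output_tokens: int) -> bool:
    """Check a request against the window (assumes input and output share it)."""
    return input_tokens + output_tokens <= CONTEXT_WINDOW


if __name__ == "__main__":
    cost = estimate_cost(100_000, 2_000, cached_tokens=60_000)
    print(f"Estimated cost: ${cost:.4f}")
    print("Fits context:", fits_context(100_000, 2_000))
```

Under these assumptions, a request with 100,000 input tokens (60,000 of them cached) and 2,000 output tokens comes to about $0.0248: $0.0160 for uncached input, $0.0048 for cached input, and $0.0040 for output.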

Performance

Median streaming throughput and first-token latency measured by Artificial Analysis.

Output tokens / sec: 0 t/s
Time to first token: 0.00s

Benchmarks

Intelligence, coding, and math indexes plus the underlying evaluation scores.

Intelligence Index: 43
Coding Index: 36
Math Index: —
MMLU-Pro: —
GPQA: 82.8%
HLE: 19.9%
LiveCodeBench: —
SciCode: 36.7%
MATH-500: —
AIME: —

Benchmarks via Artificial Analysis