Xiaomi
MiMo-V2-Omni
MiMo-V2-Omni is a frontier omni-modal model that natively processes image, video, and audio inputs within a unified architecture. It combines strong multimodal perception with agentic capability (visual grounding, multi-step planning, tool use, and code execution), making it well suited for complex real-world tasks that span modalities. It supports a 256K-token context window.
- Input / 1M tokens: $0.400
- Output / 1M tokens: $2.00
- Context window: 262K tokens
- Provider: Xiaomi
- Cached input / 1M tokens: $0.080
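The per-token prices above can be turned into a rough per-request cost estimate. The sketch below is illustrative only: the function name `estimate_cost` and the assumption that cached input tokens are a subset of the input tokens, billed at the cached rate instead of the regular input rate, are not stated on this page, so check the provider's billing rules before relying on it.

```python
from decimal import Decimal

# Prices in USD per 1M tokens, copied from the listing above.
INPUT_PER_M = Decimal("0.400")
CACHED_INPUT_PER_M = Decimal("0.080")
OUTPUT_PER_M = Decimal("2.00")
MILLION = Decimal(1_000_000)


def estimate_cost(input_tokens: int, output_tokens: int,
                  cached_input_tokens: int = 0) -> Decimal:
    """Estimate the USD cost of one request.

    Assumption (not from the listing): cached_input_tokens is the part of
    input_tokens served from cache and billed at the cached rate.
    """
    if min(input_tokens, output_tokens, cached_input_tokens) < 0:
        raise ValueError("token counts must be non-negative")
    if cached_input_tokens > input_tokens:
        raise ValueError("cached_input_tokens cannot exceed input_tokens")

    fresh_input_tokens = input_tokens - cached_input_tokens
    total = (
        fresh_input_tokens * INPUT_PER_M
        + cached_input_tokens * CACHED_INPUT_PER_M
        + output_tokens * OUTPUT_PER_M
    )
    return total / MILLION


if __name__ == "__main__":
    # 100,000 input tokens (60,000 of them cached) and 5,000 output tokens.
    print(estimate_cost(100_000, 5_000, cached_input_tokens=60_000))
```

`Decimal` is used rather than `float` so that small per-request amounts add up without binary rounding drift when many requests are summed.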
Performance
Median streaming throughput and first-token latency measured by Artificial Analysis.
- Output tokens / sec: 0 t/s
- Time to first token: 0.00s
Benchmarks
Intelligence, coding, and math indexes plus the underlying evaluation scores.
- Intelligence Index: 43
- Coding Index: 36
- Math Index: —
- MMLU-Pro: —
- GPQA: 82.8%
- HLE: 19.9%
- LiveCodeBench: —
- SciCode: 36.7%
- MATH-500: —
- AIME: —
Benchmarks via Artificial Analysis