2 months ago

Kimi-VL Technical Report

Abstract

We present Kimi-VL, an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) that offers advanced multimodal reasoning, long-context understanding, and strong agent capabilities - all while activating only 2.8B parameters in its language decoder (Kimi-VL-A3B). Kimi-VL demonstrates strong performance across challenging domains: as a general-purpose VLM, Kimi-VL excels in multi-turn agent tasks (e.g., OSWorld), matching flagship models. Furthermore, it exhibits remarkable capabilities across diverse challenging vision language tasks, including college-level image and video comprehension, OCR, mathematical reasoning, and multi-image understanding. In comparative evaluations, it effectively competes with cutting-edge efficient VLMs such as GPT-4o-mini, Qwen2.5-VL-7B, and Gemma-3-12B-IT, while surpassing GPT-4o in several key domains. Kimi-VL also advances in processing long contexts and perceiving clearly. With a 128K extended context window, Kimi-VL can process diverse long inputs, achieving impressive scores of 64.5 on LongVideoBench and 35.1 on MMLongBench-Doc. Its native-resolution vision encoder, MoonViT, further allows it to see and understand ultra-high-resolution visual inputs, achieving 83.2 on InfoVQA and 34.5 on ScreenSpot-Pro, while maintaining lower computational cost for common tasks. Building upon Kimi-VL, we introduce an advanced long-thinking variant: Kimi-VL-Thinking. Developed through long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL), this model exhibits strong long-horizon reasoning capabilities. It achieves scores of 61.7 on MMMU, 36.8 on MathVision, and 71.3 on MathVista while maintaining the compact 2.8B activated LLM parameters, setting a new standard for efficient multimodal thinking models. Code and models are publicly accessible at https://github.com/MoonshotAI/Kimi-VL.

Code Repositories

moonshotai/kimi-vl
Official
pytorch
Mentioned in GitHub

Benchmarks

Benchmark | Methodology | Metrics
optical-character-recognition-on-ocrbench-v2-chinese | Kimi-VL-A3B-16B | Accuracy: 54.1
