
Abstract
The recent surge of Multimodal Large Language Models (MLLMs) has fundamentally reshaped the landscape of AI research and industry, shedding light on a promising path toward the next AI milestone. However, significant challenges remain preventing MLLMs from being practical in real-world applications. The most notable challenge comes from the huge cost of running an MLLM with a massive number of parameters and extensive computation. As a result, most MLLMs need to be deployed on high-performing cloud servers, which greatly limits their application scopes such as mobile, offline, energy-sensitive, and privacy-protective scenarios. In this work, we present MiniCPM-V, a series of efficient MLLMs deployable on end-side devices. By integrating the latest MLLM techniques in architecture, pretraining and alignment, the latest MiniCPM-Llama3-V 2.5 has several notable features: (1) strong performance, outperforming GPT-4V-1106, Gemini Pro and Claude 3 on OpenCompass, a comprehensive evaluation over 11 popular benchmarks, (2) strong OCR capability and 1.8M pixel high-resolution image perception at any aspect ratio, (3) trustworthy behavior with low hallucination rates, (4) multilingual support for 30+ languages, and (5) efficient deployment on mobile phones. More importantly, MiniCPM-V can be viewed as a representative example of a promising trend: The model sizes for achieving usable (e.g., GPT-4V) level performance are rapidly decreasing, along with the fast growth of end-side computation capacity. This jointly shows that GPT-4V level MLLMs deployed on end devices are becoming increasingly possible, unlocking a wider spectrum of real-world AI applications in the near future.
Benchmarks
| Benchmark | Model | Metrics |
|---|---|---|
| temporal-relation-extraction-on-vinoground | MiniCPM-2.6 | Group Score: 11.2; Text Score: 32.6; Video Score: 29.2 |
| zero-shot-video-question-answer-on-video-mme-1 | MiniCPM-V 2.6 (8B) | Accuracy (%): 63.7 |