
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models

Yanwei Li Yuechen Zhang Chengyao Wang Zhisheng Zhong Yixin Chen Ruihang Chu Shaoteng Liu Jiaya Jia

Abstract

In this work, we introduce Mini-Gemini, a simple and effective framework enhancing multi-modality Vision Language Models (VLMs). Despite the advancements in VLMs facilitating basic visual dialog and reasoning, a performance gap persists compared to advanced models like GPT-4 and Gemini. We try to narrow the gap by mining the potential of VLMs for better performance and any-to-any workflow from three aspects, i.e., high-resolution visual tokens, high-quality data, and VLM-guided generation. To enhance visual tokens, we propose to utilize an additional visual encoder for high-resolution refinement without increasing the visual token count. We further construct a high-quality dataset that promotes precise image comprehension and reasoning-based generation, expanding the operational scope of current VLMs. In general, Mini-Gemini further mines the potential of VLMs and empowers current frameworks with image understanding, reasoning, and generation simultaneously. Mini-Gemini supports a series of dense and MoE Large Language Models (LLMs) from 2B to 34B. It is demonstrated to achieve leading performance in several zero-shot benchmarks and even surpasses the developed private models. Code and models are available at https://github.com/dvlab-research/MiniGemini.

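The high-resolution refinement mentioned in the abstract pairs the standard low-resolution visual tokens with features from an additional high-resolution encoder. The minimal PyTorch sketch below illustrates one way such a refinement could look: low-resolution tokens act as queries that cross-attend into a larger set of high-resolution patch features, so detail is injected while the token count fed to the LLM stays fixed. The module name HighResRefinement, the shapes, and the single cross-attention layer are illustrative assumptions, not the paper's implementation.

# Illustrative sketch only; module names, shapes, and the single cross-attention
# layer are assumptions, not the authors' released code.
import torch
import torch.nn as nn


class HighResRefinement(nn.Module):
    def __init__(self, dim: int = 1024, num_heads: int = 8):
        super().__init__()
        # Cross-attention: low-res tokens (queries) attend to high-res patches (keys/values).
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, low_res_tokens: torch.Tensor, high_res_patches: torch.Tensor) -> torch.Tensor:
        # low_res_tokens:   (B, N_low,  dim), e.g. tokens from a CLIP-style encoder
        # high_res_patches: (B, N_high, dim), e.g. patches from a higher-resolution encoder
        q = self.norm_q(low_res_tokens)
        kv = self.norm_kv(high_res_patches)
        refined, _ = self.cross_attn(q, kv, kv)
        # Residual connection keeps the output token count fixed at N_low.
        return low_res_tokens + refined


if __name__ == "__main__":
    B, dim = 2, 1024
    low = torch.randn(B, 576, dim)    # low-resolution visual tokens fed to the LLM
    high = torch.randn(B, 2304, dim)  # high-resolution candidate features
    out = HighResRefinement(dim)(low, high)
    print(out.shape)  # torch.Size([2, 576, 1024]), same token count as the low-res branch

Because the output keeps the shape of the low-resolution branch, the language model sees the same number of visual tokens as before, while each token can now draw on high-resolution detail.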
