
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token

Shaolei Zhang Qingkai Fang Zhe Yang Yang Feng

Abstract

The advent of real-time large multimodal models (LMMs) like GPT-4o has sparked considerable interest in efficient LMMs. LMM frameworks typically encode visual inputs into vision tokens (continuous representations) and integrate them and textual instructions into the context of large language models (LLMs), where large-scale parameters and numerous context tokens (predominantly vision tokens) result in substantial computational overhead. Previous efforts towards efficient LMMs always focus on replacing the LLM backbone with smaller models, while neglecting the crucial issue of token quantity. In this paper, we introduce LLaVA-Mini, an efficient LMM with minimal vision tokens. To achieve a high compression ratio of vision tokens while preserving visual information, we first analyze how LMMs understand vision tokens and find that most vision tokens only play a crucial role in the early layers of the LLM backbone, where they mainly fuse visual information into text tokens. Building on this finding, LLaVA-Mini introduces modality pre-fusion to fuse visual information into text tokens in advance, thereby facilitating the extreme compression of the vision tokens fed to the LLM backbone into one token. LLaVA-Mini is a unified large multimodal model that can support the understanding of images, high-resolution images, and videos in an efficient manner. Experiments across 11 image-based and 7 video-based benchmarks demonstrate that LLaVA-Mini outperforms LLaVA-v1.5 with just 1 vision token instead of 576. Efficiency analyses reveal that LLaVA-Mini can reduce FLOPs by 77%, deliver low-latency responses within 40 milliseconds, and process over 10,000 frames of video on GPU hardware with 24 GB of memory.
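The abstract describes two mechanisms: modality pre-fusion (text tokens absorb visual information before entering the LLM backbone) and extreme compression of the vision tokens down to a single token. A minimal PyTorch sketch of that idea follows; the module and parameter names are illustrative assumptions, not taken from the official ictnlp/llava-mini implementation.

```python
import torch
import torch.nn as nn

class ModalityPreFusion(nn.Module):
    """Illustrative sketch of pre-fusion + one-token compression.

    Names and layer choices here are assumptions for exposition,
    not the official LLaVA-Mini architecture.
    """

    def __init__(self, dim=64, n_heads=4):
        super().__init__()
        # Pre-fusion: text tokens cross-attend to vision tokens,
        # absorbing visual information before the LLM backbone.
        self.fusion = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Compression: one learnable query pools all vision tokens
        # (e.g. 576 for LLaVA-v1.5) into a single vision token.
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.compress = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, vision_tokens, text_tokens):
        # vision_tokens: (B, V, D); text_tokens: (B, T, D)
        fused_text, _ = self.fusion(text_tokens, vision_tokens, vision_tokens)
        q = self.query.expand(vision_tokens.size(0), -1, -1)
        one_token, _ = self.compress(q, vision_tokens, vision_tokens)
        # LLM context becomes: 1 vision token + T pre-fused text tokens,
        # instead of V + T tokens.
        return torch.cat([one_token, text_tokens + fused_text], dim=1)

# 576 vision tokens collapse to 1, so the context shrinks from 586 to 11.
B, V, T, D = 2, 576, 10, 64
out = ModalityPreFusion(dim=D)(torch.randn(B, V, D), torch.randn(B, T, D))
print(out.shape)  # torch.Size([2, 11, 64])
```

The context-length reduction is what drives the reported FLOPs savings: attention cost in the backbone scales with the square of the number of context tokens, so replacing 576 vision tokens with 1 removes most of that cost.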

Code Repositories

ictnlp/llava-mini (official, PyTorch)

Benchmarks

Benchmark                            Method       Accuracy   Confidence Score
Zero-shot Video QA on ActivityNet    LLaVA-Mini   53.5       3.5
Zero-shot Video QA on MSRVTT-QA      LLaVA-Mini   59.5       3.6
Zero-shot Video QA on MSVD-QA        LLaVA-Mini   70.9       4.0
