2 months ago

MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs

Sixun Dong Juhua Hu Mian Zhang Ming Yin Yanjie Fu Qi Qian

Abstract

Vision-Language Models (VLMs) demonstrate impressive performance in understanding visual content with language instruction by converting visual input to vision tokens. However, redundancy in vision tokens degrades the inference efficiency of VLMs. While many algorithms have been proposed to reduce the number of vision tokens, most of them apply only unimodal information (i.e., vision or text) for pruning and ignore the inherent multimodal property of vision-language tasks. Moreover, there is no generic criterion that can be applied to different modalities. To mitigate this limitation, in this work we propose to leverage both vision and text tokens to select informative vision tokens by the criterion of coverage. We first formulate the subset selection problem as a maximum coverage problem. Afterward, a subset of vision tokens is optimized to cover the text tokens and the original set of vision tokens simultaneously. Finally, a VLM agent can be adopted to further improve the quality of text tokens for guiding vision pruning. The proposed method, MMTok, is extensively evaluated on benchmark datasets with different VLMs. The comparison illustrates that vision and text information are complementary, and combining multimodal information can surpass the unimodal baseline by a clear margin. Moreover, under the maximum coverage criterion on the POPE dataset, our method achieves a 1.87x speedup while maintaining 98.7% of the original performance on LLaVA-NeXT-13B. Furthermore, with only four vision tokens, it still preserves 87.7% of the original performance on LLaVA-1.5-7B. These results highlight the effectiveness of coverage in token selection.
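
To make the coverage idea concrete, the following is a minimal sketch of greedy subset selection under a coverage objective. It is not the authors' implementation: the abstract does not specify the coverage function, so this sketch assumes a facility-location style objective over cosine similarities, and the function name `greedy_coverage_select`, the helper `_normalize`, and the weighting parameter `alpha` are all hypothetical choices made for illustration.

```python
import numpy as np


def _normalize(x):
    # Row-wise L2 normalization so that dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True).clip(min=1e-12)


def greedy_coverage_select(vision, text, k, alpha=0.5):
    """Greedily pick up to k vision-token indices.

    ASSUMED objective (not taken from the paper): a target token is
    "covered" to the degree of its highest non-negative cosine similarity
    with any selected vision token. Targets are the text tokens (weight
    alpha) and the full set of vision tokens (weight 1 - alpha).
    """
    v = _normalize(np.asarray(vision, dtype=float))
    t = _normalize(np.asarray(text, dtype=float))
    sim_t = v @ t.T  # (n_vision, n_text): candidate vs. text targets
    sim_v = v @ v.T  # (n_vision, n_vision): candidate vs. vision targets

    best_t = np.zeros(t.shape[0])  # current coverage of each text token
    best_v = np.zeros(v.shape[0])  # current coverage of each vision token
    chosen = np.zeros(v.shape[0], dtype=bool)
    selected = []

    for _ in range(min(k, v.shape[0])):
        # Marginal gain of each candidate: how much it raises coverage.
        gain_t = np.maximum(sim_t - best_t, 0.0).mean(axis=1)
        gain_v = np.maximum(sim_v - best_v, 0.0).mean(axis=1)
        gain = alpha * gain_t + (1.0 - alpha) * gain_v
        gain[chosen] = -np.inf  # never pick the same token twice

        j = int(np.argmax(gain))
        selected.append(j)
        chosen[j] = True
        best_t = np.maximum(best_t, sim_t[j])
        best_v = np.maximum(best_v, sim_v[j])

    return selected


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vision_tokens = rng.normal(size=(576, 64))  # illustrative shapes only
    text_tokens = rng.normal(size=(12, 64))
    kept = greedy_coverage_select(vision_tokens, text_tokens, k=4)
    print("kept vision token indices:", kept)
```

Under this assumed objective, setting `alpha=1.0` uses only text coverage and `alpha=0.0` uses only vision coverage, which mirrors the unimodal versus multimodal contrast drawn in the abstract. Greedy selection is a common choice for such objectives because they are monotone submodular, but whether MMTok uses this exact procedure is not stated in the abstract.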
