
Abstract

Vision-Language Models (VLMs) demonstrate impressive performance in understanding visual content with language instruction by converting visual input to vision tokens. However, redundancy in vision tokens degrades the inference efficiency of VLMs. While many algorithms have been proposed to reduce the number of vision tokens, most of them use only unimodal information (i.e., vision or text) for pruning and ignore the inherently multimodal nature of vision-language tasks. Moreover, there is no generic criterion that can be applied to different modalities. To mitigate this limitation, in this work we propose to leverage both vision and text tokens to select informative vision tokens by the criterion of coverage. We first formulate the subset selection problem as a maximum coverage problem. Afterward, a subset of vision tokens is optimized to cover the text tokens and the original set of vision tokens simultaneously. Finally, a VLM agent can be adopted to further improve the quality of the text tokens that guide vision pruning. The proposed method, MMTok, is extensively evaluated on benchmark datasets with different VLMs. The comparison illustrates that vision and text information are complementary, and that combining multimodal information can surpass the unimodal baseline by a clear margin. Moreover, under the maximum coverage criterion on the POPE dataset, our method achieves a 1.87x speedup while maintaining 98.7% of the original performance on LLaVA-NeXT-13B. Furthermore, with only four vision tokens, it still preserves 87.7% of the original performance on LLaVA-1.5-7B. These results highlight the effectiveness of coverage in token selection.
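
To make the coverage idea concrete, the following is a minimal sketch of greedy subset selection under a coverage-style objective. It is not the MMTok implementation: the abstract does not specify the similarity measure, the exact objective, or the optimizer, so the cosine similarity, the facility-location form of the objective (each target token is credited with its best-matching selected token), the greedy loop, and the names `greedy_cover` and `alpha` are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def marginal_gain(best, sims):
    """Increase in coverage if a candidate with similarities `sims` is added."""
    return sum(max(s - b, 0.0) for b, s in zip(best, sims))


def greedy_cover(vision, text, k, alpha=1.0):
    """Pick up to k vision-token indices that cover text and vision tokens.

    Assumed objective (hypothetical, for illustration only):
        f(S) = sum_t max_{s in S} sim(t, s) + alpha * sum_v max_{s in S} sim(v, s)
    where t ranges over text tokens and v over all vision tokens.
    """
    n = len(vision)
    # Clamp negative similarities to 0 so coverage is non-negative.
    sim_text = [[max(cosine(v, t), 0.0) for t in text] for v in vision]
    sim_vis = [[max(cosine(v, w), 0.0) for w in vision] for v in vision]

    best_text = [0.0] * len(text)  # current coverage of each text token
    best_vis = [0.0] * n           # current coverage of each vision token
    selected = []

    for _ in range(min(k, n)):
        best_i, best_gain = None, -1.0
        for i in range(n):
            if i in selected:
                continue
            gain = (marginal_gain(best_text, sim_text[i])
                    + alpha * marginal_gain(best_vis, sim_vis[i]))
            if gain > best_gain:
                best_i, best_gain = i, gain
        selected.append(best_i)
        # Update per-target coverage with the newly selected token.
        best_text = [max(b, s) for b, s in zip(best_text, sim_text[best_i])]
        best_vis = [max(b, s) for b, s in zip(best_vis, sim_vis[best_i])]
    return selected


# Toy example: two near-duplicate vision tokens and one distinct token.
vision_tokens = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
text_tokens = [[1.0, 0.0]]
picked = greedy_cover(vision_tokens, text_tokens, k=2)
```

In this toy input, covering the vision set discourages selecting both near-duplicate tokens, while setting `alpha=0.0` reduces the sketch to text-only coverage. This mirrors, in a very simplified way, the abstract's point that the two sources of information are complementary.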