
Abstract

In this paper, we present LLaVA-Scissor, a training-free token compression strategy designed for video multimodal large language models. Previous methods mostly attempt to compress tokens based on attention scores, but fail to effectively capture all semantic regions and often lead to token redundancy. In contrast, we propose to leverage the Semantic Connected Components (SCC) approach, which assigns tokens to distinct semantic regions within the token set, ensuring comprehensive semantic coverage. The outcome is a two-step spatio-temporal token compression strategy that utilizes SCC in both the spatial and temporal domains. This strategy can effectively compress tokens by representing the entire video with a set of non-overlapping semantic tokens. We conduct extensive evaluations of the token compression capabilities of LLaVA-Scissor across diverse video understanding benchmarks, including video question answering, long video understanding, and comprehensive multiple-choice benchmarks. Experimental results show that the proposed LLaVA-Scissor outperforms other token compression methods, achieving superior performance on various video understanding benchmarks, particularly at low token retention ratios. Project page: https://github.com/HumanMLLM/LLaVA-Scissor.
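To make the two-step idea concrete, the following is a minimal sketch of how connected-component token grouping could look. It is not the released implementation. The cosine-similarity graph, the threshold `tau`, the mean-pooled representative per component, and the function names `semantic_connected_components` and `compress_video` are all assumptions introduced here for illustration.

```python
import numpy as np

def semantic_connected_components(tokens, tau=0.8):
    """Group tokens into connected components of a similarity graph.

    Assumption: two tokens are linked when their cosine similarity
    exceeds `tau` (hypothetical threshold). Each component is then
    represented by the mean of its member tokens (also an assumption).
    Returns (representatives, labels).
    """
    n = tokens.shape[0]
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    normed = tokens / np.maximum(norms, 1e-12)  # guard against zero vectors
    adjacency = (normed @ normed.T) > tau

    # Union-find over the thresholded similarity graph.
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if adjacency[i, j]:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri

    roots = np.array([find(i) for i in range(n)])
    _, labels = np.unique(roots, return_inverse=True)
    num_components = int(labels.max()) + 1
    reps = np.stack(
        [tokens[labels == c].mean(axis=0) for c in range(num_components)]
    )
    return reps, labels

def compress_video(frames, tau=0.8):
    """Two-step sketch: group tokens within each frame, then across frames."""
    # Step 1 (spatial): one set of representative tokens per frame.
    per_frame = [semantic_connected_components(f, tau)[0] for f in frames]
    # Step 2 (temporal): merge similar representatives across all frames.
    merged = np.concatenate(per_frame, axis=0)
    return semantic_connected_components(merged, tau)[0]
```

Because every token lands in exactly one component, the resulting representatives are non-overlapping and cover the whole token set. In this sketch the number of retained tokens is controlled indirectly through `tau`: a lower threshold merges more tokens into each component and yields fewer representatives.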
