4 months ago

LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs

Boyuan Sun, Jiaxing Zhao, Xihan Wei, Qibin Hou

Abstract

In this paper, we present LLaVA-Scissor, a training-free token compression strategy designed for video multimodal large language models. Previous methods mostly attempt to compress tokens based on attention scores, but they fail to effectively capture all semantic regions and often lead to token redundancy. In contrast, we propose to leverage the Semantic Connected Components (SCC) approach, which assigns tokens to distinct semantic regions within the token set, ensuring comprehensive semantic coverage. The outcome is a two-step spatio-temporal token compression strategy that utilizes SCC in both the spatial and temporal domains. This strategy can effectively compress tokens by representing the entire video with a set of non-overlapping semantic tokens. We conduct extensive evaluations of the token compression capabilities of LLaVA-Scissor across diverse video understanding benchmarks, including video question answering, long video understanding, and comprehensive multiple-choice benchmarks. Experimental results show that the proposed LLaVA-Scissor outperforms other token compression methods, achieving superior performance in various video understanding benchmarks, particularly at low token retention ratios. Project page: https://github.com/HumanMLLM/LLaVA-Scissor.
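
To make the idea of grouping tokens into connected components more concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify how tokens are linked or how a component is reduced to one token, so the cosine-similarity threshold `tau`, the mean-based merge, and all function names here are assumptions chosen for illustration. The sketch applies the same grouping first within each frame (spatial step) and then across the surviving tokens of all frames (temporal step).

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def connected_components(tokens, tau):
    """Group token indices into the connected components of the graph
    that links two tokens when their cosine similarity exceeds tau.
    (The linking rule is an assumption, not taken from the paper.)"""
    parent = list(range(len(tokens)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            if cosine(tokens[i], tokens[j]) > tau:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri

    groups = {}
    for i in range(len(tokens)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

def compress(tokens, tau):
    """Replace each component by the mean of its member tokens
    (mean merging is an assumption made for this sketch)."""
    out = []
    for comp in connected_components(tokens, tau):
        dim = len(tokens[comp[0]])
        out.append([sum(tokens[i][d] for i in comp) / len(comp)
                    for d in range(dim)])
    return out

def compress_video(frames, tau):
    """Two-step sketch: group within each frame, then across frames."""
    spatial = [tok for frame in frames for tok in compress(frame, tau)]
    return compress(spatial, tau)

# Toy input: two frames of 2-D token embeddings (hypothetical values).
frames = [
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]],
    [[1.0, 0.05], [0.1, 0.9]],
]
compressed = compress_video(frames, tau=0.9)
print(len(compressed))
```

In this toy input, five tokens across two frames fall into two semantic groups, so each component ends up represented by a single token and no token belongs to more than one component. A real system would operate on high-dimensional visual embeddings and would likely avoid the quadratic pairwise loop used here.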

Code Repositories

HumanMLLM/LLaVA-Scissor (official, PyTorch, mentioned in GitHub)
