LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

Abstract
Multimodal Large Language Models (MLLMs) have shown promising progress in understanding and analyzing video content. However, processing long videos remains a significant challenge constrained by the LLM's context size. To address this limitation, we propose LongVU, a spatiotemporal adaptive compression mechanism that reduces the number of video tokens while preserving the visual details of long videos. Our idea is based on leveraging cross-modal queries and inter-frame dependencies to adaptively reduce temporal and spatial redundancy in videos. Specifically, we leverage DINOv2 features to remove redundant frames that exhibit high similarity. Then we utilize a text-guided cross-modal query for selective frame feature reduction. Further, we perform spatial token reduction across frames based on their temporal dependencies. Our adaptive compression strategy effectively processes a large number of frames with little visual information loss within a given context length. LongVU consistently surpasses existing methods across a variety of video understanding benchmarks, especially on hour-long video understanding tasks such as VideoMME and MLVU. Given a lightweight LLM, LongVU also scales effectively to a smaller size with state-of-the-art video understanding performance.
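The first compression step described above, pruning temporally redundant frames whose features are highly similar, can be sketched as follows. This is a minimal illustration only: the toy feature vectors stand in for DINOv2 frame embeddings, and the 0.85 cosine-similarity threshold is a placeholder, not the paper's actual hyperparameter.

```python
# Hedged sketch of temporal redundancy removal: keep a frame only if its
# feature vector differs enough from the last frame we kept.
# (Toy vectors and threshold are illustrative, not from the paper.)

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

def prune_redundant_frames(features, threshold=0.85):
    """Return indices of frames to keep, dropping near-duplicates.

    A frame is kept when its similarity to the most recently kept
    frame falls below the threshold (i.e., the content changed).
    """
    kept = [0]  # always keep the first frame
    for i in range(1, len(features)):
        if cosine_similarity(features[i], features[kept[-1]]) < threshold:
            kept.append(i)
    return kept

# Example: frames 0 and 1 are near-identical, frames 2 and 3 are a new scene.
frames = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.0, 1.0]]
print(prune_redundant_frames(frames))  # → [0, 2]
```

In the actual method, the similarity would be computed over high-dimensional DINOv2 embeddings rather than 2-D toy vectors, but the greedy keep-or-drop structure is the same.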
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| video-question-answering-on-mvbench | LongVU (7B) | Avg.: 66.9 |
| zero-shot-video-question-answer-on-egoschema-1 | LongVU (7B) | Accuracy: 67.6 |
| zero-shot-video-question-answer-on-video-mme-1 | LongVU (7B) | Accuracy (%): 60.6 |