LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

Abstract
Multimodal Large Language Models (MLLMs) have shown promising progress in understanding and analyzing video content. However, processing long videos remains a significant challenge constrained by the LLM's context size. To address this limitation, we propose LongVU, a spatiotemporal adaptive compression mechanism that reduces the number of video tokens while preserving the visual details of long videos. Our idea is based on leveraging cross-modal queries and inter-frame dependencies to adaptively reduce temporal and spatial redundancy in videos. Specifically, we leverage DINOv2 features to remove redundant frames that exhibit high similarity. Then we utilize a text-guided cross-modal query for selective frame feature reduction. Further, we perform spatial token reduction across frames based on their temporal dependencies. Our adaptive compression strategy effectively processes a large number of frames with little visual information loss within a given context length. LongVU consistently surpasses existing methods across a variety of video understanding benchmarks, especially on hour-long video understanding tasks such as VideoMME and MLVU. Given a lightweight LLM, LongVU also scales effectively to a smaller size with state-of-the-art video understanding performance.
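The first compression step described above, pruning temporally redundant frames whose features are highly similar, can be sketched as follows. This is a minimal illustration only: the toy feature vectors stand in for DINOv2 frame embeddings, and the 0.85 cosine-similarity threshold is a placeholder, not the paper's actual hyperparameter.

```python
# Hedged sketch of temporal redundancy removal: keep a frame only if its
# feature vector differs enough from the last frame we kept.
# (Toy vectors and threshold are illustrative, not from the paper.)

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

def prune_redundant_frames(features, threshold=0.85):
    """Return indices of frames to keep, dropping near-duplicates.

    A frame is kept when its similarity to the most recently kept
    frame falls below the threshold (i.e., the content changed).
    """
    kept = [0]  # always keep the first frame
    for i in range(1, len(features)):
        if cosine_similarity(features[i], features[kept[-1]]) < threshold:
            kept.append(i)
    return kept

# Example: frames 0 and 1 are near-identical, frames 2 and 3 are a new scene.
frames = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.0, 1.0]]
print(prune_redundant_frames(frames))  # → [0, 2]
```

In the actual method, the similarity would be computed over high-dimensional DINOv2 embeddings rather than 2-D toy vectors, but the greedy keep-or-drop structure is the same.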
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| video-question-answering-on-mvbench | LongVU (7B) | Avg.: 66.9 |
| zero-shot-video-question-answer-on-egoschema-1 | LongVU (7B) | Accuracy: 67.6 |
| zero-shot-video-question-answer-on-video-mme-1 | LongVU (7B) | Accuracy (%): 60.6 |