4 months ago

LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

Abstract

Multimodal Large Language Models (MLLMs) have shown promising progress in understanding and analyzing video content. However, processing long videos remains a significant challenge, constrained by the LLM's context size. To address this limitation, we propose LongVU, a spatiotemporal adaptive compression mechanism that reduces the number of video tokens while preserving the visual details of long videos. Our idea is based on leveraging cross-modal query and inter-frame dependencies to adaptively reduce temporal and spatial redundancy in videos. Specifically, we leverage DINOv2 features to remove redundant frames that exhibit high similarity. Then we utilize text-guided cross-modal query for selective frame feature reduction. Further, we perform spatial token reduction across frames based on their temporal dependencies. Our adaptive compression strategy effectively processes a large number of frames with little visual information loss within a given context length. LongVU consistently surpasses existing methods across a variety of video understanding benchmarks, especially on hour-long video understanding tasks such as VideoMME and MLVU. Given a lightweight LLM, LongVU also scales effectively to a smaller size with state-of-the-art video understanding performance.
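The first step in the abstract, dropping frames whose features are highly similar, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `drop_redundant_frames`, the `threshold` value, and the rule of comparing each frame against the most recently kept frame are all assumptions made for illustration, and the input is any per-frame feature matrix standing in for DINOv2 features.

```python
import numpy as np


def drop_redundant_frames(features: np.ndarray, threshold: float = 0.9) -> list[int]:
    """Return the indices of frames to keep (hypothetical helper).

    features:  (num_frames, dim) array with one feature vector per frame,
               standing in for DINOv2 features.
    threshold: assumed cosine-similarity cutoff; a frame is dropped when its
               similarity to the most recently kept frame exceeds it.
    """
    if features.shape[0] == 0:
        return []

    # L2-normalise each row so a dot product equals cosine similarity.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)

    kept = [0]  # always keep the first frame
    for i in range(1, unit.shape[0]):
        similarity = float(unit[i] @ unit[kept[-1]])
        if similarity <= threshold:
            kept.append(i)
    return kept


if __name__ == "__main__":
    # Toy features: frames 0/1 and frames 2/3 are near-duplicates.
    toy = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
    print(drop_redundant_frames(toy))
```

Comparing against the last kept frame, rather than the immediately preceding frame, avoids keeping a long run of slowly drifting frames that are each similar to their neighbour; the abstract does not say which rule LongVU uses.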

Code Repositories

Vision-CAIR/LongVU
Official
pytorch
Mentioned in GitHub

Benchmarks

Benchmark | Methodology | Metrics
video-question-answering-on-mvbench | LongVU (7B) | Avg.: 66.9
zero-shot-video-question-answer-on-egoschema-1 | LongVU (7B) | Accuracy: 67.6
zero-shot-video-question-answer-on-video-mme-1 | LongVU (7B) | Accuracy (%): 60.6
