MovieChat: From Dense Token to Sparse Memory for Long Video Understanding

Enxin Song; Wenhao Chai; Guanhong Wang; Yucheng Zhang; Haoyang Zhou; Feiyang Wu; Haozhe Chi; Xun Guo; Tian Ye; Yanting Zhang; Yan Lu; Jenq-Neng Hwang; Gaoang Wang

Abstract

Recent work that integrates video foundation models with large language models to build video understanding systems can overcome the limitations of specific, pre-defined vision tasks. Yet existing systems can only handle videos with very few frames. For long videos, computational complexity, memory cost, and long-term temporal connections impose additional challenges. Drawing on the Atkinson-Shiffrin memory model, with Transformer tokens serving as the carriers of memory in combination with our specially designed memory mechanism, we propose MovieChat to overcome these challenges. MovieChat achieves state-of-the-art performance in long video understanding. We also release the MovieChat-1K benchmark, with 1K long videos and 14K manual annotations, to validate the effectiveness of our method.
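The abstract does not spell out the memory mechanism, so the following is only an illustrative sketch of how a dense stream of frame tokens might be condensed into a sparse memory in the spirit of the Atkinson-Shiffrin short-term/long-term split. Everything in it is an assumption made for illustration and not the authors' implementation: the names `SparseMemory`, `consolidate`, `short_capacity`, and `consolidated_len` are hypothetical, each frame is reduced to a single feature vector, and adjacent frames are merged by plain averaging of the most cosine-similar pair.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def consolidate(frames: List[Vector], target_len: int) -> List[Vector]:
    """Repeatedly merge the most similar adjacent pair until target_len frames remain.

    Assumption: merging is a simple element-wise average of the two frames.
    """
    if target_len < 1:
        raise ValueError("target_len must be at least 1")
    merged = [list(f) for f in frames]  # copy so the caller's data is untouched
    while len(merged) > target_len:
        sims = [cosine(merged[i], merged[i + 1]) for i in range(len(merged) - 1)]
        i = max(range(len(sims)), key=sims.__getitem__)  # first index of the best pair
        merged[i] = [(x + y) / 2 for x, y in zip(merged[i], merged[i + 1])]
        del merged[i + 1]
    return merged


class SparseMemory:
    """Hypothetical two-stage memory: a fixed-size short-term buffer whose
    contents are condensed and appended to long-term memory when it fills up."""

    def __init__(self, short_capacity: int, consolidated_len: int) -> None:
        self.short_capacity = short_capacity
        self.consolidated_len = consolidated_len
        self.short_term: List[Vector] = []
        self.long_term: List[Vector] = []

    def add(self, frame: Vector) -> None:
        self.short_term.append(list(frame))
        if len(self.short_term) >= self.short_capacity:
            self.long_term.extend(consolidate(self.short_term, self.consolidated_len))
            self.short_term.clear()
```

Under these assumptions, feeding four frames into `SparseMemory(short_capacity=4, consolidated_len=2)` leaves two merged vectors in `long_term` and an empty `short_term` buffer, so the stored memory grows more slowly than the number of frames seen.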

Code Repositories

rese1f/MovieChat (official implementation, PyTorch)

Benchmarks

Benchmark | Methodology | Metrics
question-answering-on-next-qa-open-ended | MovieChat | Accuracy: 49.9; Confidence Score: 2.7
video-based-generative-performance-1 | MovieChat | gpt-score: 2.76
video-based-generative-performance-2 | MovieChat | gpt-score: 2.42
video-based-generative-performance-3 | MovieChat | gpt-score: 3.01
video-based-generative-performance-4 | MovieChat | gpt-score: 2.93
video-based-generative-performance-5 | MovieChat | gpt-score: 2.24
video-question-answering-on-activitynet-qa | MovieChat | Accuracy: 45.7; Confidence Score: 3.1
zeroshot-video-question-answer-on-activitynet | MovieChat | Accuracy: 45.7; Confidence Score: 3.1
zeroshot-video-question-answer-on-msrvtt-qa | MovieChat | Accuracy: 52.7; Confidence Score: 2.6
zeroshot-video-question-answer-on-msvd-qa | MovieChat | Accuracy: 75.2; Confidence Score: 2.9
