MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
Bo He, Hengduo Li, Young Kyun Jang, Menglin Jia, Xuefei Cao, Ashish Shah, Abhinav Shrivastava, Ser-Nam Lim

Abstract
With the success of large language models (LLMs), integrating the vision model into LLMs to build vision-language foundation models has gained much more interest recently. However, existing LLM-based large multimodal models (e.g., Video-LLaMA, VideoChat) can only take in a limited number of frames for short video understanding. In this study, we mainly focus on designing an efficient and effective model for long-term video understanding. Instead of trying to process more frames simultaneously like most existing work, we propose to process videos in an online manner and store past video information in a memory bank. This allows our model to reference historical video content for long-term analysis without exceeding LLMs' context length constraints or GPU memory limits. Our memory bank can be seamlessly integrated into current multimodal LLMs in an off-the-shelf manner. We conduct extensive experiments on various video understanding tasks, such as long-video understanding, video question answering, and video captioning, and our model can achieve state-of-the-art performances across multiple datasets. Code available at https://boheumd.github.io/MA-LMM/.
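The online processing idea in the abstract can be illustrated with a small sketch: frames arrive one at a time, each frame's feature is appended to a bank, and the bank is kept at a fixed size so memory use does not grow with video length. The abstract does not say how the bank is kept bounded, so the policy here (averaging the two most similar adjacent entries) is an assumption made for illustration. The names `MemoryBank`, `capacity`, `add`, and `cosine` are hypothetical and are not taken from the MA-LMM code.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class MemoryBank:
    """Hypothetical fixed-capacity store of per-frame features, in temporal order."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.slots: List[Vector] = []

    def add(self, feature: Vector) -> None:
        """Append the newest frame feature, then shrink back to capacity if needed."""
        self.slots.append(list(feature))
        if len(self.slots) > self.capacity:
            self._compress()

    def _compress(self) -> None:
        # Assumed policy: find the most similar pair of temporally adjacent
        # entries and replace them with their element-wise average.
        best = max(
            range(len(self.slots) - 1),
            key=lambda i: cosine(self.slots[i], self.slots[i + 1]),
        )
        merged = [(x + y) / 2 for x, y in zip(self.slots[best], self.slots[best + 1])]
        self.slots[best:best + 2] = [merged]


# Stream five toy 2-D "frame features" through a bank that holds three.
bank = MemoryBank(capacity=3)
for frame_feature in [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 1.0]]:
    bank.add(frame_feature)
print(len(bank.slots))
```

Because near-duplicate neighbouring frames are merged first, the bank keeps one entry per distinct segment of the toy stream while staying at its fixed size. A downstream model would read from `bank.slots` instead of from every frame seen so far.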
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| temporal-relation-extraction-on-vinoground | MA-LMM-Vicuna-7B | Group Score: 6.8 Text Score: 23.8 Video Score: 25.6 |
| video-captioning-on-youcook2 | MA-LMM | CIDEr: 1.31 METEOR: 17.6 |
| video-classification-on-breakfast | MA-LMM | Accuracy (%): 93.0 |
| video-classification-on-coin-1 | MA-LMM | Accuracy (%): 93.2 |
| video-question-answering-on-activitynet-qa | MA-LMM | Accuracy: 49.8 |
| video-question-answering-on-msrvtt-qa | MA-LMM | Accuracy: 48.5 |
| visual-question-answering-on-msvd-qa-1 | MA-LMM | Accuracy: 60.6 |