Kanchana Ranasinghe, Xiang Li, Kumara Kahatapitiya, Michael S. Ryoo

Abstract
Large Language Models (LLMs) have allowed recent LLM-based approaches to achieve excellent performance on long-video understanding benchmarks. We investigate how the extensive world knowledge and strong reasoning skills of the underlying LLMs influence this strong performance. Surprisingly, we discover that LLM-based approaches can yield good accuracy on long-video tasks with limited video information, sometimes even with no video-specific information. Building on this, we explore injecting video-specific information into an LLM-based framework. We utilize off-the-shelf vision tools to extract three object-centric information modalities from videos, and then leverage natural language as a medium for fusing this information. Our resulting Multimodal Video Understanding (MVU) framework demonstrates state-of-the-art performance across multiple video understanding benchmarks. Strong performance on robotics-domain tasks also establishes its generality. Code: https://github.com/kahnchana/mvu
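To make the language-mediated fusion idea concrete, below is a minimal Python sketch of the pipeline the abstract describes: per-frame object-centric information is extracted with a vision tool, rendered as natural language, and prepended to the question for the LLM. This is an illustrative sketch, not the authors' implementation; the helper names (`detect_objects`, `fuse_as_language`, `build_prompt`), the `ObjectInfo` fields, and the prompt wording are all assumptions.

```python
# Sketch of natural-language fusion of object-centric video information,
# assuming a stand-in detector; see https://github.com/kahnchana/mvu for
# the actual MVU implementation.
from dataclasses import dataclass
from typing import Any, List


@dataclass
class ObjectInfo:
    label: str     # object category from an off-the-shelf detector (assumed)
    location: str  # coarse spatial location in the frame (assumed)
    motion: str    # coarse motion description across frames (assumed)


def detect_objects(frame: Any) -> List[ObjectInfo]:
    """Placeholder for an off-the-shelf vision tool; returns hard-coded
    object-centric information so the sketch runs end to end."""
    return [ObjectInfo("person", "center", "moving to the right")]


def fuse_as_language(objects: List[ObjectInfo]) -> str:
    """Fuse extracted object-centric modalities into one natural-language
    description, the medium the LLM consumes."""
    return " ".join(
        f"A {o.label} at the {o.location} of the frame, {o.motion}."
        for o in objects
    )


def build_prompt(question: str, frames: List[Any]) -> str:
    """Combine per-frame descriptions with the question into an LLM prompt."""
    context = "\n".join(fuse_as_language(detect_objects(f)) for f in frames)
    return f"Video information:\n{context}\n\nQuestion: {question}\nAnswer:"


if __name__ == "__main__":
    # Frames would normally be decoded video tensors; None stands in here.
    print(build_prompt("What is the person doing?", frames=[None]))
```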
Benchmarks
| Benchmark | Method | Accuracy (%) | Inference Speed (s) |
|---|---|---|---|
| zero-shot-video-question-answer-on-egoschema | MVU (13B) | 60.3 | 2.42 |
| zero-shot-video-question-answer-on-egoschema-1 | MVU (13B) | 37.6 | |
| zero-shot-video-question-answer-on-next-qa | MVU (13B) | 55.2 | |