VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding
Maaz Muhammad ; Rasheed Hanoona ; Khan Salman ; Khan Fahad

Abstract
Building on the advances of language models, Large Multimodal Models (LMMs) have contributed significant improvements in video understanding. While the current video LMMs utilize advanced Large Language Models (LLMs), they rely on either image or video encoders to process visual inputs, each of which has its own limitations. Image encoders excel at capturing rich spatial details from frame sequences but lack explicit temporal context, which can be important in videos with intricate action sequences. On the other hand, video encoders provide temporal context but are often limited by computational constraints that lead to processing only sparse frames at lower resolutions, resulting in reduced contextual and spatial understanding. To this end, we introduce VideoGPT+, which combines the complementary benefits of the image encoder (for detailed spatial understanding) and the video encoder (for global temporal context modeling). The model processes videos by dividing them into smaller segments and applies an adaptive pooling strategy on features extracted by both image and video encoders. Our architecture showcases improved performance across multiple video benchmarks, including VCGBench, MVBench and zero-shot question-answering. Further, we develop a 112K video-instruction set using a novel semi-automatic annotation pipeline, which further improves the model performance. Additionally, to comprehensively evaluate video LMMs, we present VCGBench-Diverse, covering 18 broad video categories such as lifestyle, sports, science, gaming, and surveillance videos. This benchmark, with 4,354 question-answer pairs, evaluates the generalization of existing LMMs on dense video captioning, spatial and temporal understanding, and complex reasoning, ensuring comprehensive assessment across diverse video types and dynamics. Code: https://github.com/mbzuai-oryx/VideoGPT-plus.
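
The segment-wise processing and adaptive pooling mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the helper names `split_into_segments` and `adaptive_avg_pool_2d`, the segment count, and the grid sizes are all assumptions chosen for illustration, and each grid cell holds a single number standing in for a feature vector.

```python
import math


def split_into_segments(num_frames, num_segments):
    """Split frame indices 0..num_frames-1 into contiguous, near-equal segments.

    Hypothetical helper; the real segment length is not given in the abstract.
    """
    bounds = [(i * num_frames) // num_segments for i in range(num_segments + 1)]
    return [list(range(bounds[i], bounds[i + 1])) for i in range(num_segments)]


def adaptive_avg_pool_2d(grid, out_h, out_w):
    """Average-pool a 2D feature grid down to (out_h, out_w).

    Window boundaries are derived from the input and output sizes, so the
    same function works for any input resolution (hence "adaptive").
    """
    in_h, in_w = len(grid), len(grid[0])
    pooled = []
    for i in range(out_h):
        r0 = (i * in_h) // out_h
        r1 = math.ceil((i + 1) * in_h / out_h)
        row = []
        for j in range(out_w):
            c0 = (j * in_w) // out_w
            c1 = math.ceil((j + 1) * in_w / out_w)
            window = [grid[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(window) / len(window))
        pooled.append(row)
    return pooled


# Toy example: 16 frames in 4 segments, and a 4x4 feature grid pooled to 2x2.
segments = split_into_segments(16, 4)
features = [[4 * r + c for c in range(4)] for r in range(4)]
pooled = adaptive_avg_pool_2d(features, 2, 2)
```

Pooling reduces the number of visual tokens per frame or segment, which is one plausible way to keep the combined image and video features within the language model's context budget.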
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| vcgbench-diverse-on-videoinstruct | VideoGPT+ | Consistency: 2.59; Contextual Understanding: 2.81; Correctness of Information: 2.46; Dense Captioning: 1.38; Detail Orientation: 2.73; Reasoning: 3.63; Spatial Understanding: 2.80; Temporal Understanding: 1.78; mean: 2.47 |
| video-based-generative-performance | VideoGPT+ | Consistency: 3.39; Contextual Understanding: 3.74; Correctness of Information: 3.27; Detail Orientation: 3.18; Temporal Understanding: 2.83; mean: 3.28 |
| video-based-generative-performance-1 | VideoGPT+ | gpt-score: 3.27 |
| video-based-generative-performance-2 | VideoGPT+ | gpt-score: 3.39 |
| video-based-generative-performance-3 | VideoGPT+ | gpt-score: 3.74 |
| video-based-generative-performance-4 | VideoGPT+ | gpt-score: 3.18 |
| video-based-generative-performance-5 | VideoGPT+ | gpt-score: 2.83 |
| video-question-answering-on-mvbench | VideoGPT+ | Avg.: 58.7 |
| video-question-answering-on-tvbench | VideoGPT+ | Average Accuracy: 41.7 |
| zeroshot-video-question-answer-on-activitynet | VideoGPT+ | Accuracy: 50.6; Confidence Score: 3.6 |
| zeroshot-video-question-answer-on-msrvtt-qa | VideoGPT+ | Accuracy: 60.6; Confidence Score: 3.6 |
| zeroshot-video-question-answer-on-msvd-qa | VideoGPT+ | Accuracy: 72.4; Confidence Score: 3.6 |
| zeroshot-video-question-answer-on-tgif-qa | VideoGPT+ | Accuracy: 74.6; Confidence Score: 4.1 |
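
The reported means in the first two rows can be checked with simple arithmetic. The sketch below uses only the scores from the table; the variable and function names are illustrative. It suggests that the VCGBench-Diverse mean of 2.47 is consistent with averaging only the five dimensions it shares with the video-based-generative-performance row, leaving out Dense Captioning, Reasoning, and Spatial Understanding. That reading is an inference from the numbers, not something the table states.

```python
# Scores copied from the benchmark table above.
vcgbench = {
    "Consistency": 3.39,
    "Contextual Understanding": 3.74,
    "Correctness of Information": 3.27,
    "Detail Orientation": 3.18,
    "Temporal Understanding": 2.83,
}
vcgbench_diverse = {
    "Consistency": 2.59,
    "Contextual Understanding": 2.81,
    "Correctness of Information": 2.46,
    "Dense Captioning": 1.38,
    "Detail Orientation": 2.73,
    "Reasoning": 3.63,
    "Spatial Understanding": 2.80,
    "Temporal Understanding": 1.78,
}


def mean_score(scores, keys):
    """Average the selected dimensions, rounded to two decimals."""
    return round(sum(scores[k] for k in keys) / len(keys), 2)


shared = list(vcgbench)  # the five dimensions present in both rows

print(mean_score(vcgbench, shared))                          # 3.28
print(mean_score(vcgbench_diverse, shared))                  # 2.47
print(mean_score(vcgbench_diverse, list(vcgbench_diverse)))  # 2.52
```

Averaging all eight VCGBench-Diverse dimensions gives 2.52 instead, so the five-dimension average is the one that matches the listed mean.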