
VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding

Muhammad Maaz, Hanoona Rasheed, Salman Khan, Fahad Khan


Abstract

Building on the advances of language models, Large Multimodal Models (LMMs) have contributed significant improvements in video understanding. While the current video LMMs utilize advanced Large Language Models (LLMs), they rely on either image or video encoders to process visual inputs, each of which has its own limitations. Image encoders excel at capturing rich spatial details from frame sequences but lack explicit temporal context, which can be important in videos with intricate action sequences. On the other hand, video encoders provide temporal context but are often limited by computational constraints that lead to processing only sparse frames at lower resolutions, resulting in reduced contextual and spatial understanding. To this end, we introduce VideoGPT+, which combines the complementary benefits of the image encoder (for detailed spatial understanding) and the video encoder (for global temporal context modeling). The model processes videos by dividing them into smaller segments and applies an adaptive pooling strategy on features extracted by both image and video encoders. Our architecture showcases improved performance across multiple video benchmarks, including VCGBench, MVBench and Zero-shot question-answering. Further, we develop a 112K video-instruction set using a novel semi-automatic annotation pipeline, which further improves the model performance. Additionally, to comprehensively evaluate video LMMs, we present VCGBench-Diverse, covering 18 broad video categories such as lifestyle, sports, science, gaming, and surveillance videos. This benchmark with 4,354 question-answer pairs evaluates the generalization of existing LMMs on dense video captioning, spatial and temporal understanding, and complex reasoning, ensuring comprehensive assessment across diverse video types and dynamics. Code: https://github.com/mbzuai-oryx/VideoGPT-plus.
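The segment-wise processing and adaptive pooling described in the abstract can be illustrated with a minimal pure-Python sketch. Everything here is an assumption made for illustration: the two encoder functions are placeholder stubs rather than the encoders used in the paper, the features are single-channel grids, and the segment count, pooled size, and token ordering are arbitrary choices that the abstract does not specify. The pooling windows follow the usual adaptive average pooling rule (as in, for example, PyTorch's `AdaptiveAvgPool2d`).

```python
import math
from typing import List, Sequence

Grid = List[List[float]]  # H x W map of single-channel features (illustrative)


def split_into_segments(frames: Sequence, num_segments: int) -> List[list]:
    """Split frames into `num_segments` contiguous, near-equal chunks."""
    n = len(frames)
    bounds = [(i * n) // num_segments for i in range(num_segments + 1)]
    return [list(frames[bounds[i]:bounds[i + 1]]) for i in range(num_segments)]


def adaptive_avg_pool_2d(grid: Grid, out_h: int, out_w: int) -> Grid:
    """Average-pool an H x W grid down to out_h x out_w.

    Window i spans [floor(i*H/out_h), ceil((i+1)*H/out_h)), and likewise for columns.
    """
    in_h, in_w = len(grid), len(grid[0])
    pooled = []
    for i in range(out_h):
        r0, r1 = (i * in_h) // out_h, math.ceil((i + 1) * in_h / out_h)
        row = []
        for j in range(out_w):
            c0, c1 = (j * in_w) // out_w, math.ceil((j + 1) * in_w / out_w)
            window = [grid[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(window) / len(window))
        pooled.append(row)
    return pooled


def image_encoder_stub(frame: Grid) -> Grid:
    """Hypothetical stand-in: a real image encoder returns per-frame spatial tokens."""
    return frame


def video_encoder_stub(segment: List[Grid]) -> Grid:
    """Hypothetical stand-in: averages frames to mimic one segment-level feature map."""
    h, w = len(segment[0]), len(segment[0][0])
    return [[sum(f[r][c] for f in segment) / len(segment) for c in range(w)]
            for r in range(h)]


def encode_video(frames: Sequence[Grid], num_segments: int = 4,
                 pooled_size: int = 2) -> List[float]:
    """Pool image features per frame and video features per segment, then concatenate."""
    tokens: List[float] = []
    for segment in split_into_segments(frames, num_segments):
        for frame in segment:
            pooled = adaptive_avg_pool_2d(image_encoder_stub(frame), pooled_size, pooled_size)
            tokens.extend(v for row in pooled for v in row)
        pooled = adaptive_avg_pool_2d(video_encoder_stub(segment), pooled_size, pooled_size)
        tokens.extend(v for row in pooled for v in row)
    return tokens


# Eight toy 4x4 frames; frame t is filled with the value t.
frames = [[[float(t)] * 4 for _ in range(4)] for t in range(8)]
tokens = encode_video(frames)
print(len(tokens))
```

With these toy settings, each of the four segments contributes two frames of 2x2 pooled image features plus one 2x2 pooled video feature map, so pooling is what keeps the total token count small as more frames are added.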

Code Repositories

mbzuai-oryx/videogpt-plus (official, PyTorch)

Benchmarks

Benchmark                                      Methodology  Metrics
vcgbench-diverse-on-videoinstruct              VideoGPT+    Consistency: 2.59; Contextual Understanding: 2.81; Correctness of Information: 2.46; Dense Captioning: 1.38; Detail Orientation: 2.73; Reasoning: 3.63; Spatial Understanding: 2.80; Temporal Understanding: 1.78; mean: 2.47
video-based-generative-performance             VideoGPT+    Consistency: 3.39; Contextual Understanding: 3.74; Correctness of Information: 3.27; Detail Orientation: 3.18; Temporal Understanding: 2.83; mean: 3.28
video-based-generative-performance-1           VideoGPT+    gpt-score: 3.27
video-based-generative-performance-2           VideoGPT+    gpt-score: 3.39
video-based-generative-performance-3           VideoGPT+    gpt-score: 3.74
video-based-generative-performance-4           VideoGPT+    gpt-score: 3.18
video-based-generative-performance-5           VideoGPT+    gpt-score: 2.83
video-question-answering-on-mvbench            VideoGPT+    Avg.: 58.7
video-question-answering-on-tvbench            VideoGPT+    Average Accuracy: 41.7
zeroshot-video-question-answer-on-activitynet  VideoGPT+    Accuracy: 50.6; Confidence Score: 3.6
zeroshot-video-question-answer-on-msrvtt-qa    VideoGPT+    Accuracy: 60.6; Confidence Score: 3.6
zeroshot-video-question-answer-on-msvd-qa      VideoGPT+    Accuracy: 72.4; Confidence Score: 3.6
zeroshot-video-question-answer-on-tgif-qa      VideoGPT+    Accuracy: 74.6; Confidence Score: 4.1
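
The two reported means can be checked against the listed scores with a short calculation. The sketch below uses only the numbers shown above. The grouping of five "core" metrics for vcgbench-diverse-on-videoinstruct is an inference from the arithmetic, not something this page states: the reported mean of 2.47 matches the average of the five metrics it shares with video-based-generative-performance, not the average of all eight listed scores.

```python
# Scores copied from the benchmark listing above.
vcgbench = {
    "Consistency": 3.39,
    "Contextual Understanding": 3.74,
    "Correctness of Information": 3.27,
    "Detail Orientation": 3.18,
    "Temporal Understanding": 2.83,
}
vcgbench_diverse = {
    "Consistency": 2.59,
    "Contextual Understanding": 2.81,
    "Correctness of Information": 2.46,
    "Dense Captioning": 1.38,
    "Detail Orientation": 2.73,
    "Reasoning": 3.63,
    "Spatial Understanding": 2.80,
    "Temporal Understanding": 1.78,
}


def mean_score(scores, keys=None):
    """Average the selected metrics (all of them by default), rounded to 2 decimals."""
    keys = list(scores) if keys is None else list(keys)
    return round(sum(scores[k] for k in keys) / len(keys), 2)


core_metrics = list(vcgbench)  # assumed grouping: the five metrics shared by both blocks

print(mean_score(vcgbench))                        # matches the reported 3.28
print(mean_score(vcgbench_diverse, core_metrics))  # matches the reported 2.47
print(mean_score(vcgbench_diverse))                # all eight scores: 2.52
```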
