8 months ago

Abstract

Building on the advances of language models, Large Multimodal Models (LMMs)have contributed significant improvements in video understanding. While thecurrent video LMMs utilize advanced Large Language Models (LLMs), they rely oneither image or video encoders to process visual inputs, each of which has itsown limitations. Image encoders excel at capturing rich spatial details fromframe sequences but lack explicit temporal context, which can be important invideos with intricate action sequences. On the other hand, video encodersprovide temporal context but are often limited by computational constraintsthat lead to processing only sparse frames at lower resolutions, resulting inreduced contextual and spatial understanding. To this end, we introduceVideoGPT+, which combines the complementary benefits of the image encoder (fordetailed spatial understanding) and the video encoder (for global temporalcontext modeling). The model processes videos by dividing them into smallersegments and applies an adaptive pooling strategy on features extracted by bothimage and video encoders. Our architecture showcases improved performanceacross multiple video benchmarks, including VCGBench, MVBench and Zero-shotquestion-answering. Further, we develop 112K video-instruction set using anovel semi-automatic annotation pipeline which further improves the modelperformance. Additionally, to comprehensively evaluate video LMMs, we presentVCGBench-Diverse, covering 18 broad video categories such as lifestyle, sports,science, gaming, and surveillance videos. This benchmark with 4,354question-answer pairs evaluates the generalization of existing LMMs on densevideo captioning, spatial and temporal understanding, and complex reasoning,ensuring comprehensive assessment across diverse video types and dynamics.Code: https://github.com/mbzuai-oryx/VideoGPT-plus.

Source PDF View Code