
Abstract

The past year has witnessed significant advances in video-based large language models. However, the challenge of developing a unified model for both short and long video understanding remains unresolved. Most existing video LLMs cannot handle hour-long videos, while methods tailored to long videos tend to be ineffective for shorter videos and images. In this paper, we identify the key issue as the redundant content in videos. To address this, we propose a novel pooling strategy that simultaneously achieves token compression and instruction-aware visual feature aggregation. Our model is termed Prompt-guided Pooling LLaVA, or PPLLaVA for short. Specifically, PPLLaVA consists of three core components: the CLIP-based visual-prompt alignment that extracts visual information relevant to the user's instructions, the prompt-guided pooling that compresses the visual sequence to arbitrary scales using convolution-style pooling, and the CLIP context extension designed for the lengthy prompts common in visual dialogue. Moreover, our codebase also integrates the most advanced video Direct Preference Optimization (DPO) and visual interleave training. Extensive experiments have validated the performance of our model. With superior throughput and only 1024 visual context tokens, PPLLaVA achieves better results on image benchmarks as a video LLM, while achieving state-of-the-art performance across various video benchmarks, excelling in tasks ranging from caption generation to multiple-choice questions, and handling video lengths from seconds to hours. Code is available at https://github.com/farewellthree/PPLLaVA.
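
To make the idea of prompt-guided pooling concrete, the following is a minimal, simplified sketch and not the PPLLaVA implementation. It assumes visual tokens and the prompt embedding already live in a shared space (as CLIP-style alignment would provide), works on a flat 1D token sequence with non-overlapping windows, and normalizes relevance weights inside each window. The function name `prompt_guided_pool`, the `window` and `temperature` parameters, and the toy vectors are all hypothetical.

```python
import math


def prompt_guided_pool(tokens, prompt, window, temperature=0.1):
    """Compress `tokens` into ceil(len(tokens) / window) pooled vectors.

    tokens: list of equal-length float vectors (hypothetical visual tokens).
    prompt: one float vector in the same space (hypothetical prompt feature).
    Each output is a weighted average of one window of tokens, where the
    weights are a softmax over token-prompt cosine similarity.
    """

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    pooled = []
    for start in range(0, len(tokens), window):
        chunk = tokens[start:start + window]
        # Relevance of each token in this window to the prompt.
        scores = [cosine(t, prompt) / temperature for t in chunk]
        # Numerically stable softmax over the window.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted average: prompt-relevant tokens dominate the pooled vector.
        dim = len(chunk[0])
        pooled.append(
            [sum(w * t[d] for w, t in zip(weights, chunk)) for d in range(dim)]
        )
    return pooled


# Toy usage: four 2-D tokens compressed to two, guided by the prompt.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
prompt = [1.0, 0.0]
compressed = prompt_guided_pool(tokens, prompt, window=2)
```

Changing `window` changes the compression ratio, which loosely mirrors the abstract's point that the visual sequence can be compressed to arbitrary scales. Unlike plain average pooling, tokens that match the prompt contribute more to each pooled vector.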
