4 months ago

PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance

Ruyang Liu, Haoran Tang, Haibo Liu, Yixiao Ge, Ying Shan, Chen Li, Jiankun Yang

Abstract

The past year has witnessed significant advances in video-based large language models. However, the challenge of developing a unified model for both short and long video understanding remains unresolved. Most existing video LLMs cannot handle hour-long videos, while methods tailored to long videos tend to be ineffective for shorter videos and images. In this paper, we identify the key issue as the redundant content in videos. To address this, we propose a novel pooling strategy that simultaneously achieves token compression and instruction-aware visual feature aggregation. Our model is termed Prompt-guided Pooling LLaVA, or PPLLaVA for short. Specifically, PPLLaVA consists of three core components: the CLIP-based visual-prompt alignment that extracts visual information relevant to the user's instructions, the prompt-guided pooling that compresses the visual sequence to arbitrary scales using convolution-style pooling, and the CLIP context extension designed for the lengthy prompts common in visual dialogue. Moreover, our codebase also integrates the most advanced video Direct Preference Optimization (DPO) and visual interleave training. Extensive experiments have validated the performance of our model. With superior throughput and a visual context of only 1024 tokens, PPLLaVA achieves better results on image benchmarks as a video LLM, while achieving state-of-the-art performance across various video benchmarks, excelling in tasks ranging from caption generation to multiple-choice questions, and handling video lengths from seconds to hours. Code is available at https://github.com/farewellthree/PPLLaVA.
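The pooling idea in the abstract can be illustrated with a minimal NumPy sketch. This is one plausible reading of "prompt-guided pooling", not the authors' implementation: the function name `prompt_guided_pool`, the `temperature` parameter, the adaptive window boundaries, and the per-window renormalization of the weights are all assumptions made for illustration. Each visual token is scored by its cosine similarity to a prompt embedding, and each output token is the relevance-weighted average of the tokens in its spatiotemporal window.

```python
import numpy as np


def prompt_guided_pool(visual, prompt, out_shape, temperature=0.01):
    """Pool visual tokens of shape (T, H, W, D) down to out_shape = (t, h, w).

    Hypothetical sketch: each output token is a weighted average of the
    tokens in its window, weighted by similarity to the prompt embedding.
    Returns an array of shape (t * h * w, D).
    """
    T, H, W, D = visual.shape
    t_out, h_out, w_out = out_shape

    # Cosine similarity between every visual token and the prompt.
    v_norm = visual / np.linalg.norm(visual, axis=-1, keepdims=True)
    p_norm = prompt / np.linalg.norm(prompt)
    sim = v_norm @ p_norm  # shape (T, H, W)

    # Softmax over all tokens turns similarities into relevance weights.
    logits = sim / temperature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()

    def edges(n_in, n_out):
        # Adaptive windows [floor(i*n_in/n_out), ceil((i+1)*n_in/n_out)),
        # so any output size up to n_in yields non-empty windows.
        return [(i * n_in // n_out, -(-(i + 1) * n_in // n_out))
                for i in range(n_out)]

    pooled = np.zeros((t_out, h_out, w_out, D))
    for a, (t0, t1) in enumerate(edges(T, t_out)):
        for b, (h0, h1) in enumerate(edges(H, h_out)):
            for c, (w0, w1) in enumerate(edges(W, w_out)):
                win_w = weights[t0:t1, h0:h1, w0:w1]
                win_v = visual[t0:t1, h0:h1, w0:w1]
                # Renormalize inside the window (an assumption) so the
                # output stays on the same scale as the input tokens.
                win_w = win_w / win_w.sum()
                pooled[a, b, c] = np.tensordot(win_w, win_v, axes=3)
    return pooled.reshape(-1, D)


# Illustrative sizes only: 16 frames of 24x24 tokens pooled to 4x16x16.
rng = np.random.default_rng(0)
visual = rng.normal(size=(16, 24, 24, 64))
prompt = rng.normal(size=64)
tokens = prompt_guided_pool(visual, prompt, out_shape=(4, 16, 16))
print(tokens.shape)  # (1024, 64)
```

The output size is a free parameter, which is what lets a fixed token budget cover both short and long inputs in this sketch. The 4x16x16 target is chosen only because it yields 1024 tokens; it is not taken from the paper.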

Code Repositories

farewellthree/ppllava (official, PyTorch)

Benchmarks

| Benchmark | Methodology | Metrics |
|---|---|---|
| video-based-generative-performance | PPLLaVA-7B | Consistency: 3.20; Contextual Understanding: 3.88; Correctness of Information: 3.32; Detail Orientation: 3.20; Temporal Understanding: 3.0; mean: 3.32 |
| video-based-generative-performance | PPLLaVA-7B-dpo | Consistency: 3.81; Contextual Understanding: 4.21; Correctness of Information: 3.85; Detail Orientation: 3.56; Temporal Understanding: 3.21; mean: 3.73 |
| video-based-generative-performance-1 | PPLLaVA-7B | gpt-score: 3.85 |
| video-based-generative-performance-2 | PPLLaVA-7B | gpt-score: 3.81 |
| video-based-generative-performance-3 | PPLLaVA-7B | gpt-score: 4.21 |
| video-based-generative-performance-4 | PPLLaVA-7B | gpt-score: 3.56 |
| video-based-generative-performance-5 | PPLLaVA-7B | gpt-score: 3.21 |
| video-question-answering-on-mvbench | PPLLaVA (7b) | Avg.: 59.2 |
| zeroshot-video-question-answer-on-activitynet | PPLLaVA-7B | Accuracy: 60.7; Confidence Score: 3.6 |
| zeroshot-video-question-answer-on-msrvtt-qa | PPLLaVA-7B | Accuracy: 64.3; Confidence Score: 3.5 |
| zeroshot-video-question-answer-on-msvd-qa | PPLLaVA-7B | Accuracy: 77.1; Confidence Score: 4.0 |
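
The "mean" values reported for the two video-based-generative-performance entries are consistent with a plain arithmetic average of the five listed sub-scores. The short check below reproduces them from the numbers above; the helper name `mean_score` is only illustrative.

```python
def mean_score(scores):
    """Arithmetic mean of the sub-scores, rounded to two decimals."""
    return round(sum(scores.values()) / len(scores), 2)


base = {
    "Consistency": 3.20,
    "Contextual Understanding": 3.88,
    "Correctness of Information": 3.32,
    "Detail Orientation": 3.20,
    "Temporal Understanding": 3.0,
}
dpo = {
    "Consistency": 3.81,
    "Contextual Understanding": 4.21,
    "Correctness of Information": 3.85,
    "Detail Orientation": 3.56,
    "Temporal Understanding": 3.21,
}

print(mean_score(base))  # 3.32
print(mean_score(dpo))   # 3.73
```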
