PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
Lin Xu, Yilin Zhao, Daquan Zhou, Zhijie Lin, See Kiong Ng, Jiashi Feng

Abstract
Vision-language pre-training has significantly elevated performance across a wide range of image-language applications. Yet, the pre-training process for video-related tasks demands exceptionally large computational and data resources, which hinders the progress of video-language models. This paper investigates a straightforward, highly efficient, and resource-light approach to adapting an existing image-language pre-trained model for dense video understanding. Our preliminary experiments reveal that directly fine-tuning pre-trained image-language models with multiple frames as inputs on video datasets leads to performance saturation or even a drop. Our further investigation reveals that it is largely attributed to the bias of learned high-norm visual features. Motivated by this finding, we propose a simple but effective pooling strategy to smooth the feature distribution along the temporal dimension and thus reduce the dominant impacts from the extreme features. The new model is termed Pooling LLaVA, or PLLaVA in short. PLLaVA achieves new state-of-the-art performance on modern benchmark datasets for both video question-answer and captioning tasks. Notably, on the recent popular Video ChatGPT benchmark, PLLaVA achieves a score of 3.48 out of 5 on average of five evaluated dimensions, exceeding the previous SOTA results from GPT4V (IG-VLM) by 9%. On the latest multi-choice benchmark MVBench, PLLaVA achieves 58.1% accuracy on average across 20 sub-tasks, 14.5% higher than GPT4V (IG-VLM). Code is available at https://github.com/magic-research/PLLaVA.
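Because the pooling strategy is parameter-free, the core idea fits in a few lines of PyTorch. The following is a minimal sketch under stated assumptions, not the paper's exact implementation: the tensor layout, the square patch grid, and the pooled output size `target_shape=(16, 12, 12)` are all illustrative choices; see the repository linked in the abstract for the actual code.

```python
import torch
import torch.nn.functional as F

def pool_video_features(frame_features: torch.Tensor,
                        target_shape=(16, 12, 12)) -> torch.Tensor:
    """Parameter-free adaptive average pooling over video frame features.

    frame_features: (T, N, D) -- T frames, N = H*W patch tokens per frame,
    D feature channels, as produced by a ViT-style image encoder.
    Returns (T' * H' * W', D) visual tokens for the language model, where
    (T', H', W') = target_shape.
    """
    T, N, D = frame_features.shape
    H = W = int(N ** 0.5)  # assumption: square patch grid
    # (T, N, D) -> (1, D, T, H, W): pool jointly over time and space. Averaging
    # smooths the feature distribution along the temporal dimension and damps
    # the dominant effect of extreme high-norm tokens, per the abstract.
    x = frame_features.view(T, H, W, D).permute(3, 0, 1, 2).unsqueeze(0)
    x = F.adaptive_avg_pool3d(x, target_shape)
    # (1, D, T', H', W') -> (T' * H' * W', D)
    return x.squeeze(0).permute(1, 2, 3, 0).reshape(-1, D)

# Example: 16 frames of 24x24 = 576 patch tokens with 1024-d features
# (hypothetical encoder output sizes, chosen only for illustration).
tokens = pool_video_features(torch.randn(16, 576, 1024))
print(tokens.shape)  # torch.Size([2304, 1024]) with the default target_shape
```

Since the pooling has no learned weights, adapting the image model this way introduces no new parameters; only the pooled token sequence fed to the language model changes.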
Code Repositories
https://github.com/magic-research/PLLaVA
Benchmarks
| Benchmark | Model | Metrics |
|---|---|---|
| Video-based generative performance (Video ChatGPT benchmark) | PLLaVA-34B | Correctness of Information: 3.60, Detail Orientation: 3.20, Contextual Understanding: 3.90, Temporal Understanding: 2.67, Consistency: 3.25, Mean: 3.32 (scores out of 5; see the check below the table) |
| Video question answering on MVBench | PLLaVA | Average accuracy: 58.1% |
| Video question answering on TVBench | PLLaVA-7B | Average accuracy: 34.9% |
| Video question answering on TVBench | PLLaVA-13B | Average accuracy: 36.4% |
| Video question answering on TVBench | PLLaVA-34B | Average accuracy: 42.3% |
| Zero-shot video question answering on ActivityNet | PLLaVA-34B | Accuracy: 60.9%, Confidence score: 3.7 |
| Zero-shot video question answering on MSRVTT-QA | PLLaVA-34B | Accuracy: 68.7%, Confidence score: 3.6 |
| Zero-shot video question answering on MSVD-QA | PLLaVA-34B | Accuracy: 79.9%, Confidence score: 4.2 |
| Zero-shot video question answering on TGIF-QA | PLLaVA | Accuracy: 80.6%, Confidence score: 4.3 |
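As a quick arithmetic check of the first row, the reported mean is the unweighted average of the five dimension scores:

```python
# Mean of the five Video ChatGPT dimension scores for PLLaVA-34B above.
scores = [3.60, 3.20, 3.90, 2.67, 3.25]
print(round(sum(scores) / len(scores), 2))  # 3.32
```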