PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
Lin Xu, Yilin Zhao, Daquan Zhou, Zhijie Lin, See Kiong Ng, Jiashi Feng

Abstract
Vision-language pre-training has significantly elevated performance across a wide range of image-language applications. Yet, the pre-training process for video-related tasks demands exceptionally large computational and data resources, which hinders the progress of video-language models. This paper investigates a straightforward, highly efficient, and resource-light approach to adapting an existing image-language pre-trained model for dense video understanding. Our preliminary experiments reveal that directly fine-tuning pre-trained image-language models with multiple frames as inputs on video datasets leads to performance saturation or even a drop. Our further investigation reveals that it is largely attributed to the bias of learned high-norm visual features. Motivated by this finding, we propose a simple but effective pooling strategy to smooth the feature distribution along the temporal dimension and thus reduce the dominant impacts from the extreme features. The new model is termed Pooling LLaVA, or PLLaVA in short. PLLaVA achieves new state-of-the-art performance on modern benchmark datasets for both video question-answer and captioning tasks. Notably, on the recent popular Video ChatGPT benchmark, PLLaVA achieves a score of 3.48 out of 5 on average of five evaluated dimensions, exceeding the previous SOTA results from GPT4V (IG-VLM) by 9%. On the latest multi-choice benchmark MVBench, PLLaVA achieves 58.1% accuracy on average across 20 sub-tasks, 14.5% higher than GPT4V (IG-VLM). Code is available at https://github.com/magic-research/PLLaVA.
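Because the pooling strategy is parameter-free, the core idea fits in a few lines of PyTorch. The following is a minimal sketch under stated assumptions, not the paper's exact implementation: the tensor layout, the square patch grid, and the pooled output size `target_shape=(16, 12, 12)` are all illustrative choices; see the repository linked in the abstract for the actual code.

```python
import torch
import torch.nn.functional as F

def pool_video_features(frame_features: torch.Tensor,
                        target_shape=(16, 12, 12)) -> torch.Tensor:
    """Parameter-free adaptive average pooling over video frame features.

    frame_features: (T, N, D) -- T frames, N = H*W patch tokens per frame,
    D feature channels, as produced by a ViT-style image encoder.
    Returns (T' * H' * W', D) visual tokens for the language model, where
    (T', H', W') = target_shape.
    """
    T, N, D = frame_features.shape
    H = W = int(N ** 0.5)  # assumption: square patch grid
    # (T, N, D) -> (1, D, T, H, W): pool jointly over time and space. Averaging
    # smooths the feature distribution along the temporal dimension and damps
    # the dominant effect of extreme high-norm tokens, per the abstract.
    x = frame_features.view(T, H, W, D).permute(3, 0, 1, 2).unsqueeze(0)
    x = F.adaptive_avg_pool3d(x, target_shape)
    # (1, D, T', H', W') -> (T' * H' * W', D)
    return x.squeeze(0).permute(1, 2, 3, 0).reshape(-1, D)

# Example: 16 frames of 24x24 = 576 patch tokens with 1024-d features
# (hypothetical encoder output sizes, chosen only for illustration).
tokens = pool_video_features(torch.randn(16, 576, 1024))
print(tokens.shape)  # torch.Size([2304, 1024]) with the default target_shape
```

Since the pooling has no learned weights, adapting the image model this way introduces no new parameters; only the pooled token sequence fed to the language model changes.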
Code Repositories
https://github.com/magic-research/PLLaVA
Benchmarks
| Benchmark | Model | Metrics |
|---|---|---|
| Video-based generative performance (Video ChatGPT benchmark) | PLLaVA-34B | Correctness of Information: 3.60, Detail Orientation: 3.20, Contextual Understanding: 3.90, Temporal Understanding: 2.67, Consistency: 3.25, Mean: 3.32 (scores out of 5; see the check below the table) |
| Video question answering on MVBench | PLLaVA | Average accuracy: 58.1% |
| Video question answering on TVBench | PLLaVA-7B | Average accuracy: 34.9% |
| Video question answering on TVBench | PLLaVA-13B | Average accuracy: 36.4% |
| Video question answering on TVBench | PLLaVA-34B | Average accuracy: 42.3% |
| Zero-shot video question answering on ActivityNet | PLLaVA-34B | Accuracy: 60.9%, Confidence score: 3.7 |
| Zero-shot video question answering on MSRVTT-QA | PLLaVA-34B | Accuracy: 68.7%, Confidence score: 3.6 |
| Zero-shot video question answering on MSVD-QA | PLLaVA-34B | Accuracy: 79.9%, Confidence score: 4.2 |
| Zero-shot video question answering on TGIF-QA | PLLaVA | Accuracy: 80.6%, Confidence score: 4.3 |
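As a quick arithmetic check of the first row, the reported mean is the unweighted average of the five dimension scores:

```python
# Mean of the five Video ChatGPT dimension scores for PLLaVA-34B above.
scores = [3.60, 3.20, 3.90, 2.67, 3.25]
print(round(sum(scores) / len(scores), 2))  # 3.32
```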