MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

Kunchang Li; Yali Wang; Yinan He; Yizhuo Li; Yi Wang; Yi Liu; Zun Wang; Jilan Xu; Guo Chen; Ping Luo; Limin Wang; Yu Qiao

Abstract

With the rapid development of Multi-modal Large Language Models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models. However, most benchmarks predominantly assess spatial understanding in static image tasks, while overlooking temporal understanding in dynamic video tasks. To alleviate this issue, we introduce a comprehensive Multi-modal Video understanding Benchmark, namely MVBench, which covers 20 challenging video tasks that cannot be effectively solved with a single frame. Specifically, we first introduce a novel static-to-dynamic method to define these temporal-related tasks. By transforming various static tasks into dynamic ones, we enable the systematic generation of video tasks that require a broad spectrum of temporal skills, ranging from perception to cognition. Then, guided by the task definition, we automatically convert public video annotations into multiple-choice QA to evaluate each task. On the one hand, this distinct paradigm allows us to build MVBench efficiently, without much manual intervention. On the other hand, it guarantees evaluation fairness with ground-truth video annotations, avoiding the biased scoring of LLMs. Moreover, we further develop a robust video MLLM baseline, i.e., VideoChat2, through progressive multi-modal training with diverse instruction-tuning data. The extensive results on MVBench reveal that existing MLLMs are far from satisfactory in temporal understanding, while our VideoChat2 surpasses these leading models by over 15% on MVBench. All models and data are available at https://github.com/OpenGVLab/Ask-Anything.
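As a concrete illustration of the annotation-to-QA conversion described in the abstract, the sketch below builds one multiple-choice item from a ground-truth label by sampling distractors from the same task's label pool. This is a minimal sketch with hypothetical field names and a simple distractor-sampling strategy; it is not the paper's actual construction pipeline. Because the correct option comes straight from the ground-truth annotation, scoring reduces to matching an option letter, which is what keeps the evaluation independent of an LLM judge.

```python
import random
from dataclasses import dataclass

@dataclass
class MultipleChoiceQA:
    video_path: str
    question: str
    options: list[str]   # shuffled candidate answers
    answer: str          # ground-truth option letter, e.g. "B"

def annotation_to_mcqa(video_path: str,
                       question: str,
                       gt_label: str,
                       label_pool: list[str],
                       num_options: int = 4,
                       seed: int = 0) -> MultipleChoiceQA:
    """Turn one ground-truth annotation into a multiple-choice QA item.

    Distractors are sampled from other labels of the same task, and the
    correct answer is recorded as an option letter for automatic scoring.
    """
    rng = random.Random(seed)
    distractors = rng.sample([l for l in label_pool if l != gt_label],
                             k=num_options - 1)
    options = distractors + [gt_label]
    rng.shuffle(options)
    letter = chr(ord("A") + options.index(gt_label))
    return MultipleChoiceQA(video_path, question, options, letter)

# Example: an action-sequence annotation becomes a 4-way question.
item = annotation_to_mcqa(
    video_path="clip_0001.mp4",
    question="What did the person do after opening the door?",
    gt_label="picked up the cup",
    label_pool=["picked up the cup", "closed the window", "sat on the sofa",
                "turned off the light", "washed the dishes"],
)
print(item.options, "->", item.answer)
```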

Code Repositories

bytedance/tarsier (PyTorch; mentioned in GitHub)
opengvlab/ask-anything (official; PyTorch; mentioned in GitHub)
magic-research/PLLaVA (PyTorch; mentioned in GitHub)

Benchmarks

Benchmark | Methodology | Metrics
vcgbench-diverse-on-videoinstruct | VideoChat2 | Consistency: 2.27, Contextual Understanding: 2.51, Correctness of Information: 2.13, Dense Captioning: 1.26, Detail Orientation: 2.42, Reasoning: 3.13, Spatial Understanding: 2.43, Temporal Understanding: 1.66, mean: 2.20
video-based-generative-performance | VideoChat2_HD_mistral | Consistency: 2.84, Contextual Understanding: 3.72, Correctness of Information: 3.40, Detail Orientation: 2.91, Temporal Understanding: 2.65, mean: 3.10
video-based-generative-performance | VideoChat2 | Consistency: 2.81, Contextual Understanding: 3.51, Correctness of Information: 3.02, Detail Orientation: 2.88, Temporal Understanding: 2.66, mean: 2.98
video-based-generative-performance-1 | VideoChat2 | gpt-score: 3.02
video-based-generative-performance-1 | VideoChat2_HD_mistral | gpt-score: 3.40
video-based-generative-performance-2 | VideoChat2 | gpt-score: 2.81
video-based-generative-performance-2 | VideoChat2_HD_mistral | gpt-score: 2.62
video-based-generative-performance-3 | VideoChat2_HD_mistral | gpt-score: 3.64
video-based-generative-performance-3 | VideoChat2 | gpt-score: 3.51
video-based-generative-performance-4 | VideoChat2 | gpt-score: 2.88
video-based-generative-performance-4 | VideoChat2_HD_mistral | gpt-score: 2.86
video-based-generative-performance-5 | VideoChat2 | gpt-score: 2.66
video-based-generative-performance-5 | VideoChat2_HD_mistral | gpt-score: 2.65
video-question-answering-on-activitynet-qa | VideoChat2 | Accuracy: 49.1, Confidence Score: 3.3
video-question-answering-on-intentqa | VideoChat2_mistral | Accuracy: 81.9, CH: 86.9, CW: 82.6, TP&TN: 77.0
video-question-answering-on-intentqa | VideoChat2_HD_mistral | Accuracy: 83.4, CH: 90.0, CW: 84.0, TP&TN: 77.3
video-question-answering-on-mvbench | VideoChat2 | Avg.: 51.9
video-question-answering-on-next-qa | VideoChat2_HD_mistral | Accuracy: 79.5
video-question-answering-on-next-qa | VideoChat2_mistral | Accuracy: 78.6
video-question-answering-on-next-qa | VideoChat2 | Accuracy: 68.6
video-question-answering-on-tvbench | VideoChat2 | Average Accuracy: 35.0
zero-shot-learning-on-tvqa | VideoChat2 | Accuracy: 40.6
zero-shot-video-question-answer-on-egoschema | VideoChat2_HD_mistral | Accuracy: 65.6
zero-shot-video-question-answer-on-egoschema | VideoChat2_mistral | Accuracy: 63.6
zero-shot-video-question-answer-on-egoschema-1 | VideoChat2_mistral | Accuracy: 54.4
zero-shot-video-question-answer-on-egoschema-1 | VideoChat2_HD_mistral | Accuracy: 55.8
zero-shot-video-question-answer-on-egoschema-1 | VideoChat2_phi3 | Accuracy: 56.7
zero-shot-video-question-answer-on-next-qa | VideoChat2 | Accuracy: 61.7
zero-shot-video-question-answer-on-star | VideoChat2 | Accuracy: 59.0
zero-shot-video-question-answer-on-tvqa | VideoChat2 (no speech) | Accuracy: 40.6
zero-shot-video-question-answer-on-tvqa | VideoChat_HD_mistral (no speech) | Accuracy: 50.6
zero-shot-video-question-answer-on-tvqa | VideoChat_mistral (no speech) | Accuracy: 46.4
zeroshot-video-question-answer-on-activitynet | VideoChat2 | Accuracy: 49.1, Confidence Score: 3.3
zeroshot-video-question-answer-on-msrvtt-qa | VideoChat2 | Accuracy: 54.1, Confidence Score: 3.3
zeroshot-video-question-answer-on-msvd-qa | VideoChat2 | Accuracy: 70.0, Confidence Score: 3.9
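The MVBench row above (Avg.: 51.9) is a plain multiple-choice accuracy against ground-truth option letters, in contrast to the LLM-judged gpt-score rows. The sketch below shows how such an accuracy could be computed; the answer-extraction regex and the sample field names are assumptions for illustration, not the official evaluation code from the Ask-Anything repository.

```python
import re
from typing import Optional

def extract_option(prediction: str) -> Optional[str]:
    """Pull the first standalone option letter (A-D) out of a model response."""
    match = re.search(r"\b([A-D])\b", prediction.strip().upper())
    return match.group(1) if match else None

def multiple_choice_accuracy(samples: list[dict]) -> float:
    """Fraction of samples whose predicted option letter matches the ground truth.

    Each sample is assumed to carry 'prediction' (raw model output) and
    'answer' (the ground-truth option letter, e.g. 'B').
    """
    correct = sum(extract_option(s["prediction"]) == s["answer"].upper()
                  for s in samples)
    return correct / max(len(samples), 1)

# Toy usage: one correct and one wrong prediction -> 50% accuracy.
samples = [
    {"prediction": "(B) picked up the cup", "answer": "B"},
    {"prediction": "The answer is C.", "answer": "A"},
]
print(f"accuracy = {multiple_choice_accuracy(samples):.1%}")
```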
