MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

Kunchang Li; Yali Wang; Yinan He; Yizhuo Li; Yi Wang; Yi Liu; Zun Wang; Jilan Xu; Guo Chen; Ping Luo; Limin Wang; Yu Qiao

Abstract

With the rapid development of Multi-modal Large Language Models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models. However, most benchmarks predominantly assess spatial understanding in static image tasks, while overlooking temporal understanding in dynamic video tasks. To alleviate this issue, we introduce a comprehensive Multi-modal Video understanding Benchmark, namely MVBench, which covers 20 challenging video tasks that cannot be effectively solved with a single frame. Specifically, we first introduce a novel static-to-dynamic method to define these temporal-related tasks. By transforming various static tasks into dynamic ones, we enable the systematic generation of video tasks that require a broad spectrum of temporal skills, ranging from perception to cognition. Then, guided by the task definition, we automatically convert public video annotations into multiple-choice QA to evaluate each task. On the one hand, this distinct paradigm allows us to build MVBench efficiently, without much manual intervention. On the other hand, it guarantees evaluation fairness with ground-truth video annotations, avoiding the biased scoring of LLMs. Moreover, we further develop a robust video MLLM baseline, i.e., VideoChat2, through progressive multi-modal training with diverse instruction-tuning data. The extensive results on MVBench reveal that existing MLLMs are far from satisfactory in temporal understanding, while our VideoChat2 surpasses these leading models by over 15% on MVBench. All models and data are available at https://github.com/OpenGVLab/Ask-Anything.
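As a concrete illustration of the annotation-to-QA conversion described in the abstract, the sketch below builds one multiple-choice item from a ground-truth label by sampling distractors from the same task's label pool. This is a minimal sketch with hypothetical field names and a simple distractor-sampling strategy; it is not the paper's actual construction pipeline. Because the correct option comes straight from the ground-truth annotation, scoring reduces to matching an option letter, which is what keeps the evaluation independent of an LLM judge.

```python
import random
from dataclasses import dataclass

@dataclass
class MultipleChoiceQA:
    video_path: str
    question: str
    options: list[str]   # shuffled candidate answers
    answer: str          # ground-truth option letter, e.g. "B"

def annotation_to_mcqa(video_path: str,
                       question: str,
                       gt_label: str,
                       label_pool: list[str],
                       num_options: int = 4,
                       seed: int = 0) -> MultipleChoiceQA:
    """Turn one ground-truth annotation into a multiple-choice QA item.

    Distractors are sampled from other labels of the same task, and the
    correct answer is recorded as an option letter for automatic scoring.
    """
    rng = random.Random(seed)
    distractors = rng.sample([l for l in label_pool if l != gt_label],
                             k=num_options - 1)
    options = distractors + [gt_label]
    rng.shuffle(options)
    letter = chr(ord("A") + options.index(gt_label))
    return MultipleChoiceQA(video_path, question, options, letter)

# Example: an action-sequence annotation becomes a 4-way question.
item = annotation_to_mcqa(
    video_path="clip_0001.mp4",
    question="What did the person do after opening the door?",
    gt_label="picked up the cup",
    label_pool=["picked up the cup", "closed the window", "sat on the sofa",
                "turned off the light", "washed the dishes"],
)
print(item.options, "->", item.answer)
```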

Code Repositories

bytedance/tarsier (PyTorch; mentioned in GitHub)
opengvlab/ask-anything (official; PyTorch; mentioned in GitHub)
magic-research/PLLaVA (PyTorch; mentioned in GitHub)

Benchmarks

Benchmark | Methodology | Metrics
vcgbench-diverse-on-videoinstruct | VideoChat2 | Consistency: 2.27, Contextual Understanding: 2.51, Correctness of Information: 2.13, Dense Captioning: 1.26, Detail Orientation: 2.42, Reasoning: 3.13, Spatial Understanding: 2.43, Temporal Understanding: 1.66, mean: 2.20
video-based-generative-performance | VideoChat2_HD_mistral | Consistency: 2.84, Contextual Understanding: 3.72, Correctness of Information: 3.40, Detail Orientation: 2.91, Temporal Understanding: 2.65, mean: 3.10
video-based-generative-performance | VideoChat2 | Consistency: 2.81, Contextual Understanding: 3.51, Correctness of Information: 3.02, Detail Orientation: 2.88, Temporal Understanding: 2.66, mean: 2.98
video-based-generative-performance-1 | VideoChat2 | gpt-score: 3.02
video-based-generative-performance-1 | VideoChat2_HD_mistral | gpt-score: 3.40
video-based-generative-performance-2 | VideoChat2 | gpt-score: 2.81
video-based-generative-performance-2 | VideoChat2_HD_mistral | gpt-score: 2.62
video-based-generative-performance-3 | VideoChat2_HD_mistral | gpt-score: 3.64
video-based-generative-performance-3 | VideoChat2 | gpt-score: 3.51
video-based-generative-performance-4 | VideoChat2 | gpt-score: 2.88
video-based-generative-performance-4 | VideoChat2_HD_mistral | gpt-score: 2.86
video-based-generative-performance-5 | VideoChat2 | gpt-score: 2.66
video-based-generative-performance-5 | VideoChat2_HD_mistral | gpt-score: 2.65
video-question-answering-on-activitynet-qa | VideoChat2 | Accuracy: 49.1, Confidence Score: 3.3
video-question-answering-on-intentqa | VideoChat2_mistral | Accuracy: 81.9, CH: 86.9, CW: 82.6, TP&TN: 77.0
video-question-answering-on-intentqa | VideoChat2_HD_mistral | Accuracy: 83.4, CH: 90.0, CW: 84.0, TP&TN: 77.3
video-question-answering-on-mvbench | VideoChat2 | Avg.: 51.9
video-question-answering-on-next-qa | VideoChat2_HD_mistral | Accuracy: 79.5
video-question-answering-on-next-qa | VideoChat2_mistral | Accuracy: 78.6
video-question-answering-on-next-qa | VideoChat2 | Accuracy: 68.6
video-question-answering-on-tvbench | VideoChat2 | Average Accuracy: 35.0
zero-shot-learning-on-tvqa | VideoChat2 | Accuracy: 40.6
zero-shot-video-question-answer-on-egoschema | VideoChat2_HD_mistral | Accuracy: 65.6
zero-shot-video-question-answer-on-egoschema | VideoChat2_mistral | Accuracy: 63.6
zero-shot-video-question-answer-on-egoschema-1 | VideoChat2_mistral | Accuracy: 54.4
zero-shot-video-question-answer-on-egoschema-1 | VideoChat2_HD_mistral | Accuracy: 55.8
zero-shot-video-question-answer-on-egoschema-1 | VideoChat2_phi3 | Accuracy: 56.7
zero-shot-video-question-answer-on-next-qa | VideoChat2 | Accuracy: 61.7
zero-shot-video-question-answer-on-star | VideoChat2 | Accuracy: 59.0
zero-shot-video-question-answer-on-tvqa | VideoChat2 (no speech) | Accuracy: 40.6
zero-shot-video-question-answer-on-tvqa | VideoChat_HD_mistral (no speech) | Accuracy: 50.6
zero-shot-video-question-answer-on-tvqa | VideoChat_mistral (no speech) | Accuracy: 46.4
zeroshot-video-question-answer-on-activitynet | VideoChat2 | Accuracy: 49.1, Confidence Score: 3.3
zeroshot-video-question-answer-on-msrvtt-qa | VideoChat2 | Accuracy: 54.1, Confidence Score: 3.3
zeroshot-video-question-answer-on-msvd-qa | VideoChat2 | Accuracy: 70.0, Confidence Score: 3.9
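The MVBench row above (Avg.: 51.9) is a plain multiple-choice accuracy against ground-truth option letters, in contrast to the LLM-judged gpt-score rows. The sketch below shows how such an accuracy could be computed; the answer-extraction regex and the sample field names are assumptions for illustration, not the official evaluation code from the Ask-Anything repository.

```python
import re
from typing import Optional

def extract_option(prediction: str) -> Optional[str]:
    """Pull the first standalone option letter (A-D) out of a model response."""
    match = re.search(r"\b([A-D])\b", prediction.strip().upper())
    return match.group(1) if match else None

def multiple_choice_accuracy(samples: list[dict]) -> float:
    """Fraction of samples whose predicted option letter matches the ground truth.

    Each sample is assumed to carry 'prediction' (raw model output) and
    'answer' (the ground-truth option letter, e.g. 'B').
    """
    correct = sum(extract_option(s["prediction"]) == s["answer"].upper()
                  for s in samples)
    return correct / max(len(samples), 1)

# Toy usage: one correct and one wrong prediction -> 50% accuracy.
samples = [
    {"prediction": "(B) picked up the cup", "answer": "B"},
    {"prediction": "The answer is C.", "answer": "A"},
]
print(f"accuracy = {multiple_choice_accuracy(samples):.1%}")
```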
