
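Abstract
With the rapid development of multimodal large language models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models. However, most of these benchmarks focus on spatial understanding in static image tasks and overlook temporal understanding in dynamic video tasks. To address this gap, we introduce MVBench, a comprehensive multimodal video understanding benchmark that covers 20 challenging video tasks that cannot be solved effectively from a single frame. Specifically, we first propose a novel static-to-dynamic method to define these temporal tasks. By transforming various static tasks into dynamic ones, we can systematically generate video tasks that require a broad range of temporal skills, from perception to cognition. Then, guided by the task definitions, we automatically convert public video annotations into multiple-choice question answering (QA) to evaluate each task. On the one hand, this paradigm lets us build MVBench efficiently, with little manual intervention. On the other hand, it guarantees evaluation fairness by relying on ground-truth video annotations, which avoids the biased scoring of LLMs. In addition, we develop a robust video MLLM baseline, VideoChat2, through progressive multimodal training on diverse instruction-tuning data. Extensive results on MVBench show that existing MLLMs are far from satisfactory in temporal understanding, while VideoChat2 surpasses these leading models by more than 15% on MVBench. All models and data are available at https://github.com/OpenGVLab/Ask-Anything.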
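The annotation-to-QA conversion described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names (`annotation_to_mcq`, `accuracy`), the `MultipleChoiceItem` fields, and the random sampling of distractors from a label pool are all assumptions made for illustration. The only idea taken from the abstract is that a ground-truth annotation becomes the correct option of a multiple-choice question, so a model can be scored by exact option matching rather than by an LLM judge.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class MultipleChoiceItem:
    """One multiple-choice question built from a video annotation (hypothetical schema)."""
    question: str
    options: List[str]
    answer_index: int


def annotation_to_mcq(question: str,
                      correct_label: str,
                      candidate_labels: Sequence[str],
                      num_options: int = 4,
                      seed: int = 0) -> MultipleChoiceItem:
    """Turn a ground-truth label into a multiple-choice item.

    Assumption: distractors are sampled uniformly from the other labels in
    the dataset; the abstract does not say how distractors are chosen.
    """
    rng = random.Random(seed)
    distractor_pool = sorted(set(candidate_labels) - {correct_label})
    if len(distractor_pool) < num_options - 1:
        raise ValueError("not enough distinct labels to build distractors")
    options = rng.sample(distractor_pool, num_options - 1) + [correct_label]
    rng.shuffle(options)  # avoid a fixed position for the correct answer
    return MultipleChoiceItem(question, options, options.index(correct_label))


def accuracy(items: Sequence[MultipleChoiceItem],
             predicted_indices: Sequence[int]) -> float:
    """Percentage of items whose predicted option index matches the ground truth."""
    if len(items) != len(predicted_indices):
        raise ValueError("items and predictions must have the same length")
    correct = sum(item.answer_index == pred
                  for item, pred in zip(items, predicted_indices))
    return 100.0 * correct / len(items)
```

Because the correct option comes directly from the annotation, scoring reduces to an index comparison, which is the property the abstract relies on when it claims fairness without LLM-based grading.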
Code Repositories
bytedance/tarsier
pytorch
Mentioned in GitHub
opengvlab/ask-anything
Official
pytorch
Mentioned in GitHub
magic-research/PLLaVA
pytorch
Mentioned in GitHub
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| vcgbench-diverse-on-videoinstruct | VideoChat2 | Consistency: 2.27; Contextual Understanding: 2.51; Correctness of Information: 2.13; Dense Captioning: 1.26; Detail Orientation: 2.42; Reasoning: 3.13; Spatial Understanding: 2.43; Temporal Understanding: 1.66; mean: 2.20 |
| video-based-generative-performance | VideoChat2_HD_mistral | Consistency: 2.84; Contextual Understanding: 3.72; Correctness of Information: 3.40; Detail Orientation: 2.91; Temporal Understanding: 2.65; mean: 3.10 |
| video-based-generative-performance | VideoChat2 | Consistency: 2.81; Contextual Understanding: 3.51; Correctness of Information: 3.02; Detail Orientation: 2.88; Temporal Understanding: 2.66; mean: 2.98 |
| video-based-generative-performance-1 | VideoChat2 | gpt-score: 3.02 |
| video-based-generative-performance-1 | VideoChat2_HD_mistral | gpt-score: 3.40 |
| video-based-generative-performance-2 | VideoChat2 | gpt-score: 2.81 |
| video-based-generative-performance-2 | VideoChat2_HD_mistral | gpt-score: 2.62 |
| video-based-generative-performance-3 | VideoChat2_HD_mistral | gpt-score: 3.64 |
| video-based-generative-performance-3 | VideoChat2 | gpt-score: 3.51 |
| video-based-generative-performance-4 | VideoChat2 | gpt-score: 2.88 |
| video-based-generative-performance-4 | VideoChat2_HD_mistral | gpt-score: 2.86 |
| video-based-generative-performance-5 | VideoChat2 | gpt-score: 2.66 |
| video-based-generative-performance-5 | VideoChat2_HD_mistral | gpt-score: 2.65 |
| video-question-answering-on-activitynet-qa | VideoChat2 | Accuracy: 49.1; Confidence Score: 3.3 |
| video-question-answering-on-intentqa | VideoChat2_mistral | Accuracy: 81.9; CH: 86.9; CW: 82.6; TP&TN: 77.0 |
| video-question-answering-on-intentqa | VideoChat2_HD_mistral | Accuracy: 83.4; CH: 90.0; CW: 84.0; TP&TN: 77.3 |
| video-question-answering-on-mvbench | VideoChat2 | Avg.: 51.9 |
| video-question-answering-on-next-qa | VideoChat2_HD_mistral | Accuracy: 79.5 |
| video-question-answering-on-next-qa | VideoChat2_mistral | Accuracy: 78.6 |
| video-question-answering-on-next-qa | VideoChat2 | Accuracy: 68.6 |
| video-question-answering-on-tvbench | VideoChat2 | Average Accuracy: 35.0 |
| zero-shot-learning-on-tvqa | VideoChat2 | Accuracy: 40.6 |
| zero-shot-video-question-answer-on-egoschema | VideoChat2_HD_mistral | Accuracy: 65.6 |
| zero-shot-video-question-answer-on-egoschema | VideoChat2_mistral | Accuracy: 63.6 |
| zero-shot-video-question-answer-on-egoschema-1 | VideoChat2_mistral | Accuracy: 54.4 |
| zero-shot-video-question-answer-on-egoschema-1 | VideoChat2_HD_mistral | Accuracy: 55.8 |
| zero-shot-video-question-answer-on-egoschema-1 | VideoChat2_phi3 | Accuracy: 56.7 |
| zero-shot-video-question-answer-on-next-qa | VideoChat2 | Accuracy: 61.7 |
| zero-shot-video-question-answer-on-star | VideoChat2 | Accuracy: 59.0 |
| zero-shot-video-question-answer-on-tvqa | VideoChat2 (no speech) | Accuracy: 40.6 |
| zero-shot-video-question-answer-on-tvqa | VideoChat_HD_mistral (no speech) | Accuracy: 50.6 |
| zero-shot-video-question-answer-on-tvqa | VideoChat_mistral (no speech) | Accuracy: 46.4 |
| zeroshot-video-question-answer-on-activitynet | VideoChat2 | Accuracy: 49.1; Confidence Score: 3.3 |
| zeroshot-video-question-answer-on-msrvtt-qa | VideoChat2 | Accuracy: 54.1; Confidence Score: 3.3 |
| zeroshot-video-question-answer-on-msvd-qa | VideoChat2 | Accuracy: 70.0; Confidence Score: 3.9 |
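
The `mean` values in the table appear to be plain arithmetic averages of per-dimension scores. The short check below reproduces them from the numbers listed above. The helper name `mean_score` is hypothetical, and the choice of which dimensions enter the average is an assumption: for the vcgbench-diverse row, the reported 2.20 matches the average of the five dimensions shared with video-based-generative-performance, not the average of all eight listed scores.

```python
# The five dimensions assumed to enter the reported mean (an assumption,
# inferred from the numbers in the table rather than stated by the source).
CORE_DIMENSIONS = [
    "Consistency",
    "Contextual Understanding",
    "Correctness of Information",
    "Detail Orientation",
    "Temporal Understanding",
]


def mean_score(scores, dimensions=CORE_DIMENSIONS):
    """Average the selected dimensions and round to two decimals."""
    return round(sum(scores[d] for d in dimensions) / len(dimensions), 2)


# Scores copied from the table rows above.
videochat2 = {
    "Consistency": 2.81, "Contextual Understanding": 3.51,
    "Correctness of Information": 3.02, "Detail Orientation": 2.88,
    "Temporal Understanding": 2.66,
}
videochat2_hd_mistral = {
    "Consistency": 2.84, "Contextual Understanding": 3.72,
    "Correctness of Information": 3.40, "Detail Orientation": 2.91,
    "Temporal Understanding": 2.65,
}
videochat2_vcg_diverse = {
    "Consistency": 2.27, "Contextual Understanding": 2.51,
    "Correctness of Information": 2.13, "Dense Captioning": 1.26,
    "Detail Orientation": 2.42, "Reasoning": 3.13,
    "Spatial Understanding": 2.43, "Temporal Understanding": 1.66,
}

print(mean_score(videochat2))              # reported mean: 2.98
print(mean_score(videochat2_hd_mistral))   # reported mean: 3.10
print(mean_score(videochat2_vcg_diverse))  # reported mean: 2.20
```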