
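Abstract
Building on advances in language models, Large Multimodal Models (LMMs) have brought significant improvements in video understanding. Although current video LMMs use advanced Large Language Models (LLMs), they rely on either an image encoder or a video encoder to process visual inputs, and each has its own limitations. Image encoders excel at capturing rich spatial details from frame sequences but lack explicit temporal context, which matters in videos with intricate action sequences. Video encoders, on the other hand, provide temporal context but, because of computational constraints, typically process only sparse frames at lower resolution, which reduces contextual and spatial understanding. To this end, we introduce VideoGPT+, which combines the complementary strengths of the image encoder (for detailed spatial understanding) and the video encoder (for global temporal context modeling). The model processes a video by dividing it into smaller segments and applying an adaptive pooling strategy to the features extracted by both the image and video encoders. Our architecture shows improved performance across multiple video benchmarks, including VCGBench, MVBench, and zero-shot question answering. In addition, we build a video-instruction set of 112K samples using a novel semi-automatic annotation pipeline, which further improves model performance. To evaluate video LMMs comprehensively, we present VCGBench-Diverse, which covers 18 broad video categories such as lifestyle, sports, science, gaming, and surveillance. The benchmark contains 4,354 question-answer pairs and evaluates how well existing LMMs generalize on dense video captioning, spatial and temporal understanding, and complex reasoning, ensuring a comprehensive assessment across diverse video types and dynamics. Code: https://github.com/mbzuai-oryx/VideoGPT-plus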
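
The segment-wise processing and adaptive pooling described in the abstract can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the function names `split_into_segments` and `adaptive_avg_pool2d` are hypothetical, the feature grid holds one scalar per position instead of an embedding vector, and the segment count and pooled grid size are illustrative values that the abstract does not specify.

```python
from typing import List, Sequence


def split_into_segments(frames: Sequence, num_segments: int) -> List[list]:
    """Split frames into `num_segments` contiguous chunks of near-equal length."""
    n = len(frames)
    # Integer boundaries; chunk lengths differ by at most one frame.
    bounds = [i * n // num_segments for i in range(num_segments + 1)]
    return [list(frames[bounds[i]:bounds[i + 1]]) for i in range(num_segments)]


def _ceil_div(a: int, b: int) -> int:
    """Ceiling of a / b for positive integers, without floating point."""
    return -(-a // b)


def adaptive_avg_pool2d(grid: List[List[float]], out_h: int, out_w: int) -> List[List[float]]:
    """Average-pool a 2D grid down to (out_h, out_w), whatever the input size.

    Window i along an axis of length n spans floor(i*n/out) .. ceil((i+1)*n/out),
    the usual adaptive-pooling rule. One scalar per cell is an assumption made
    for brevity; real encoder features would be vectors pooled per channel.
    """
    in_h, in_w = len(grid), len(grid[0])
    pooled = []
    for i in range(out_h):
        r0, r1 = i * in_h // out_h, _ceil_div((i + 1) * in_h, out_h)
        row = []
        for j in range(out_w):
            c0, c1 = j * in_w // out_w, _ceil_div((j + 1) * in_w, out_w)
            window = [grid[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(window) / len(window))
        pooled.append(row)
    return pooled
```

For example, `split_into_segments(list(range(16)), 4)` yields four chunks of four frames each, and pooling a 4x4 grid holding the values 0 to 15 down to 2x2 gives `[[2.5, 4.5], [10.5, 12.5]]`. In a model along the lines the abstract describes, each segment's image-encoder and video-encoder features would be pooled this way to cut the number of visual tokens passed to the LLM.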
Code repository
mbzuai-oryx/videogpt-plus (official, PyTorch)
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| vcgbench-diverse-on-videoinstruct | VideoGPT+ | Consistency: 2.59; Contextual Understanding: 2.81; Correctness of Information: 2.46; Dense Captioning: 1.38; Detail Orientation: 2.73; Reasoning: 3.63; Spatial Understanding: 2.80; Temporal Understanding: 1.78; mean: 2.47 |
| video-based-generative-performance | VideoGPT+ | Consistency: 3.39; Contextual Understanding: 3.74; Correctness of Information: 3.27; Detail Orientation: 3.18; Temporal Understanding: 2.83; mean: 3.28 |
| video-based-generative-performance-1 | VideoGPT+ | gpt-score: 3.27 |
| video-based-generative-performance-2 | VideoGPT+ | gpt-score: 3.39 |
| video-based-generative-performance-3 | VideoGPT+ | gpt-score: 3.74 |
| video-based-generative-performance-4 | VideoGPT+ | gpt-score: 3.18 |
| video-based-generative-performance-5 | VideoGPT+ | gpt-score: 2.83 |
| video-question-answering-on-mvbench | VideoGPT+ | Avg.: 58.7 |
| video-question-answering-on-tvbench | VideoGPT+ | Average Accuracy: 41.7 |
| zeroshot-video-question-answer-on-activitynet | VideoGPT+ | Accuracy: 50.6; Confidence Score: 3.6 |
| zeroshot-video-question-answer-on-msrvtt-qa | VideoGPT+ | Accuracy: 60.6; Confidence Score: 3.6 |
| zeroshot-video-question-answer-on-msvd-qa | VideoGPT+ | Accuracy: 72.4; Confidence Score: 3.6 |
| zeroshot-video-question-answer-on-tgif-qa | VideoGPT+ | Accuracy: 74.6; Confidence Score: 4.1 |
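
A short arithmetic check shows how the `mean` values in the first two rows relate to the per-criterion scores. Judging only from the numbers in the table, the 2.47 reported for `vcgbench-diverse-on-videoinstruct` appears to be the average of the five criteria it shares with `video-based-generative-performance`, not of all eight listed scores. The helper name `mean_score` and the `CORE` list are illustrative, and this reading is an inference from the arithmetic rather than something the table states.

```python
def mean_score(scores: dict, keys: list) -> float:
    """Average the selected criteria and round to two decimals, as in the table."""
    return round(sum(scores[k] for k in keys) / len(keys), 2)


# Scores copied from the benchmark table above.
diverse = {
    "Consistency": 2.59,
    "Contextual Understanding": 2.81,
    "Correctness of Information": 2.46,
    "Dense Captioning": 1.38,
    "Detail Orientation": 2.73,
    "Reasoning": 3.63,
    "Spatial Understanding": 2.80,
    "Temporal Understanding": 1.78,
}
general = {
    "Consistency": 3.39,
    "Contextual Understanding": 3.74,
    "Correctness of Information": 3.27,
    "Detail Orientation": 3.18,
    "Temporal Understanding": 2.83,
}

# The five criteria present in both rows (assumed to be what `mean` averages).
CORE = list(general)

core_mean_diverse = mean_score(diverse, CORE)          # matches the reported 2.47
core_mean_general = mean_score(general, CORE)          # matches the reported 3.28
all_mean_diverse = mean_score(diverse, list(diverse))  # 2.52, differs from 2.47
```

Averaging all eight VCGBench-Diverse scores gives 2.52, so the dense captioning, reasoning, and spatial understanding scores seem to be reported alongside the mean rather than folded into it.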