
Abstract
In this paper, we make a first attempt at developing an end-to-end chat-centric video understanding system, named VideoChat. It integrates video foundation models and large language models via a learnable neural interface, excelling in spatiotemporal reasoning, event localization, and causal relationship inference. To instruction-tune this system, we build a video-centric instruction dataset composed of thousands of videos paired with detailed descriptions and conversations. The dataset emphasizes spatiotemporal reasoning and captures causal relationships, providing a valuable asset for training chat-centric video understanding systems. Preliminary qualitative experiments demonstrate the potential of our system across a broad spectrum of video applications, and it can serve as a simple prototype for future research on chat-centric video understanding. Code and data are available at https://github.com/OpenGVLab/Ask-Anything.
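To make the described architecture concrete, below is a minimal, hypothetical PyTorch sketch of what a learnable neural interface between a video foundation model and an LLM could look like: a small set of learnable query tokens cross-attends over frozen per-frame video features and is then projected into the LLM's token-embedding space, so the result can be prepended to the text prompt. The class name `VideoToLLMAdapter`, the dimensions, and the number of query tokens are illustrative assumptions, not the actual VideoChat implementation.

```python
# Hypothetical sketch of a learnable video-to-LLM interface.
# Not the official VideoChat code; dimensions and names are assumptions.
import torch
import torch.nn as nn


class VideoToLLMAdapter(nn.Module):
    """Compress variable-length video features into a fixed number of
    learnable query tokens, then project them into the LLM embedding space."""

    def __init__(self, video_dim=1024, llm_dim=4096, num_query_tokens=32, num_heads=8):
        super().__init__()
        # Learnable queries that attend over the (frozen) video features.
        self.query_tokens = nn.Parameter(torch.randn(num_query_tokens, video_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(video_dim, num_heads, batch_first=True)
        # Linear projection into the LLM's token-embedding space.
        self.proj = nn.Linear(video_dim, llm_dim)

    def forward(self, video_feats):
        # video_feats: (batch, num_frames * tokens_per_frame, video_dim)
        batch = video_feats.size(0)
        queries = self.query_tokens.unsqueeze(0).expand(batch, -1, -1)
        fused, _ = self.cross_attn(queries, video_feats, video_feats)
        # Output: (batch, num_query_tokens, llm_dim), ready to prepend to text embeddings.
        return self.proj(fused)


if __name__ == "__main__":
    adapter = VideoToLLMAdapter()
    frames = torch.randn(2, 8 * 16, 1024)  # e.g. 8 frames, 16 feature tokens per frame
    video_prompt = adapter(frames)
    print(video_prompt.shape)  # torch.Size([2, 32, 4096])
```

In such a design only the adapter (and optionally the LLM via lightweight tuning) needs to be trained on the video-centric instruction data, while the video encoder and language model can remain frozen.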
Code Repositories
opengvlab/ask-anything
Official
pytorch
Mentioned in GitHub
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| question-answering-on-next-qa-open-ended | VideoChat | Accuracy: 56.6, Confidence Score: 3.2 |
| video-based-generative-performance | VideoChat | Consistency: 2.24, Contextual Understanding: 2.53, Correctness of Information: 2.23, Detail Orientation: 2.50, Temporal Understanding: 1.94, Mean: 2.29 |
| video-based-generative-performance-1 | VideoChat | gpt-score: 2.32 |
| video-based-generative-performance-2 | VideoChat | gpt-score: 2.24 |
| video-based-generative-performance-3 | VideoChat | gpt-score: 2.53 |
| video-based-generative-performance-4 | VideoChat | gpt-score: 2.50 |
| video-based-generative-performance-5 | VideoChat | gpt-score: 1.94 |
| video-question-answering-on-activitynet-qa | VideoChat | Accuracy: 26.5, Confidence Score: 2.2 |
| video-question-answering-on-mvbench | VideoChat | Avg.: 35.5 |
| zeroshot-video-question-answer-on-activitynet | VideoChat | Accuracy: 26.5, Confidence Score: 2.2 |
| zeroshot-video-question-answer-on-msrvtt-qa | VideoChat-7B | Accuracy: 45.0, Confidence Score: 2.5 |
| zeroshot-video-question-answer-on-msvd-qa | VideoChat-7B | Accuracy: 56.3, Confidence Score: 2.8 |
| zeroshot-video-question-answer-on-tgif-qa | VideoChat-7B | Accuracy: 34.4, Confidence Score: 2.3 |