
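Abstract
Conversational agents powered by large language models (LLMs) offer a new way to interact with visual data. Although there have been initial attempts at image-based conversation models, this work addresses the under-explored field of video-based conversation by introducing Video-ChatGPT. Video-ChatGPT is a multimodal model that combines a video-adapted visual encoder with an LLM. The resulting model can understand and generate detailed conversations about videos. We introduce a new dataset of 100,000 video-instruction pairs, acquired through a manual and semi-automated pipeline that is easily scalable and robust to label noise. We also develop a quantitative evaluation framework to objectively analyze the strengths and weaknesses of video-based dialogue models. Code: https://github.com/mbzuai-oryx/Video-ChatGPT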
Code Repositories
mbzuai-oryx/video-chatgpt (official, PyTorch, mentioned on GitHub)
qiujihao19/artemis (PyTorch, mentioned on GitHub)
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| question-answering-on-next-qa-open-ended | Video-ChatGPT | Accuracy: 54.6; Confidence Score: 3.2 |
| vcgbench-diverse-on-videoinstruct | Video-ChatGPT | Consistency: 2.06; Contextual Understanding: 2.46; Correctness of Information: 2.07; Dense Captioning: 0.89; Detail Orientation: 2.42; Reasoning: 3.60; Spatial Understanding: 2.25; Temporal Understanding: 1.39; mean: 2.08 |
| video-based-generative-performance | Video-ChatGPT | Consistency: 2.37; Contextual Understanding: 2.62; Correctness of Information: 2.4; Detail Orientation: 2.52; Temporal Understanding: 1.98; mean: 2.38 |
| video-based-generative-performance-1 | Video-ChatGPT | gpt-score: 2.40 |
| video-based-generative-performance-2 | Video-ChatGPT | gpt-score: 2.37 |
| video-based-generative-performance-3 | Video-ChatGPT | gpt-score: 2.62 |
| video-based-generative-performance-4 | Video-ChatGPT | gpt-score: 2.52 |
| video-based-generative-performance-5 | Video-ChatGPT | gpt-score: 1.98 |
| video-question-answering-on-activitynet-qa | Video-ChatGPT | Accuracy: 35.2; Confidence Score: 2.7 |
| video-question-answering-on-mvbench | Video-ChatGPT | Avg.: 32.7 |
| zeroshot-video-question-answer-on-activitynet | Video-ChatGPT | Accuracy: 35.2; Confidence Score: 2.7 |
| zeroshot-video-question-answer-on-msrvtt-qa | Video-ChatGPT-7B | Accuracy: 49.3; Confidence Score: 2.8 |
| zeroshot-video-question-answer-on-msvd-qa | Video-ChatGPT-7B | Accuracy: 64.9; Confidence Score: 3.3 |
| zeroshot-video-question-answer-on-tgif-qa | Video-ChatGPT-7B | Accuracy: 51.4; Confidence Score: 3.0 |
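
The "mean" entries in the table can be sanity-checked with a few lines of arithmetic. The sketch below assumes the reported mean is the unweighted arithmetic mean of per-dimension scores, rounded to two decimals. The page itself does not state how the mean is computed, so this is an assumption. The helper name `mean_score` and the dictionary names are illustrative only. For the vcgbench-diverse row, averaging the same five dimensions that appear in the video-based-generative-performance row appears to reproduce the listed value; which dimensions the benchmark officially averages is likewise an assumption.

```python
from statistics import fmean

# Scores copied from the video-based-generative-performance row.
GENERATIVE_SCORES = {
    "Consistency": 2.37,
    "Contextual Understanding": 2.62,
    "Correctness of Information": 2.4,
    "Detail Orientation": 2.52,
    "Temporal Understanding": 1.98,
}

# The same five dimensions taken from the vcgbench-diverse row.
# Assumption: Dense Captioning, Reasoning and Spatial Understanding
# are excluded from that row's reported mean.
DIVERSE_CORE_SCORES = {
    "Consistency": 2.06,
    "Contextual Understanding": 2.46,
    "Correctness of Information": 2.07,
    "Detail Orientation": 2.42,
    "Temporal Understanding": 1.39,
}


def mean_score(scores: dict[str, float]) -> float:
    """Unweighted arithmetic mean of the scores, rounded to two decimals."""
    return round(fmean(scores.values()), 2)


if __name__ == "__main__":
    print("generative mean:", mean_score(GENERATIVE_SCORES))
    print("diverse (five core dimensions) mean:", mean_score(DIVERSE_CORE_SCORES))
```

Under these assumptions, both computed values agree with the means listed in the table (2.38 and 2.08).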