
Abstract
Recently, integrating video foundation models with large language models to build video understanding systems has made it possible to move beyond the limitations of specific, predefined vision tasks. However, existing systems can only handle videos with very few frames. For long videos, computational complexity, memory cost, and long-range temporal connections pose additional challenges. Drawing on the Atkinson-Shiffrin memory model, we treat the tokens in Transformers as memory carriers and, combined with our specially designed memory mechanism, propose MovieChat to address these challenges. MovieChat achieves state-of-the-art performance on long video understanding, and we release the MovieChat-1K benchmark, containing 1,000 long videos and 14,000 manual annotations, to validate the effectiveness of our method.
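The two-stage, token-based memory described above can be sketched roughly as follows. This is a minimal, hypothetical Python illustration (class and parameter names are my own, not from the paper): a short-term FIFO buffer of frame tokens whose evicted tokens are consolidated into a fixed-size long-term memory by averaging the most similar adjacent pair, so the memory footprint stays bounded regardless of video length.

```python
import numpy as np

def cosine(a, b):
    # cosine similarity between two token vectors
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

class MemoryBuffer:
    """Hypothetical sketch of a token memory inspired by the
    Atkinson-Shiffrin model: a short-term FIFO buffer plus a
    capacity-bounded long-term store."""

    def __init__(self, short_cap=8, long_cap=16):
        self.short_cap = short_cap  # short-term memory capacity
        self.long_cap = long_cap    # long-term memory capacity
        self.short = []             # recent frame tokens (FIFO)
        self.long = []              # consolidated tokens

    def add(self, token):
        # new frame tokens enter short-term memory first
        self.short.append(token)
        if len(self.short) > self.short_cap:
            # oldest token overflows into long-term memory
            self._consolidate(self.short.pop(0))

    def _consolidate(self, token):
        self.long.append(token)
        while len(self.long) > self.long_cap:
            # merge the most similar adjacent pair of tokens,
            # keeping the long-term memory at a fixed size
            sims = [cosine(self.long[i], self.long[i + 1])
                    for i in range(len(self.long) - 1)]
            i = int(np.argmax(sims))
            merged = (self.long[i] + self.long[i + 1]) / 2
            self.long[i:i + 2] = [merged]
```

Under this scheme, the cost of answering a question about an hour-long video depends on the fixed memory capacities, not on the number of frames processed.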
Code Repositories
rese1f/MovieChat
Official
pytorch
Mentioned in GitHub
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| question-answering-on-next-qa-open-ended | MovieChat | Accuracy: 49.9; Confidence Score: 2.7 |
| video-based-generative-performance-1 | MovieChat | GPT score: 2.76 |
| video-based-generative-performance-2 | MovieChat | GPT score: 2.42 |
| video-based-generative-performance-3 | MovieChat | GPT score: 3.01 |
| video-based-generative-performance-4 | MovieChat | GPT score: 2.93 |
| video-based-generative-performance-5 | MovieChat | GPT score: 2.24 |
| video-question-answering-on-activitynet-qa | MovieChat | Accuracy: 45.7; Confidence Score: 3.1 |
| zeroshot-video-question-answer-on-activitynet | MovieChat | Accuracy: 45.7; Confidence Score: 3.1 |
| zeroshot-video-question-answer-on-msrvtt-qa | MovieChat | Accuracy: 52.7; Confidence Score: 2.6 |
| zeroshot-video-question-answer-on-msvd-qa | MovieChat | Accuracy: 75.2; Confidence Score: 2.9 |