4 months ago

InternVideo2: Scaling Foundation Models for Multimodal Video Understanding

Abstract

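We introduce InternVideo2, a new family of video foundation models (ViFM) that achieves state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue. Our core design is a progressive training approach that unifies masked video modeling, cross-modal contrastive learning, and next-token prediction, scaling the video encoder up to 6B parameters. At the data level, we prioritize spatiotemporal consistency by semantically segmenting videos and generating video-audio-speech captions, which improves the alignment between video and text. Through extensive experiments, we validate our design and demonstrate superior performance on more than 60 video and audio tasks. Notably, our model outperforms others on a range of video-related dialogue and long-video understanding benchmarks, highlighting its ability to reason over and comprehend longer contexts. Code and models are available at https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2/.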
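To make the cross-modal contrastive learning stage mentioned in the abstract more concrete, here is a minimal sketch of a generic symmetric contrastive (InfoNCE-style) loss over a video-text similarity matrix. It is an illustration only, not InternVideo2's actual objective: the function name `symmetric_contrastive_loss`, the temperature value 0.07, and the one-to-one pairing of video i with text i are all assumptions made for this example.

```python
import math


def symmetric_contrastive_loss(sim, temperature=0.07):
    """Generic symmetric contrastive loss (illustrative assumption).

    sim[i][j] is the similarity between video i and text j; the matching
    pair for video i is assumed to be text i (the diagonal).
    """

    def mean_cross_entropy(matrix):
        total = 0.0
        for i, row in enumerate(matrix):
            logits = [s / temperature for s in row]
            # Numerically stable log-sum-exp over the row
            peak = max(logits)
            log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
            # Negative log-probability of the matching (diagonal) entry
            total += log_sum - logits[i]
        return total / len(matrix)

    # Video-to-text uses the rows; text-to-video uses the columns
    transposed = [list(col) for col in zip(*sim)]
    return 0.5 * (mean_cross_entropy(sim) + mean_cross_entropy(transposed))


# Toy similarity matrices (made-up numbers)
aligned = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
uninformative = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

loss_aligned = symmetric_contrastive_loss(aligned)
loss_uninformative = symmetric_contrastive_loss(uninformative)
```

With an uninformative (constant) similarity matrix, the loss equals the log of the batch size; a well-aligned matrix, where matched pairs score highest, drives it toward zero.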

Code Repositories

opengvlab/internvideo (official, PyTorch; mentioned in GitHub)
opengvlab/internvideo2 (official, PyTorch; mentioned in GitHub)

Benchmarks

Benchmark | Method | Metrics
action-classification-on-kinetics-400 | InternVideo2-1B | Acc@1: 91.6
action-classification-on-kinetics-400 | InternVideo2-6B | Acc@1: 92.1
action-classification-on-kinetics-600 | InternVideo2-1B | Top-1 Accuracy: 91.6
action-classification-on-kinetics-600 | InternVideo2-6B | Top-1 Accuracy: 91.9
action-classification-on-kinetics-700 | InternVideo2-1B | Top-1 Accuracy: 85.4
action-classification-on-kinetics-700 | InternVideo2-6B | Top-1 Accuracy: 85.9
action-classification-on-mit | InternVideo2-6B | Top 1 Accuracy: 51.2
action-classification-on-moments-in-time | InternVideo2-1B | Top 1 Accuracy: 50.9
action-recognition-in-videos-on-activitynet | InternVideo2-6B | mAP: 95.9
action-recognition-in-videos-on-something | InternVideo2-6B | GFLOPs: 13321; Parameters: 2131; Top-1 Accuracy: 1; Top-5 Accuracy: 12
action-recognition-in-videos-on-something | InternVideo2-1B | Top-1 Accuracy: 77.1
action-recognition-on-hacs | InternVideo2-6B | Top 1 Accuracy: 97.0
audio-classification-on-esc-50 | InternVideo2 | Accuracy (5-fold): 98.6; Pre-training dataset: Multiple; Top-1 Accuracy: 98.6
moment-retrieval-on-charades-sta | InternVideo2-6B | R@1 IoU=0.5: 70.03; R@1 IoU=0.7: 48.95
moment-retrieval-on-charades-sta | InternVideo2-1B | R@1 IoU=0.5: 68.36; R@1 IoU=0.7: 45.03
moment-retrieval-on-qvhighlights | InternVideo2-6B | R@1 IoU=0.5: 71.42; R@1 IoU=0.7: 56.45; mAP: 49.24
temporal-action-localization-on-activitynet | InternVideo2-6B | mAP: 41.2
temporal-action-localization-on-activitynet | InternVideo2-1B | mAP: 40.4
temporal-action-localization-on-fineaction | InternVideo2-6B | mAP: 27.7
temporal-action-localization-on-hacs | InternVideo2-1B | Average-mAP: 42.4
temporal-action-localization-on-hacs | InternVideo2-6B | Average-mAP: 43.3
temporal-action-localization-on-thumos14 | InternVideo2-1B | Avg mAP (0.3:0.7): 69.8
temporal-action-localization-on-thumos14 | InternVideo2-6B | Avg mAP (0.3:0.7): 72.0
video-grounding-on-qvhighlights | InternVideo2-6B | R@1,IoU=0.5: 71.42; R@1,IoU=0.7: 56.45
video-grounding-on-qvhighlights | InternVideo2-1B | R@1,IoU=0.5: 70.00; R@1,IoU=0.7: 54.45
video-question-answering-on-mvbench | InternVideo2 | Avg.: 67.2
video-question-answering-on-perception-test | InternVideo2 (8B) | Accuracy (Top-1): 63.4
video-retrieval-on-activitynet | InternVideo2-6B | text-to-video R@1: 74.1; video-to-text R@1: 69.7
video-retrieval-on-didemo | InternVideo2-6B | text-to-video R@1: 74.2; video-to-text R@1: 71.9
video-retrieval-on-lsmdc | InternVideo2-6B | text-to-video R@1: 46.4; video-to-text R@1: 46.7
video-retrieval-on-msr-vtt | InternVideo2-6B | text-to-video R@1: 62.8; video-to-text R@1: 60.2
video-retrieval-on-msvd | InternVideo2-6B | text-to-video R@1: 61.4; video-to-text R@1: 85.2
video-retrieval-on-vatex | InternVideo2-6B | text-to-video R@1: 75.5; video-to-text R@1: 89.3
zero-shot-video-question-answer-on-egoschema-1 | InternVideo2-6B | Accuracy: 60.2
zero-shot-video-question-answer-on-mvbench | InternVideo2-1B | Accuracy: 60.9
zero-shot-video-retrieval-on-activitynet | InternVideo2-1B | text-to-video R@1: 60.4, R@5: 83.9, R@10: 90.8; video-to-text R@1: 54.8, R@5: 81.5, R@10: 89.5
zero-shot-video-retrieval-on-activitynet | InternVideo2-6B | text-to-video R@1: 63.2, R@5: 85.6, R@10: 92.5; video-to-text R@1: 56.5, R@5: 82.8, R@10: 90.3
zero-shot-video-retrieval-on-didemo | InternVideo2-6B | text-to-video R@1: 57.9, R@5: 80.0, R@10: 84.6; video-to-text R@1: 57.1, R@5: 79.9, R@10: 85.0
zero-shot-video-retrieval-on-didemo | InternVideo2-1B | text-to-video R@1: 57.0, R@5: 80.0, R@10: 85.1; video-to-text R@1: 54.3, R@5: 77.2, R@10: 83.5
zero-shot-video-retrieval-on-lsmdc | InternVideo2-6B | text-to-video R@1: 33.8, R@5: 55.9, R@10: 62.2; video-to-text R@1: 30.1, R@5: 47.7, R@10: 54.8
zero-shot-video-retrieval-on-lsmdc | InternVideo2-1B | text-to-video R@1: 32.0, R@5: 52.4, R@10: 59.4; video-to-text R@1: 27.3, R@5: 44.2, R@10: 51.6
zero-shot-video-retrieval-on-msr-vtt | InternVideo2-1B | text-to-video R@1: 51.9, R@5: 75.3, R@10: 82.5; video-to-text R@1: 50.9, R@5: 73.4, R@10: 81.8
zero-shot-video-retrieval-on-msr-vtt | InternVideo2-6B | text-to-video R@1: 55.9, R@5: 78.3, R@10: 85.1; video-to-text R@1: 53.7, R@5: 77.5, R@10: 84.1
zero-shot-video-retrieval-on-msvd | InternVideo2-1B | text-to-video R@1: 58.1, R@5: 83.0, R@10: 88.4; video-to-text R@1: 83.3, R@5: 94.3, R@10: 96.9
zero-shot-video-retrieval-on-msvd | InternVideo2-6B | text-to-video R@1: 59.3, R@5: 84.4, R@10: 89.6; video-to-text R@1: 83.1, R@5: 94.2, R@10: 97.0
zero-shot-video-retrieval-on-vatex | InternVideo2-6B | text-to-video R@1: 71.5, R@5: 94.0, R@10: 97.1; video-to-text R@1: 85.3, R@5: 97.9, R@10: 99.3
zero-shot-video-retrieval-on-vatex | InternVideo2-1B | text-to-video R@1: 70.4, R@5: 93.4, R@10: 96.9; video-to-text R@1: 85.4, R@5: 97.6, R@10: 99.1
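
The retrieval rows above report R@1, R@5, and R@10 (Recall@K): the percentage of queries whose correct match appears among the top K ranked candidates. The sketch below shows one common way to compute it from a query-by-candidate similarity matrix. The helper name `recall_at_k`, the toy scores, and the assumption that query i has exactly one correct candidate (index i) are illustrative; individual benchmarks may use different protocols, for example several captions per video.

```python
def recall_at_k(similarity, k):
    """Percentage of queries whose ground-truth candidate ranks in the top k.

    similarity[i][j] is the score between query i and candidate j.
    Assumption: the only correct candidate for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted by descending similarity, truncated to k
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        if i in top_k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy text-to-video similarity matrix (made-up numbers): rows are text queries
text_to_video = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.7],
]
# Video-to-text retrieval ranks along the other axis, so transpose the matrix
video_to_text = [list(col) for col in zip(*text_to_video)]

t2v_r1 = recall_at_k(text_to_video, 1)
t2v_r2 = recall_at_k(text_to_video, 2)
v2t_r1 = recall_at_k(video_to_text, 1)
```

Because the two directions rank along different axes of the same matrix, text-to-video and video-to-text scores can differ, as they do in the table.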
