Yi Wang; Kunchang Li; Xinhao Li; Jiashuo Yu; Yinan He; Chenting Wang; Guo Chen; Baoqi Pei; Ziang Yan; Rongkun Zheng; Jilan Xu; Zun Wang; Yansong Shi; Tianxiang Jiang; Songze Li; Hongjie Zhang; Yifei Huang; Yu Qiao; Yali Wang; Limin Wang

Abstract
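We introduce InternVideo2, a new family of video foundation models (ViFM) that achieves state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue. Our core design is a progressive training approach that unifies masked video modeling, cross-modal contrastive learning, and next-token prediction, scaling the video encoder up to 6B parameters. At the data level, we prioritize spatiotemporal consistency by semantically segmenting videos and generating video-audio-speech captions, which improves the alignment between video and text. Through extensive experiments, we validate our designs and demonstrate superior performance on more than 60 video and audio tasks. Notably, our model outperforms others on a range of video-related dialogue and long-video understanding benchmarks, highlighting its ability to reason over and comprehend longer contexts. Code and models are available at https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2/.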
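
The abstract names cross-modal contrastive learning as one of the three objectives but gives no formula. As a rough illustration only, the sketch below shows a generic symmetric InfoNCE-style loss over a video-text similarity matrix. The function names, the temperature value of 0.07, and the assumption that pair (i, i) is the only positive are all hypothetical choices made for this example; they are not taken from the InternVideo2 code.

```python
import math


def _mean_diag_nll(sim, temperature):
    """Mean negative log-softmax of the diagonal (matching) entry of each row."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]
    return total / len(sim)


def symmetric_contrastive_loss(sim, temperature=0.07):
    """Generic symmetric contrastive loss (assumed form, not the paper's code).

    sim[i][j] is the similarity between video i and text j;
    video i and text i are assumed to be the only matching pair.
    """
    transposed = [list(col) for col in zip(*sim)]
    video_to_text = _mean_diag_nll(sim, temperature)
    text_to_video = _mean_diag_nll(transposed, temperature)
    return 0.5 * (video_to_text + text_to_video)


if __name__ == "__main__":
    aligned = [[1.0, 0.0], [0.0, 1.0]]
    swapped = [[0.0, 1.0], [1.0, 0.0]]
    print(symmetric_contrastive_loss(aligned))  # close to zero
    print(symmetric_contrastive_loss(swapped))  # much larger
```

With an uninformative similarity matrix (all entries equal), this loss reduces to the log of the batch size, which is a handy sanity check when experimenting with it.
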
Code Repositories
opengvlab/internvideo
Official
PyTorch
Mentioned in GitHub
opengvlab/internvideo2
Official
PyTorch
Mentioned in GitHub
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| action-classification-on-kinetics-400 | InternVideo2-1B | Acc@1: 91.6 |
| action-classification-on-kinetics-400 | InternVideo2-6B | Acc@1: 92.1 |
| action-classification-on-kinetics-600 | InternVideo2-1B | Top-1 Accuracy: 91.6 |
| action-classification-on-kinetics-600 | InternVideo2-6B | Top-1 Accuracy: 91.9 |
| action-classification-on-kinetics-700 | InternVideo2-1B | Top-1 Accuracy: 85.4 |
| action-classification-on-kinetics-700 | InternVideo2-6B | Top-1 Accuracy: 85.9 |
| action-classification-on-mit | InternVideo2-6B | Top 1 Accuracy: 51.2 |
| action-classification-on-moments-in-time | InternVideo2-1B | Top 1 Accuracy: 50.9 |
| action-recognition-in-videos-on-activitynet | InternVideo2-6B | mAP: 95.9 |
| action-recognition-in-videos-on-something | InternVideo2-6B | GFLOPs: 13321; Parameters: 2131; Top-1 Accuracy: 1; Top-5 Accuracy: 12 |
| action-recognition-in-videos-on-something | InternVideo2-1B | Top-1 Accuracy: 77.1 |
| action-recognition-on-hacs | InternVideo2-6B | Top 1 Accuracy: 97.0 |
| audio-classification-on-esc-50 | InternVideo2 | Accuracy (5-fold): 98.6; Pre-training dataset: Multiple; Top-1 Accuracy: 98.6 |
| moment-retrieval-on-charades-sta | InternVideo2-6B | R@1 IoU=0.5: 70.03; R@1 IoU=0.7: 48.95 |
| moment-retrieval-on-charades-sta | InternVideo2-1B | R@1 IoU=0.5: 68.36; R@1 IoU=0.7: 45.03 |
| moment-retrieval-on-qvhighlights | InternVideo2-6B | R@1 IoU=0.5: 71.42; R@1 IoU=0.7: 56.45; mAP: 49.24 |
| temporal-action-localization-on-activitynet | InternVideo2-6B | mAP: 41.2 |
| temporal-action-localization-on-activitynet | InternVideo2-1B | mAP: 40.4 |
| temporal-action-localization-on-fineaction | InternVideo2-6B | mAP: 27.7 |
| temporal-action-localization-on-hacs | InternVideo2-1B | Average-mAP: 42.4 |
| temporal-action-localization-on-hacs | InternVideo2-6B | Average-mAP: 43.3 |
| temporal-action-localization-on-thumos14 | InternVideo2-1B | Avg mAP (0.3:0.7): 69.8 |
| temporal-action-localization-on-thumos14 | InternVideo2-6B | Avg mAP (0.3:0.7): 72.0 |
| video-grounding-on-qvhighlights | InternVideo2-6B | R@1 IoU=0.5: 71.42; R@1 IoU=0.7: 56.45 |
| video-grounding-on-qvhighlights | InternVideo2-1B | R@1 IoU=0.5: 70.00; R@1 IoU=0.7: 54.45 |
| video-question-answering-on-mvbench | InternVideo2 | Avg.: 67.2 |
| video-question-answering-on-perception-test | InternVideo2 (8B) | Accuracy (Top-1): 63.4 |
| video-retrieval-on-activitynet | InternVideo2-6B | text-to-video R@1: 74.1; video-to-text R@1: 69.7 |
| video-retrieval-on-didemo | InternVideo2-6B | text-to-video R@1: 74.2; video-to-text R@1: 71.9 |
| video-retrieval-on-lsmdc | InternVideo2-6B | text-to-video R@1: 46.4; video-to-text R@1: 46.7 |
| video-retrieval-on-msr-vtt | InternVideo2-6B | text-to-video R@1: 62.8; video-to-text R@1: 60.2 |
| video-retrieval-on-msvd | InternVideo2-6B | text-to-video R@1: 61.4; video-to-text R@1: 85.2 |
| video-retrieval-on-vatex | InternVideo2-6B | text-to-video R@1: 75.5; video-to-text R@1: 89.3 |
| zero-shot-video-question-answer-on-egoschema-1 | InternVideo2-6B | Accuracy: 60.2 |
| zero-shot-video-question-answer-on-mvbench | InternVideo2-1B | Accuracy: 60.9 |
| zero-shot-video-retrieval-on-activitynet | InternVideo2-1B | text-to-video R@1: 60.4; text-to-video R@5: 83.9; text-to-video R@10: 90.8; video-to-text R@1: 54.8; video-to-text R@5: 81.5; video-to-text R@10: 89.5 |
| zero-shot-video-retrieval-on-activitynet | InternVideo2-6B | text-to-video R@1: 63.2; text-to-video R@5: 85.6; text-to-video R@10: 92.5; video-to-text R@1: 56.5; video-to-text R@5: 82.8; video-to-text R@10: 90.3 |
| zero-shot-video-retrieval-on-didemo | InternVideo2-6B | text-to-video R@1: 57.9; text-to-video R@5: 80.0; text-to-video R@10: 84.6; video-to-text R@1: 57.1; video-to-text R@5: 79.9; video-to-text R@10: 85.0 |
| zero-shot-video-retrieval-on-didemo | InternVideo2-1B | text-to-video R@1: 57.0; text-to-video R@5: 80.0; text-to-video R@10: 85.1; video-to-text R@1: 54.3; video-to-text R@5: 77.2; video-to-text R@10: 83.5 |
| zero-shot-video-retrieval-on-lsmdc | InternVideo2-6B | text-to-video R@1: 33.8; text-to-video R@5: 55.9; text-to-video R@10: 62.2; video-to-text R@1: 30.1; video-to-text R@5: 47.7; video-to-text R@10: 54.8 |
| zero-shot-video-retrieval-on-lsmdc | InternVideo2-1B | text-to-video R@1: 32.0; text-to-video R@5: 52.4; text-to-video R@10: 59.4; video-to-text R@1: 27.3; video-to-text R@5: 44.2; video-to-text R@10: 51.6 |
| zero-shot-video-retrieval-on-msr-vtt | InternVideo2-1B | text-to-video R@1: 51.9; text-to-video R@5: 75.3; text-to-video R@10: 82.5; video-to-text R@1: 50.9; video-to-text R@5: 73.4; video-to-text R@10: 81.8 |
| zero-shot-video-retrieval-on-msr-vtt | InternVideo2-6B | text-to-video R@1: 55.9; text-to-video R@5: 78.3; text-to-video R@10: 85.1; video-to-text R@1: 53.7; video-to-text R@5: 77.5; video-to-text R@10: 84.1 |
| zero-shot-video-retrieval-on-msvd | InternVideo2-1B | text-to-video R@1: 58.1; text-to-video R@5: 83.0; text-to-video R@10: 88.4; video-to-text R@1: 83.3; video-to-text R@5: 94.3; video-to-text R@10: 96.9 |
| zero-shot-video-retrieval-on-msvd | InternVideo2-6B | text-to-video R@1: 59.3; text-to-video R@5: 84.4; text-to-video R@10: 89.6; video-to-text R@1: 83.1; video-to-text R@5: 94.2; video-to-text R@10: 97.0 |
| zero-shot-video-retrieval-on-vatex | InternVideo2-6B | text-to-video R@1: 71.5; text-to-video R@5: 94.0; text-to-video R@10: 97.1; video-to-text R@1: 85.3; video-to-text R@5: 97.9; video-to-text R@10: 99.3 |
| zero-shot-video-retrieval-on-vatex | InternVideo2-1B | text-to-video R@1: 70.4; text-to-video R@5: 93.4; text-to-video R@10: 96.9; video-to-text R@1: 85.4; video-to-text R@5: 97.6; video-to-text R@10: 99.1 |
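
The retrieval rows above report Recall@K (R@1, R@5, R@10) in both directions. For readers unfamiliar with the metric, the following is a minimal sketch of how Recall@K is commonly computed from a query-candidate similarity matrix. It assumes a one-to-one pairing in which candidate i is the only correct match for query i; benchmarks with several captions per video need extra bookkeeping, and the function name `recall_at_k` is a hypothetical helper, not part of the InternVideo2 evaluation code.

```python
def recall_at_k(sim, k):
    """Percentage of queries whose correct candidate ranks in the top k.

    sim[i][j] is the score of candidate j for query i.
    Assumption: candidate i is the single correct match for query i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Zero-based rank = number of candidates scoring strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)


if __name__ == "__main__":
    # Rows are text queries, columns are videos (text-to-video direction).
    sim = [
        [0.9, 0.1, 0.2],
        [0.8, 0.3, 0.5],
        [0.2, 0.4, 0.6],
    ]
    print(recall_at_k(sim, 1))
    print(recall_at_k(sim, 3))

    # Transposing the matrix gives the video-to-text direction.
    sim_v2t = [list(col) for col in zip(*sim)]
    print(recall_at_k(sim_v2t, 2))
```

Ties are resolved in favor of the correct candidate here, which is one possible convention; evaluation scripts differ on this point.
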