
Abstract
Video Foundation Models (VFMs) have received limited exploration due to high computational costs and data scarcity. Previous VFMs rely on Image Foundation Models (IFMs), which face challenges in transferring to the video domain. Although VideoMAE has trained a robust Vision Transformer (ViT) from limited data, its low-level reconstruction leads to convergence difficulties and conflicts with high-level cross-modal alignment. This paper proposes a training-efficient method for temporal-sensitive VFMs that integrates the benefits of existing approaches. To improve data efficiency, we mask out most of the low-semantics video tokens, but selectively align the unmasked tokens with the IFM, which serves as the UnMasked Teacher (UMT). By providing semantic guidance, our method enables faster convergence and multimodal friendliness. With a progressive pre-training framework, our model can handle various tasks, including scene-related, temporal-related, and complex video-language understanding. Using only public sources and 6 days of pre-training on 32 A100 GPUs, our scratch-built ViT-L/16 achieves state-of-the-art performance on a wide range of video tasks. The code and models will be released at https://github.com/OpenGVLab/unmasked_teacher.
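The core of the recipe described above is to encode only the small fraction of visible video tokens with the student ViT and align those features with the corresponding token features produced by a frozen image foundation model that sees the full input. The snippet below is a minimal PyTorch-style sketch of that masked alignment step; `student`, `teacher`, `proj`, the uniform random masking, and the normalized MSE objective are illustrative assumptions, not the paper's exact masking strategy or loss.

```python
import torch
import torch.nn.functional as F

def umt_alignment_loss(video_tokens, student, teacher, proj, keep_ratio=0.2):
    """Sketch of masked token alignment with an unmasked teacher.

    video_tokens: (B, N, C) patch/tube embeddings of a video clip.
    student:      ViT encoder trained from scratch; sees only the kept tokens.
    teacher:      frozen image foundation model (e.g. a CLIP-style ViT); sees all tokens.
    proj:         linear head mapping student features to the teacher's dimension.
    keep_ratio:   fraction of tokens left unmasked (most tokens are masked out).
    """
    B, N, C = video_tokens.shape
    n_keep = max(1, int(N * keep_ratio))

    # Randomly pick which tokens stay visible. The paper uses a semantics-aware
    # masking strategy; uniform random sampling here is a simplification.
    keep_idx = torch.rand(B, N, device=video_tokens.device).argsort(dim=1)[:, :n_keep]
    visible = torch.gather(
        video_tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, C))

    # Student encodes only the unmasked tokens (sparse forward pass keeps training cheap).
    student_feat = proj(student(visible))                       # (B, n_keep, D_teacher)

    # Frozen teacher encodes the full sequence; no gradients needed.
    with torch.no_grad():
        teacher_feat = teacher(video_tokens)                    # (B, N, D_teacher)
    teacher_visible = torch.gather(
        teacher_feat, 1,
        keep_idx.unsqueeze(-1).expand(-1, -1, teacher_feat.size(-1)))

    # Align the unmasked-token features (L2-normalized MSE as one possible choice).
    return F.mse_loss(F.normalize(student_feat, dim=-1),
                      F.normalize(teacher_visible, dim=-1))
```

Because the student only processes the kept tokens while the teacher supplies high-level semantic targets, pre-training converges faster than low-level pixel reconstruction and stays compatible with later cross-modal alignment.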
Code Repository
opengvlab/unmasked_teacher (official, PyTorch; mentioned in GitHub)
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| action-classification-on-kinetics-400 | UMT-L (ViT-L/16) | Acc@1: 90.6, Acc@5: 98.7 |
| action-classification-on-kinetics-400 | Unmasked Teacher (ViT-L) | Acc@1: 90.6, Acc@5: 98.7, FLOPs (G) × views: 1434×3×4, Parameters (M): 304 |
| action-classification-on-kinetics-600 | UMT-L (ViT-L/16) | Top-1 Accuracy: 90.5, Top-5 Accuracy: 98.8 |
| action-classification-on-kinetics-700 | UMT-L (ViT-L/16) | Top-1 Accuracy: 83.6, Top-5 Accuracy: 96.7 |
| action-classification-on-moments-in-time | UMT-L (ViT-L/16) | Top-1 Accuracy: 48.7, Top-5 Accuracy: 78.2 |
| action-recognition-on-ava-v2-2 | UMT-L (ViT-L/16) | mAP: 39.8 |
| video-question-answering-on-activitynet-qa | UMT-L (ViT-L/16) | Accuracy: 47.9 |
| video-retrieval-on-activitynet | UMT-L (ViT-L/16) | text-to-video: R@1 66.8, R@5 89.1, R@10 94.9; video-to-text: R@1 64.4, R@5 89.1, R@10 94.8 |
| video-retrieval-on-didemo | UMT-L (ViT-L/16) | text-to-video: R@1 70.4, R@5 90.1, R@10 93.5; video-to-text: R@1 65.7, R@5 89.6, R@10 93.3 |
| video-retrieval-on-lsmdc | UMT-L (ViT-L/16) | text-to-video: R@1 43.0, R@5 65.5, R@10 73.0; video-to-text: R@1 41.4, R@5 64.3, R@10 71.5 |
| video-retrieval-on-msr-vtt | UMT-L (ViT-L/16) | text-to-video: R@1 58.8, R@5 81.0, R@10 87.1; video-to-text: R@1 58.6, R@5 81.6, R@10 86.5 |
| video-retrieval-on-ssv2-label-retrieval | UMT-L (ViT-L/16) | text-to-video: R@1 73.3, R@5 92.7, R@10 96.6 |
| video-retrieval-on-ssv2-template-retrieval | UMT-L (ViT-L/16) | text-to-video: R@1 90.8, R@5 100.0, R@10 100.0 |
| video-retrieval-on-vatex | Unmasked Teacher | text-to-video: R@1 72, R@5 95.1, R@10 97.8; video-to-text: R@1 86.0, R@10 99.6 |
| visual-question-answering-on-msrvtt-qa-1 | UMT-L (ViT-L/16) | Accuracy: 0.471 |
| visual-question-answering-on-msvd-qa-1 | UMT-L (ViT-L/16) | Accuracy: 0.552 |
| zero-shot-video-retrieval-on-activitynet | UMT-L (ViT-L/16) | text-to-video: R@1 42.8, R@5 69.6, R@10 79.8; video-to-text: R@1 40.7, R@5 67.6, R@10 78.6 |
| zero-shot-video-retrieval-on-didemo | UMT-L (ViT-L/16) | text-to-video: R@1 48.6, R@5 72.9, R@10 79.0; video-to-text: R@1 49.9, R@5 74.8, R@10 81.4 |
| zero-shot-video-retrieval-on-lsmdc | UMT-L (ViT-L/16) | text-to-video: R@1 25.2, R@5 43.0, R@10 50.5; video-to-text: R@1 23.2, R@5 37.7, R@10 44.2 |
| zero-shot-video-retrieval-on-msr-vtt | UMT-L (ViT-L/16) | text-to-video: R@1 42.6, R@5 64.4, R@10 73.1; video-to-text: R@1 38.6, R@5 59.8, R@10 69.6 |
| zero-shot-video-retrieval-on-msvd | UMT-L (ViT-L/16) | text-to-video: R@1 49.0, R@5 76.9, R@10 84.7; video-to-text: R@1 74.5, R@5 89.7, R@10 92.8 |