
Unmasked Teacher: Towards Training-Efficient Video Foundation Models

Abstract

Video Foundation Models (VFMs) have received limited exploration due to high computational costs and data scarcity. Previous VFMs rely on Image Foundation Models (IFMs), which face challenges in transferring to the video domain. Although VideoMAE has trained a robust Vision Transformer (ViT) from limited data, its low-level reconstruction leads to convergence difficulties and conflicts with high-level cross-modal alignment. This paper proposes a training-efficient method for temporal-sensitive VFMs that integrates the benefits of existing methods. To increase data efficiency, we mask out most of the low-semantics video tokens, but selectively align the unmasked tokens with the IFM, which serves as the UnMasked Teacher (UMT). By providing semantic guidance, our method enables faster convergence and multimodal friendliness. With a progressive pre-training framework, our model can handle various tasks, including scene-related, temporal-related, and complex video-language understanding. Using only public sources for pre-training over 6 days on 32 A100 GPUs, our ViT-L/16 built from scratch achieves state-of-the-art performance on a variety of video tasks. The code and models will be released at https://github.com/OpenGVLab/unmasked_teacher.
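The core objective is simple enough to sketch in code. Below is a minimal, self-contained PyTorch sketch of the unmasked-teacher alignment described above: mask most video tokens, run the student only on the tokens that survive, and align those tokens with a frozen image teacher's features at the same positions. The function name, the toy encoders, the 80% mask ratio, the random token selection, and the cosine loss are all illustrative assumptions for exposition, not the repository's actual API; in particular, the paper selects tokens by semantics rather than at random.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def umt_alignment_loss(student: nn.Module,
                       teacher: nn.Module,
                       proj: nn.Module,
                       tokens: torch.Tensor,
                       mask_ratio: float = 0.8) -> torch.Tensor:
    """Align the student's unmasked tokens with a frozen teacher.

    tokens: (B, N, D) patch embeddings of a video clip.
    """
    B, N, D = tokens.shape
    num_keep = max(1, int(N * (1.0 - mask_ratio)))

    # Choose which tokens survive masking. Random here for brevity; the paper
    # keeps the high-semantics tokens instead.
    keep = torch.rand(B, N, device=tokens.device).argsort(dim=1)[:, :num_keep]
    idx = keep.unsqueeze(-1).expand(-1, -1, D)
    visible = tokens.gather(1, idx)                 # (B, num_keep, D)

    # The student sees only the unmasked tokens (this is where the data
    # efficiency comes from); the frozen teacher encodes the full sequence.
    student_feats = proj(student(visible))          # (B, num_keep, D)
    with torch.no_grad():
        teacher_feats = teacher(tokens).gather(1, idx)

    # Token-wise alignment; negative cosine similarity is one common choice.
    s = F.normalize(student_feats, dim=-1)
    t = F.normalize(teacher_feats, dim=-1)
    return 1.0 - (s * t).sum(dim=-1).mean()


if __name__ == "__main__":
    D = 256  # toy embedding width; the paper's student is a ViT-L/16
    make_enc = lambda: nn.TransformerEncoder(
        nn.TransformerEncoderLayer(D, nhead=8, batch_first=True), num_layers=2)
    student, teacher = make_enc(), make_enc()
    for p in teacher.parameters():
        p.requires_grad = False                     # the teacher is never updated
    proj = nn.Linear(D, D)

    clip_tokens = torch.randn(2, 4 * 196, D)        # 4 frames x 196 patches each
    loss = umt_alignment_loss(student, teacher, proj, clip_tokens)
    loss.backward()
    print(f"alignment loss: {loss.item():.4f}")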

Code Repository

opengvlab/unmasked_teacher
Official
pytorch
Mentioned in GitHub

Benchmarks

| Benchmark | Method | Metrics |
|---|---|---|
| action-classification-on-kinetics-400 | UMT-L (ViT-L/16) | Acc@1: 90.6, Acc@5: 98.7 |
| action-classification-on-kinetics-400 | Unmasked Teacher (ViT-L) | Acc@1: 90.6, Acc@5: 98.7, FLOPs (G) × views: 1434×3×4, Parameters (M): 304 |
| action-classification-on-kinetics-600 | UMT-L (ViT-L/16) | Top-1 Accuracy: 90.5, Top-5 Accuracy: 98.8 |
| action-classification-on-kinetics-700 | UMT-L (ViT-L/16) | Top-1 Accuracy: 83.6, Top-5 Accuracy: 96.7 |
| action-classification-on-moments-in-time | UMT-L (ViT-L/16) | Top-1 Accuracy: 48.7, Top-5 Accuracy: 78.2 |
| action-recognition-on-ava-v2-2 | UMT-L (ViT-L/16) | mAP: 39.8 |
| video-question-answering-on-activitynet-qa | UMT-L (ViT-L/16) | Accuracy: 47.9 |
| video-retrieval-on-activitynet | UMT-L (ViT-L/16) | text-to-video R@1/R@5/R@10: 66.8/89.1/94.9; video-to-text R@1/R@5/R@10: 64.4/89.1/94.8 |
| video-retrieval-on-didemo | UMT-L (ViT-L/16) | text-to-video R@1/R@5/R@10: 70.4/90.1/93.5; video-to-text R@1/R@5/R@10: 65.7/89.6/93.3 |
| video-retrieval-on-lsmdc | UMT-L (ViT-L/16) | text-to-video R@1/R@5/R@10: 43.0/65.5/73.0; video-to-text R@1/R@5/R@10: 41.4/64.3/71.5 |
| video-retrieval-on-msr-vtt | UMT-L (ViT-L/16) | text-to-video R@1/R@5/R@10: 58.8/81.0/87.1; video-to-text R@1/R@5/R@10: 58.6/81.6/86.5 |
| video-retrieval-on-ssv2-label-retrieval | UMT-L (ViT-L/16) | text-to-video R@1/R@5/R@10: 73.3/92.7/96.6 |
| video-retrieval-on-ssv2-template-retrieval | UMT-L (ViT-L/16) | text-to-video R@1/R@5/R@10: 90.8/100.0/100.0 |
| video-retrieval-on-vatex | Unmasked Teacher | text-to-video R@1/R@5/R@10: 72.0/95.1/97.8; video-to-text R@1: 86.0, R@10: 99.6 |
| visual-question-answering-on-msrvtt-qa-1 | UMT-L (ViT-L/16) | Accuracy: 0.471 |
| visual-question-answering-on-msvd-qa-1 | UMT-L (ViT-L/16) | Accuracy: 0.552 |
| zero-shot-video-retrieval-on-activitynet | UMT-L (ViT-L/16) | text-to-video R@1/R@5/R@10: 42.8/69.6/79.8; video-to-text R@1/R@5/R@10: 40.7/67.6/78.6 |
| zero-shot-video-retrieval-on-didemo | UMT-L (ViT-L/16) | text-to-video R@1/R@5/R@10: 48.6/72.9/79.0; video-to-text R@1/R@5/R@10: 49.9/74.8/81.4 |
| zero-shot-video-retrieval-on-lsmdc | UMT-L (ViT-L/16) | text-to-video R@1/R@5/R@10: 25.2/43.0/50.5; video-to-text R@1/R@5/R@10: 23.2/37.7/44.2 |
| zero-shot-video-retrieval-on-msr-vtt | UMT-L (ViT-L/16) | text-to-video R@1/R@5/R@10: 42.6/64.4/73.1; video-to-text R@1/R@5/R@10: 38.6/59.8/69.6 |
| zero-shot-video-retrieval-on-msvd | UMT-L (ViT-L/16) | text-to-video R@1/R@5/R@10: 49.0/76.9/84.7; video-to-text R@1/R@5/R@10: 74.5/89.7/92.8 |
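For readers unfamiliar with the retrieval metrics above, here is a hedged sketch of how Recall@k is conventionally computed from a text-video similarity matrix. The function name and the random matrix are illustrative; the real numbers come from similarities between the model's text and video embeddings.

```python
import torch


def recall_at_k(sim: torch.Tensor, ks=(1, 5, 10)) -> dict:
    """sim[i, j] = similarity of text query i to video j; ground truth is the
    diagonal (query i matches video i), the standard setup on MSR-VTT etc."""
    ranks = sim.argsort(dim=1, descending=True)          # (Q, V) ranked indices
    gt = torch.arange(sim.size(0)).unsqueeze(1)          # (Q, 1) true matches
    # Position of the ground-truth video in each query's ranking.
    pos = (ranks == gt).nonzero()[:, 1]                  # (Q,)
    return {f"R@{k}": 100.0 * (pos < k).float().mean().item() for k in ks}


if __name__ == "__main__":
    sim = torch.randn(1000, 1000)                        # dummy similarity matrix
    print(recall_at_k(sim))                              # text-to-video
    print(recall_at_k(sim.t()))                          # video-to-text
```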
