VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners

Abstract

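We propose an efficient approach to building a foundational video-text model. This paper presents VideoCoCa, which maximally reuses a pretrained image-text contrastive captioner (CoCa) model and adapts it to video-text tasks with minimal extra training. Whereas previous work adapts image-text models by adding various cross-frame fusion modules, we find that the generative attentional pooling and contrastive attentional pooling layers in CoCa can be applied directly to flattened frame embeddings, yielding state-of-the-art results on zero-shot video classification and zero-shot text-to-video retrieval. In addition, we explore lightweight finetuning strategies on top of VideoCoCa and achieve strong results on video question answering and video captioning.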
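The idea of applying an attentional pooler to flattened frame embeddings can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: it assumes a single attention head with no learned key/value projections or layer normalization, and the function names (`flatten_frames`, `attentional_pool`), the toy dimensions, and the query counts are all hypothetical. It only shows that a pooler which attends over a set of tokens works unchanged when the tokens of all frames are concatenated into one sequence.

```python
import math


def flatten_frames(frame_tokens):
    """Concatenate per-frame token lists (T frames x N tokens) into one list of T*N tokens."""
    return [token for frame in frame_tokens for token in frame]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentional_pool(queries, tokens):
    """Simplified single-head attention pooling (assumption: no learned projections).

    Each query attends over all tokens and returns a weighted average of them,
    so the output has one vector per query regardless of how many tokens there are.
    """
    dim = len(tokens[0])
    pooled = []
    for query in queries:
        scores = [
            sum(q * t for q, t in zip(query, token)) / math.sqrt(dim)
            for token in tokens
        ]
        weights = softmax(scores)
        pooled.append(
            [sum(w * token[j] for w, token in zip(weights, tokens)) for j in range(dim)]
        )
    return pooled


# Toy video: 2 frames, 2 tokens per frame, embedding dimension 2.
frames = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.0], [0.0, 2.0]],
]
tokens = flatten_frames(frames)  # 4 tokens in one sequence

# One query, as a stand-in for a contrastive-style pooler (single video embedding).
contrastive_out = attentional_pool([[0.0, 0.0]], tokens)

# Several queries, as a stand-in for a generative-style pooler (a token sequence
# that a text decoder could cross-attend to). The count 3 is illustrative.
generative_out = attentional_pool([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], tokens)
```

Because the pooler's output size depends only on the number of queries, the same pooling code handles one image's tokens or many frames' tokens, which is the property the abstract relies on.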

Benchmarks

| Benchmark | Method | Metrics |
|---|---|---|
| video-captioning-on-activitynet-captions | VideoCoCa | BLEU-4: 14.7; CIDEr: 39.3; ROUGE-L: 35.0 |
| video-captioning-on-msr-vtt-1 | VideoCoCa | BLEU-4: 53.8; CIDEr: 73.2; ROUGE-L: 68.0 |
| video-captioning-on-vatex-1 | VideoCoCa | BLEU-4: 39.7; CIDEr: 77.8; ROUGE-L: 54.5 |
| video-captioning-on-youcook2 | VideoCoCa | BLEU-4: 14.2; CIDEr: 1.28; ROUGE-L: 37.7 |
| video-question-answering-on-activitynet-qa | VideoCoCa | Accuracy: 56.1 |
| video-question-answering-on-ivqa | VideoCoCa | Accuracy: 39.0 |
| video-retrieval-on-msr-vtt | VideoCoCa (zero-shot) | text-to-video R@1: 34.3, R@5: 57.8, R@10: 67.0; video-to-text R@1: 64.7, R@5: 85.2, R@10: 91.4 |
| video-retrieval-on-youcook2 | VideoCoCa (zero-shot) | text-to-video R@1: 21.7, R@5: 43.9, R@10: 55.2 |
| visual-question-answering-on-msrvtt-qa-1 | VideoCoCa | Accuracy: 0.463 |
| visual-question-answering-on-msvd-qa-1 | VideoCoCa | Accuracy: 0.569 |
| zero-shot-action-recognition-on-charades-1 | VideoCoCa | mAP: 25.8 |
| zero-shot-action-recognition-on-hmdb51 | VideoCoCa | Top-1 Accuracy: 58.7; Top-5 Accuracy: 84.5 |
| zero-shot-action-recognition-on-kinetics | VideoCoCa | Top-1 Accuracy: 70.1; Top-5 Accuracy: 88.9 |
| zero-shot-action-recognition-on-ucf101 | VideoCoCa | Top-1 Accuracy: 86.6; Top-5 Accuracy: 98.4 |
| zero-shot-video-retrieval-on-activitynet | VideoCoCa | text-to-video R@1: 34.5, R@5: 63.2, R@10: 76.6; video-to-text R@1: 33.0, R@5: 61.6, R@10: 75.3 |
| zero-shot-video-retrieval-on-msr-vtt-full | VideoCoCa | text-to-video R@1: 34.3, R@5: 57.8, R@10: 67.0; video-to-text R@1: 64.7, R@5: 85.2, R@10: 91.4 |
| zero-shot-video-retrieval-on-vatex | VideoCoCa | text-to-video R@1: 53.2, R@5: 83.3, R@10: 90.1; video-to-text R@1: 73.6, R@5: 93.2, R@10: 97.2 |
| zero-shot-video-retrieval-on-youcook2 | VideoCoCa | text-to-video R@1: 20.3, R@5: 43.0, R@10: 53.3 |
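For readers unfamiliar with the R@K retrieval metrics reported above, the following is a minimal sketch of how Recall@K is commonly computed from a text-video similarity matrix. It assumes exactly one correct video per text query, stored at the matching index, which is a simplification: the evaluation protocols of the individual benchmarks may differ (for example, several captions per video). The function name `recall_at_k` and the toy scores are hypothetical and are not taken from the paper.

```python
def recall_at_k(similarity, k):
    """Percentage of queries whose correct item is ranked within the top k.

    similarity[i][j] is the score of text query i against video j.
    Assumption: the correct video for query i is video i.
    Ties are resolved optimistically (only strictly higher scores count as better).
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank of the correct video = number of videos scoring higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy similarity matrix: 3 text queries x 3 videos.
sim = [
    [0.9, 0.1, 0.3],  # correct video ranked 1st
    [0.8, 0.5, 0.6],  # correct video ranked 3rd
    [0.2, 0.7, 0.4],  # correct video ranked 2nd
]

r_at_1 = recall_at_k(sim, 1)  # 1 of 3 queries hit
r_at_2 = recall_at_k(sim, 2)  # 2 of 3 queries hit
r_at_3 = recall_at_k(sim, 3)  # all queries hit
```

Under the same one-to-one assumption, video-to-text recall is obtained by transposing the matrix so that rows index videos and columns index texts.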
