
Abstract
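We present an efficient approach to building a foundational video-text model. We introduce VideoCoCa, which maximally reuses a pretrained image-text contrastive captioner (CoCa) model and adapts it to video-text tasks with minimal extra training. Whereas prior work adapts image-text models by adding various cross-frame fusion modules, we find that the generative attentional pooling and contrastive attentional pooling layers in CoCa can be applied directly to flattened frame embeddings, yielding state-of-the-art results on zero-shot video classification and zero-shot text-to-video retrieval. In addition, we explore lightweight finetuning on top of VideoCoCa and achieve strong results on video question answering and video captioning.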
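To make the pooling idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes single-head attention with no learned projections or layer normalization, uses random vectors in place of learned queries and frame tokens, and the names `flatten_frames` and `attentional_pool` are hypothetical. The query counts (1 for the contrastive pooler, 5 for the generative pooler) are illustrative only.

```python
import math
import random


def flatten_frames(frame_tokens):
    """Concatenate per-frame token lists [T][N][D] into one sequence [T*N][D]."""
    return [tok for frame in frame_tokens for tok in frame]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentional_pool(queries, tokens):
    """Single-head cross-attention: each query attends over all tokens.

    queries: [Q][D] query vectors (random here; learned in a real model)
    tokens:  [L][D] flattened frame tokens, used as both keys and values
    returns: [Q][D] pooled embeddings
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    pooled = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        pooled.append(
            [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]
        )
    return pooled


rng = random.Random(0)
T, N, D = 4, 3, 8  # frames, tokens per frame, embedding width (toy sizes)

# Stand-in for the per-frame outputs of an image encoder.
frame_tokens = [
    [[rng.gauss(0, 1) for _ in range(D)] for _ in range(N)] for _ in range(T)
]
video_tokens = flatten_frames(frame_tokens)  # one sequence of T*N tokens

# Assumption: one query for the contrastive pooler, several for the generative one.
contrastive_query = [[rng.gauss(0, 1) for _ in range(D)]]
generative_queries = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(5)]

video_embedding = attentional_pool(contrastive_query, video_tokens)
decoder_inputs = attentional_pool(generative_queries, video_tokens)
```

Because the poolers attend over a variable-length token sequence, the same pooling function accepts `T*N` video tokens just as it would accept `N` tokens from a single image, which is why no extra cross-frame fusion module is needed in this sketch.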
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| video-captioning-on-activitynet-captions | VideoCoCa | BLEU-4: 14.7 CIDEr: 39.3 ROUGE-L: 35.0 |
| video-captioning-on-msr-vtt-1 | VideoCoCa | BLEU-4: 53.8 CIDEr: 73.2 ROUGE-L: 68.0 |
| video-captioning-on-vatex-1 | VideoCoCa | BLEU-4: 39.7 CIDEr: 77.8 ROUGE-L: 54.5 |
| video-captioning-on-youcook2 | VideoCoCa | BLEU-4: 14.2 CIDEr: 128.0 ROUGE-L: 37.7 |
| video-question-answering-on-activitynet-qa | VideoCoCa | Accuracy: 56.1 |
| video-question-answering-on-ivqa | VideoCoCa | Accuracy: 39.0 |
| video-retrieval-on-msr-vtt | VideoCoCa (zero-shot) | text-to-video R@1: 34.3 text-to-video R@5: 57.8 text-to-video R@10: 67.0 video-to-text R@1: 64.7 video-to-text R@5: 85.2 video-to-text R@10: 91.4 |
| video-retrieval-on-youcook2 | VideoCoCa (zero-shot) | text-to-video R@1: 21.7 text-to-video R@5: 43.9 text-to-video R@10: 55.2 |
| visual-question-answering-on-msrvtt-qa-1 | VideoCoCa | Accuracy: 46.3 |
| visual-question-answering-on-msvd-qa-1 | VideoCoCa | Accuracy: 56.9 |
| zero-shot-action-recognition-on-charades-1 | VideoCoCa | mAP: 25.8 |
| zero-shot-action-recognition-on-hmdb51 | VideoCoCa | Top-1 Accuracy: 58.7 Top-5 Accuracy: 84.5 |
| zero-shot-action-recognition-on-kinetics | VideoCoCa | Top-1 Accuracy: 70.1 Top-5 Accuracy: 88.9 |
| zero-shot-action-recognition-on-ucf101 | VideoCoCa | Top-1 Accuracy: 86.6 Top-5 Accuracy: 98.4 |
| zero-shot-video-retrieval-on-activitynet | VideoCoCa | text-to-video R@1: 34.5 text-to-video R@5: 63.2 text-to-video R@10: 76.6 video-to-text R@1: 33.0 video-to-text R@5: 61.6 video-to-text R@10: 75.3 |
| zero-shot-video-retrieval-on-msr-vtt-full | VideoCoCa | text-to-video R@1: 34.3 text-to-video R@5: 57.8 text-to-video R@10: 67.0 video-to-text R@1: 64.7 video-to-text R@5: 85.2 video-to-text R@10: 91.4 |
| zero-shot-video-retrieval-on-vatex | VideoCoCa | text-to-video R@1: 53.2 text-to-video R@5: 83.3 text-to-video R@10: 90.1 video-to-text R@1: 73.6 video-to-text R@5: 93.2 video-to-text R@10: 97.2 |
| zero-shot-video-retrieval-on-youcook2 | VideoCoCa | text-to-video R@1: 20.3 text-to-video R@5: 43.0 text-to-video R@10: 53.3 |
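
For readers unfamiliar with the R@K retrieval metrics in the table, the sketch below shows one common way Recall@K is computed from a text-to-video similarity matrix. It is an illustration, not the evaluation code behind these numbers. It assumes exactly one ground-truth video per text query, paired along the diagonal; the function name `recall_at_k` and the toy scores are hypothetical.

```python
def recall_at_k(similarity, k):
    """Percentage of queries whose ground-truth item ranks in the top k.

    similarity[i][j] is the score between text query i and video j.
    Assumption: the ground-truth video for query i is video i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank = number of candidates scoring strictly higher than the ground truth.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy 3x3 similarity matrix (rows: text queries, columns: videos).
similarity = [
    [0.9, 0.1, 0.3],  # ground truth (0.9) is ranked first
    [0.2, 0.4, 0.8],  # ground truth (0.4) is ranked second
    [0.5, 0.6, 0.7],  # ground truth (0.7) is ranked first
]

r1 = recall_at_k(similarity, 1)  # 2 of 3 queries hit at rank 1
r2 = recall_at_k(similarity, 2)  # all 3 queries hit within the top 2
```

Video-to-text recall follows the same procedure on the transposed matrix. Benchmarks with several captions per video need a ground-truth mapping in place of the diagonal assumption used here.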