VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners

Shen Yan, Tao Zhu, Zirui Wang, Yuan Cao, Mi Zhang, Soham Ghosh, Yonghui Wu, Jiahui Yu

Abstract

We explore an efficient approach to establishing a foundational video-text model. We present VideoCoCa, which maximally reuses a pretrained image-text contrastive captioner (CoCa) model and adapts it to video-text tasks with minimal extra training. While previous works adapt image-text models with various cross-frame fusion modules, we find that the generative attentional pooling and contrastive attentional pooling layers in CoCa are instantly adaptable to flattened frame embeddings, yielding state-of-the-art results on zero-shot video classification and zero-shot text-to-video retrieval. Furthermore, we explore lightweight finetuning on top of VideoCoCa and achieve strong results on video question answering and video captioning.
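The idea of applying attentional pooling to flattened frame embeddings can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function `attentional_pool`, the single-head attention, the random weights, the tensor sizes, and the weight sharing between the two poolers are all assumptions made for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attentional_pool(tokens, queries, w_k, w_v):
    """Hypothetical single-head attention pooler: queries attend over all tokens.

    tokens:  (L, D) token embeddings
    queries: (Q, D) query vectors (learned in a real model, random here)
    returns: (Q, D) pooled outputs, one per query
    """
    keys = tokens @ w_k
    values = tokens @ w_v
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])  # (Q, L)
    weights = softmax(scores, axis=-1)                   # each row sums to 1
    return weights @ values

rng = np.random.default_rng(0)
T, N, D = 8, 16, 32  # frames, tokens per frame, embedding width (illustrative sizes)

# Stand-in for per-frame outputs of an image encoder.
frame_tokens = rng.normal(size=(T, N, D))

# Flatten the frame axis so the pooler sees one long token sequence.
video_tokens = frame_tokens.reshape(T * N, D)

# Assumption: weights are shared between both poolers to keep the sketch short.
w_k = rng.normal(size=(D, D)) / np.sqrt(D)
w_v = rng.normal(size=(D, D)) / np.sqrt(D)

contrastive_query = rng.normal(size=(1, D))   # one query -> one video-level embedding
generative_queries = rng.normal(size=(4, D))  # several queries -> inputs for a caption decoder

video_embedding = attentional_pool(video_tokens, contrastive_query, w_k, w_v)
caption_inputs = attentional_pool(video_tokens, generative_queries, w_k, w_v)

print(video_embedding.shape, caption_inputs.shape)
```

Because the pooler attends over a token sequence of any length, the same layer accepts either one image's tokens or the concatenated tokens of several frames, so under these assumptions no separate cross-frame fusion module is needed.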

Benchmarks

| Benchmark | Methodology | Metrics |
|---|---|---|
| video-captioning-on-activitynet-captions | VideoCoCa | BLEU-4: 14.7; CIDEr: 39.3; ROUGE-L: 35.0 |
| video-captioning-on-msr-vtt-1 | VideoCoCa | BLEU-4: 53.8; CIDEr: 73.2; ROUGE-L: 68.0 |
| video-captioning-on-vatex-1 | VideoCoCa | BLEU-4: 39.7; CIDEr: 77.8; ROUGE-L: 54.5 |
| video-captioning-on-youcook2 | VideoCoCa | BLEU-4: 14.2; CIDEr: 1.28; ROUGE-L: 37.7 |
| video-question-answering-on-activitynet-qa | VideoCoCa | Accuracy: 56.1 |
| video-question-answering-on-ivqa | VideoCoCa | Accuracy: 39.0 |
| video-retrieval-on-msr-vtt | VideoCoCa (zero-shot) | text-to-video R@1: 34.3; R@5: 57.8; R@10: 67.0; video-to-text R@1: 64.7; R@5: 85.2; R@10: 91.4 |
| video-retrieval-on-youcook2 | VideoCoCa (zero-shot) | text-to-video R@1: 21.7; R@5: 43.9; R@10: 55.2 |
| visual-question-answering-on-msrvtt-qa-1 | VideoCoCa | Accuracy: 0.463 |
| visual-question-answering-on-msvd-qa-1 | VideoCoCa | Accuracy: 0.569 |
| zero-shot-action-recognition-on-charades-1 | VideoCoCa | mAP: 25.8 |
| zero-shot-action-recognition-on-hmdb51 | VideoCoCa | Top-1 Accuracy: 58.7; Top-5 Accuracy: 84.5 |
| zero-shot-action-recognition-on-kinetics | VideoCoCa | Top-1 Accuracy: 70.1; Top-5 Accuracy: 88.9 |
| zero-shot-action-recognition-on-ucf101 | VideoCoCa | Top-1 Accuracy: 86.6; Top-5 Accuracy: 98.4 |
| zero-shot-video-retrieval-on-activitynet | VideoCoCa | text-to-video R@1: 34.5; R@5: 63.2; R@10: 76.6; video-to-text R@1: 33.0; R@5: 61.6; R@10: 75.3 |
| zero-shot-video-retrieval-on-msr-vtt-full | VideoCoCa | text-to-video R@1: 34.3; R@5: 57.8; R@10: 67.0; video-to-text R@1: 64.7; R@5: 85.2; R@10: 91.4 |
| zero-shot-video-retrieval-on-vatex | VideoCoCa | text-to-video R@1: 53.2; R@5: 83.3; R@10: 90.1; video-to-text R@1: 73.6; R@5: 93.2; R@10: 97.2 |
| zero-shot-video-retrieval-on-youcook2 | VideoCoCa | text-to-video R@1: 20.3; R@5: 43.0; R@10: 53.3 |
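
The retrieval rows above report Recall@K (R@K): the fraction of queries whose correct match appears among the top K ranked candidates. A minimal sketch of that computation follows, assuming exactly one ground-truth video per text query and a precomputed similarity matrix. The function name `recall_at_k` and the toy scores are hypothetical, and individual benchmarks may use different protocols, for example several captions per video.

```python
import numpy as np

def recall_at_k(similarity, k):
    """Recall@K for a square similarity matrix.

    similarity[i, j] is the score between query i and candidate j;
    the correct candidate for query i is assumed to be j == i.
    """
    correct = np.diag(similarity)[:, None]
    # Rank of the correct candidate = number of candidates scoring strictly higher.
    ranks = (similarity > correct).sum(axis=1)
    return float((ranks < k).mean())

# Toy text-to-video scores: rows are text queries, columns are videos.
sim = np.array([
    [0.9, 0.1, 0.2],
    [0.3, 0.2, 0.8],
    [0.1, 0.7, 0.6],
])

t2v_r1 = recall_at_k(sim, 1)    # text-to-video R@1
t2v_r2 = recall_at_k(sim, 2)    # text-to-video R@2
v2t_r1 = recall_at_k(sim.T, 1)  # video-to-text: transpose so rows are videos
```

Transposing the matrix swaps the roles of query and candidate, which is why text-to-video and video-to-text recall can differ on the same benchmark.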
