Qinghao Ye, Guohai Xu, Ming Yan, Haiyang Xu, Qi Qian, Ji Zhang, Fei Huang

Abstract
Video-language pre-training has advanced the performance of various downstream video-language tasks. However, most previous methods directly inherit or adapt typical image-language pre-training paradigms to video-language pre-training, and thus do not fully exploit the unique characteristic of video, i.e., its temporal dimension. In this paper, we propose a Hierarchical Temporal-Aware video-language pre-training framework, HiTeA, with two novel pre-training tasks for modeling cross-modal alignment between moments and texts as well as the temporal relations of video-text pairs. Specifically, we propose a cross-modal moment exploration task to explore moments in videos, which results in detailed video moment representations. In addition, the inherent temporal relations are captured by aligning video-text pairs as a whole at different time resolutions via a multi-modal temporal relation exploration task. Furthermore, we introduce the shuffling test to evaluate the temporal reliance of datasets and video-language pre-training models. We achieve state-of-the-art results on 15 well-established video-language understanding and generation tasks, especially on temporal-oriented datasets (e.g., SSv2-Template and SSv2-Label), with improvements of 8.6% and 11.1%, respectively. HiTeA also demonstrates strong generalization ability when directly transferred to downstream tasks in a zero-shot manner. Models and demo will be available on ModelScope.
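The shuffling test mentioned above can be sketched as follows: score a video-text pair with the original frame order, then with randomly shuffled frame orders, and measure the gap. A large gap indicates the model (or dataset) genuinely relies on temporal structure. This is a minimal illustration, not the authors' evaluation code; `score_fn` stands in for a hypothetical video-text matching model.

```python
import random

def shuffling_test(score_fn, frames, n_shuffles=10, seed=0):
    """Estimate temporal reliance as the gap between the score on the
    original frame order and the average score over shuffled orders.
    `score_fn` is a hypothetical callable: frame sequence -> score."""
    rng = random.Random(seed)
    original = score_fn(frames)
    shuffled = []
    for _ in range(n_shuffles):
        perm = frames[:]
        rng.shuffle(perm)
        shuffled.append(score_fn(perm))
    return original - sum(shuffled) / n_shuffles  # temporal-reliance gap

# Toy stand-in for a temporally-aware model: rewards ascending frame order.
frames = list(range(8))
score = lambda seq: sum(a < b for a, b in zip(seq, seq[1:])) / (len(seq) - 1)
gap = shuffling_test(score, frames)
```

A gap near zero would suggest the task can be solved from frame content alone, which is exactly the dataset property the shuffling test is designed to expose.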
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| video-captioning-on-msr-vtt-1 | HiTeA | BLEU-4: 49.2 CIDEr: 65.1 METEOR: 30.7 ROUGE-L: 65.0 |
| video-captioning-on-msvd-1 | HiTeA | BLEU-4: 71.0 CIDEr: 146.9 METEOR: 45.3 ROUGE-L: 81.4 |
| video-question-answering-on-msrvtt-mc | HiTeA | Accuracy: 97.4 |
| video-question-answering-on-next-qa | HiTeA | Accuracy: 63.1 |
| video-retrieval-on-activitynet | HiTeA | text-to-video R@1: 49.7 R@5: 77.1 R@10: 86.7 |
| video-retrieval-on-didemo | HiTeA | text-to-video R@1: 56.5 R@5: 81.7 R@10: 89.7 |
| video-retrieval-on-lsmdc | HiTeA | text-to-video R@1: 28.7 R@5: 50.3 R@10: 59.0 |
| video-retrieval-on-msr-vtt-1ka | HiTeA | text-to-video R@1: 46.8 R@5: 71.2 R@10: 81.9 |
| video-retrieval-on-ssv2-label-retrieval | HiTeA | text-to-video R@1: 55.2 R@5: 81.4 R@10: 89.1 |
| video-retrieval-on-ssv2-template-retrieval | HiTeA | text-to-video R@1: 85.6 R@5: 100 R@10: 100 |
| visual-question-answering-on-msrvtt-qa-1 | HiTeA | Accuracy: 45.9 |
| visual-question-answering-on-msvd-qa-1 | HiTeA | Accuracy: 55.6 |
| visual-question-answering-on-tgif-qa | HiTeA | Accuracy: 73.2 |
| zero-shot-learning-on-msrvtt-qa | HiTeA | Accuracy: 21.7 |
| zero-shot-learning-on-msvd-qa | HiTeA | Accuracy: 37.4 |
| zero-shot-video-retrieval-on-didemo | HiTeA-17M | text-to-video R@1: 43.2 R@5: 69.3 R@10: 79.0 |
| zero-shot-video-retrieval-on-didemo | HiTeA-5M | text-to-video R@1: 36.1 R@5: 60.1 R@10: 70.3 |
| zero-shot-video-retrieval-on-lsmdc | HiTeA-17M | text-to-video R@1: 18.3 R@5: 36.7 R@10: 44.2 |
| zero-shot-video-retrieval-on-lsmdc | HiTeA-5M | text-to-video R@1: 15.5 R@5: 31.1 R@10: 39.8 |
| zero-shot-video-retrieval-on-msr-vtt | HiTeA-5M | text-to-video R@1: 29.9 R@5: 54.2 R@10: 62.9 |
| zero-shot-video-retrieval-on-msr-vtt | HiTeA-17M | text-to-video R@1: 34.4 R@5: 60.0 R@10: 69.9 |
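The text-to-video R@K figures in the table are standard retrieval recalls: the fraction of text queries whose ground-truth video ranks in the top K by similarity. A generic sketch (not the authors' evaluation code), assuming the ground-truth video for query `i` is video `i`:

```python
import numpy as np

def recall_at_k(sim, ks=(1, 5, 10)):
    """sim[i, j]: similarity of text query i to video j; the ground-truth
    video for query i is video i. Returns {K: R@K in percent}."""
    ranks = []
    for i, row in enumerate(sim):
        order = np.argsort(-row)                       # videos, best first
        ranks.append(int(np.where(order == i)[0][0]))  # rank of the match
    ranks = np.asarray(ranks)
    return {k: 100.0 * float((ranks < k).mean()) for k in ks}

# Toy similarity matrix with the correct match boosted on the diagonal.
rng = np.random.default_rng(0)
sim = rng.normal(size=(100, 100))
sim[np.arange(100), np.arange(100)] += 2.0
print(recall_at_k(sim))
```

By construction R@1 ≤ R@5 ≤ R@10, which is why the metric columns above are ordered by increasing K.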