HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training

Qinghao Ye Guohai Xu Ming Yan Haiyang Xu Qi Qian Ji Zhang Fei Huang

Abstract

Video-language pre-training has advanced the performance of various downstream video-language tasks. However, most previous methods directly inherit or adapt typical image-language pre-training paradigms to video-language pre-training, and therefore do not fully exploit the characteristic unique to video, namely its temporal dimension. In this paper, we propose a Hierarchical Temporal-Aware video-language pre-training framework, HiTeA, with two novel pre-training tasks for modeling cross-modal alignment between moments and texts as well as the temporal relations of video-text pairs. Specifically, we propose a cross-modal moment exploration task to explore moments in videos, which yields detailed video moment representations. In addition, a multi-modal temporal relation exploration task captures the inherent temporal relations by aligning video-text pairs as a whole at different temporal resolutions. Furthermore, we introduce the shuffling test to evaluate the temporal reliance of datasets and video-language pre-training models. We achieve state-of-the-art results on 15 well-established video-language understanding and generation tasks, especially on temporal-oriented datasets (e.g., SSv2-Template and SSv2-Label), with improvements of 8.6% and 11.1%, respectively. HiTeA also demonstrates strong generalization ability when directly transferred to downstream tasks in a zero-shot manner. Models and demo will be available on ModelScope.
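The shuffling test mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact protocol, so everything here is an assumption made for illustration: the helper names `shuffle_frames` and `shuffling_gap`, the `evaluate` callback, and the choice to report the score difference between ordered and shuffled frames as a measure of temporal reliance.

```python
import random
from typing import Callable, List, Sequence


def shuffle_frames(frames: Sequence, rng: random.Random) -> List:
    """Return a copy of the frames in random temporal order (input is untouched)."""
    shuffled = list(frames)
    rng.shuffle(shuffled)
    return shuffled


def shuffling_gap(
    evaluate: Callable[[List[List]], float],
    videos: Sequence[Sequence],
    seed: int = 0,
) -> float:
    """Score with original frame order minus score with shuffled frame order.

    `evaluate` is a hypothetical callback that maps a list of videos
    (each a list of frames) to a scalar metric such as accuracy.
    A gap near zero suggests the model/dataset pair does not rely on frame order.
    """
    rng = random.Random(seed)
    original_score = evaluate([list(v) for v in videos])
    shuffled_score = evaluate([shuffle_frames(v, rng) for v in videos])
    return original_score - shuffled_score


if __name__ == "__main__":
    # Toy "videos": frames are just integers 0..7 in temporal order.
    toy_videos = [list(range(8)) for _ in range(4)]

    # An order-agnostic toy metric: it only looks at the set of frames.
    def order_agnostic(videos: List[List]) -> float:
        return sum(sum(v) == 28 for v in videos) / len(videos)

    # An order-sensitive toy metric: it requires frames in ascending order.
    def order_sensitive(videos: List[List]) -> float:
        return sum(v == sorted(v) for v in videos) / len(videos)

    print("order-agnostic gap:", shuffling_gap(order_agnostic, toy_videos))
    print("order-sensitive gap:", shuffling_gap(order_sensitive, toy_videos))
```

In this toy setup the order-agnostic metric is unaffected by shuffling, while the order-sensitive one can only stay the same or drop, which is the kind of contrast the shuffling test is meant to expose.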

Benchmarks

| Benchmark | Method | Metrics |
|---|---|---|
| video-captioning-on-msr-vtt-1 | HiTeA | BLEU-4: 49.2; CIDEr: 65.1; METEOR: 30.7; ROUGE-L: 65.0 |
| video-captioning-on-msvd-1 | HiTeA | BLEU-4: 71.0; CIDEr: 146.9; METEOR: 45.3; ROUGE-L: 81.4 |
| video-question-answering-on-msrvtt-mc | HiTeA | Accuracy: 97.4 |
| video-question-answering-on-next-qa | HiTeA | Accuracy: 63.1 |
| video-retrieval-on-activitynet | HiTeA | text-to-video R@1: 49.7; R@5: 77.1; R@10: 86.7 |
| video-retrieval-on-didemo | HiTeA | text-to-video R@1: 56.5; R@5: 81.7; R@10: 89.7 |
| video-retrieval-on-lsmdc | HiTeA | text-to-video R@1: 28.7; R@5: 50.3; R@10: 59.0 |
| video-retrieval-on-msr-vtt-1ka | HiTeA | text-to-video R@1: 46.8; R@5: 71.2; R@10: 81.9 |
| video-retrieval-on-ssv2-label-retrieval | HiTeA | text-to-video R@1: 55.2; R@5: 81.4; R@10: 89.1 |
| video-retrieval-on-ssv2-template-retrieval | HiTeA | text-to-video R@1: 85.6; R@5: 100; R@10: 100 |
| visual-question-answering-on-msrvtt-qa-1 | HiTeA | Accuracy: 0.459 |
| visual-question-answering-on-msvd-qa-1 | HiTeA | Accuracy: 0.556 |
| visual-question-answering-on-tgif-qa | HiTeA | Accuracy: 0.732 |
| zero-shot-learning-on-msrvtt-qa | HiTeA | Accuracy: 21.7 |
| zero-shot-learning-on-msvd-qa | HiTeA | Accuracy: 37.4 |
| zero-shot-video-retrieval-on-didemo | HiTeA-17M | text-to-video R@1: 43.2; R@5: 69.3; R@10: 79.0 |
| zero-shot-video-retrieval-on-didemo | HiTeA-5M | text-to-video R@1: 36.1; R@5: 60.1; R@10: 70.3 |
| zero-shot-video-retrieval-on-lsmdc | HiTeA-17M | text-to-video R@1: 18.3; R@5: 36.7; R@10: 44.2 |
| zero-shot-video-retrieval-on-lsmdc | HiTeA-5M | text-to-video R@1: 15.5; R@5: 31.1; R@10: 39.8 |
| zero-shot-video-retrieval-on-msr-vtt | HiTeA-5M | text-to-video R@1: 29.9; R@5: 54.2; R@10: 62.9 |
| zero-shot-video-retrieval-on-msr-vtt | HiTeA-17M | text-to-video R@1: 34.4; R@5: 60.0; R@10: 69.9 |
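
The text-to-video R@K figures above are recall-at-K scores. As a rough sketch of how such a score is commonly computed, the function below takes a text-to-video similarity matrix and counts a query as a hit when its ground-truth video ranks within the top K. The function name `recall_at_k`, the diagonal ground-truth pairing (query i matches video i), and the tie handling are assumptions for illustration; the page does not describe the evaluation protocol HiTeA actually used.

```python
from typing import List


def recall_at_k(sim: List[List[float]], k: int) -> float:
    """Text-to-video Recall@K in percent.

    sim[i][j] is the similarity between text query i and video j.
    Assumption: the ground-truth video for query i is video i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Zero-based rank of the ground truth: how many videos score strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)


if __name__ == "__main__":
    # Toy 3x3 similarity matrix (rows: text queries, columns: videos).
    toy_sim = [
        [0.9, 0.1, 0.2],  # ground truth ranked 1st
        [0.8, 0.5, 0.6],  # ground truth ranked 3rd
        [0.3, 0.7, 0.6],  # ground truth ranked 2nd
    ]
    for k in (1, 2, 3):
        print(f"R@{k}: {recall_at_k(toy_sim, k):.1f}")
```

Because a hit at a smaller K is also a hit at any larger K, Recall@K computed this way never decreases as K grows.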
