Qinghao Ye, Guohai Xu, Ming Yan, Haiyang Xu, Qi Qian, Ji Zhang, Fei Huang

Abstract
Video-language pre-training has advanced the performance of various downstream video-language tasks. However, most previous methods directly inherit or adapt typical image-language pre-training paradigms to video-language pre-training, and thus do not fully exploit the unique characteristic of video, i.e., its temporal dimension. In this paper, we propose a Hierarchical Temporal-Aware video-language pre-training framework, HiTeA, with two novel pre-training tasks for modeling cross-modal alignment between moments and texts as well as the temporal relations of video-text pairs. Specifically, we propose a cross-modal moment exploration task to explore moments in videos, which results in detailed video moment representations. In addition, the inherent temporal relations are captured by aligning video-text pairs as a whole at different time resolutions via a multi-modal temporal relation exploration task. Furthermore, we introduce the shuffling test to evaluate the temporal reliance of datasets and video-language pre-training models. We achieve state-of-the-art results on 15 well-established video-language understanding and generation tasks, especially on temporal-oriented datasets (e.g., SSv2-Template and SSv2-Label), with improvements of 8.6% and 11.1%, respectively. HiTeA also demonstrates strong generalization ability when directly transferred to downstream tasks in a zero-shot manner. Models and demo will be available on ModelScope.
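The shuffling test mentioned above can be sketched as follows: score a video-text pair with the original frame order, then with randomly shuffled frame orders, and measure the gap. A large gap indicates the model (or dataset) genuinely relies on temporal structure. This is a minimal illustration, not the authors' evaluation code; `score_fn` stands in for a hypothetical video-text matching model.

```python
import random

def shuffling_test(score_fn, frames, n_shuffles=10, seed=0):
    """Estimate temporal reliance as the gap between the score on the
    original frame order and the average score over shuffled orders.
    `score_fn` is a hypothetical callable: frame sequence -> score."""
    rng = random.Random(seed)
    original = score_fn(frames)
    shuffled = []
    for _ in range(n_shuffles):
        perm = frames[:]
        rng.shuffle(perm)
        shuffled.append(score_fn(perm))
    return original - sum(shuffled) / n_shuffles  # temporal-reliance gap

# Toy stand-in for a temporally-aware model: rewards ascending frame order.
frames = list(range(8))
score = lambda seq: sum(a < b for a, b in zip(seq, seq[1:])) / (len(seq) - 1)
gap = shuffling_test(score, frames)
```

A gap near zero would suggest the task can be solved from frame content alone, which is exactly the dataset property the shuffling test is designed to expose.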
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| video-captioning-on-msr-vtt-1 | HiTeA | BLEU-4: 49.2 CIDEr: 65.1 METEOR: 30.7 ROUGE-L: 65.0 |
| video-captioning-on-msvd-1 | HiTeA | BLEU-4: 71.0 CIDEr: 146.9 METEOR: 45.3 ROUGE-L: 81.4 |
| video-question-answering-on-msrvtt-mc | HiTeA | Accuracy: 97.4 |
| video-question-answering-on-next-qa | HiTeA | Accuracy: 63.1 |
| video-retrieval-on-activitynet | HiTeA | text-to-video R@1: 49.7 R@5: 77.1 R@10: 86.7 |
| video-retrieval-on-didemo | HiTeA | text-to-video R@1: 56.5 R@5: 81.7 R@10: 89.7 |
| video-retrieval-on-lsmdc | HiTeA | text-to-video R@1: 28.7 R@5: 50.3 R@10: 59.0 |
| video-retrieval-on-msr-vtt-1ka | HiTeA | text-to-video R@1: 46.8 R@5: 71.2 R@10: 81.9 |
| video-retrieval-on-ssv2-label-retrieval | HiTeA | text-to-video R@1: 55.2 R@5: 81.4 R@10: 89.1 |
| video-retrieval-on-ssv2-template-retrieval | HiTeA | text-to-video R@1: 85.6 R@5: 100 R@10: 100 |
| visual-question-answering-on-msrvtt-qa-1 | HiTeA | Accuracy: 45.9 |
| visual-question-answering-on-msvd-qa-1 | HiTeA | Accuracy: 55.6 |
| visual-question-answering-on-tgif-qa | HiTeA | Accuracy: 73.2 |
| zero-shot-learning-on-msrvtt-qa | HiTeA | Accuracy: 21.7 |
| zero-shot-learning-on-msvd-qa | HiTeA | Accuracy: 37.4 |
| zero-shot-video-retrieval-on-didemo | HiTeA-17M | text-to-video R@1: 43.2 R@5: 69.3 R@10: 79.0 |
| zero-shot-video-retrieval-on-didemo | HiTeA-5M | text-to-video R@1: 36.1 R@5: 60.1 R@10: 70.3 |
| zero-shot-video-retrieval-on-lsmdc | HiTeA-17M | text-to-video R@1: 18.3 R@5: 36.7 R@10: 44.2 |
| zero-shot-video-retrieval-on-lsmdc | HiTeA-5M | text-to-video R@1: 15.5 R@5: 31.1 R@10: 39.8 |
| zero-shot-video-retrieval-on-msr-vtt | HiTeA-5M | text-to-video R@1: 29.9 R@5: 54.2 R@10: 62.9 |
| zero-shot-video-retrieval-on-msr-vtt | HiTeA-17M | text-to-video R@1: 34.4 R@5: 60.0 R@10: 69.9 |
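The text-to-video R@K figures in the table are standard retrieval recalls: the fraction of text queries whose ground-truth video ranks in the top K by similarity. A generic sketch (not the authors' evaluation code), assuming the ground-truth video for query `i` is video `i`:

```python
import numpy as np

def recall_at_k(sim, ks=(1, 5, 10)):
    """sim[i, j]: similarity of text query i to video j; the ground-truth
    video for query i is video i. Returns {K: R@K in percent}."""
    ranks = []
    for i, row in enumerate(sim):
        order = np.argsort(-row)                       # videos, best first
        ranks.append(int(np.where(order == i)[0][0]))  # rank of the match
    ranks = np.asarray(ranks)
    return {k: 100.0 * float((ranks < k).mean()) for k in ks}

# Toy similarity matrix with the correct match boosted on the diagonal.
rng = np.random.default_rng(0)
sim = rng.normal(size=(100, 100))
sim[np.arange(100), np.arange(100)] += 2.0
print(recall_at_k(sim))
```

By construction R@1 ≤ R@5 ≤ R@10, which is why the metric columns above are ordered by increasing K.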