Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions
Hongwei Xue, Tiankai Hang, Yanhong Zeng, Yuchong Sun, Bei Liu, Huan Yang, Jianlong Fu, Baining Guo

Abstract
We study joint video and language (VL) pre-training to enable cross-modality learning and benefit a wide range of downstream VL tasks. Existing works either extract low-quality video features or learn limited text embeddings, neglecting the fact that high-resolution videos and diversified semantics can significantly improve cross-modality learning. In this paper, we propose a novel High-resolution and Diversified VIdeo-LAnguage pre-training model (HD-VILA) for many visual tasks. In particular, we collect a large dataset with two distinct properties: 1) it is the first high-resolution dataset, including 371.5k hours of 720p videos, and 2) it is the most diversified dataset, covering 15 popular YouTube categories. To enable VL pre-training, we jointly optimize the HD-VILA model with a hybrid Transformer that learns rich spatiotemporal features and a multimodal Transformer that enforces interactions between the learned video features and diversified texts. Our pre-training model achieves new state-of-the-art results on 10 VL understanding tasks and on 2 additional, novel text-to-visual generation tasks. For example, we outperform SOTA models with relative increases of 40.4% in R@1 on the zero-shot MSR-VTT text-to-video retrieval task and 55.4% on the high-resolution dataset LSMDC. The learned VL embedding is also effective in generating visually pleasing and semantically relevant results in text-to-visual editing and super-resolution tasks.
Benchmarks
| Benchmark | Methodology | Text-to-video R@1 | Text-to-video R@5 | Text-to-video R@10 | Text-to-video R@50 | Text-to-video Median Rank |
|---|---|---|---|---|---|---|
| video-retrieval-on-activitynet | HD-VILA | 28.5 | 57.4 | - | 94 | 4 |
| video-retrieval-on-didemo | HD-VILA | 28.8 | 57.4 | 69.1 | - | 4 |
| video-retrieval-on-lsmdc | HD-VILA | 17.4 | 34.1 | 44.1 | - | 15 |
| video-retrieval-on-msr-vtt | HD-VILA | 35.6 | 65.3 | 78 | - | 3 |
| zero-shot-video-retrieval-on-msr-vtt | HD-VILA | 14.6 | 34.4 | 44.1 | - | 15 |
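
For readers unfamiliar with the retrieval metrics reported above, the following is a minimal sketch of how text-to-video R@K and Median Rank are commonly computed from a query-by-video similarity matrix. It is not the HD-VILA evaluation code: the function names are hypothetical, the toy scores are made up for illustration, and it assumes that the correct video for query `i` sits at index `i` and that ties are resolved in the query's favor.

```python
from statistics import median


def ranks_from_similarity(sim):
    """Return the 1-based rank of the correct video for each text query.

    Assumption: sim[i][j] is the score of text query i against video j,
    and the ground-truth video for query i is video i.
    """
    ranks = []
    for i, row in enumerate(sim):
        # Rank = 1 + number of videos scoring strictly higher than the match.
        ranks.append(1 + sum(1 for score in row if score > row[i]))
    return ranks


def recall_at_k(ranks, k):
    """Percentage of queries whose correct video is ranked in the top k."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


def median_rank(ranks):
    """Median of the ranks of the correct videos (lower is better)."""
    return median(ranks)


# Toy similarity matrix: 4 text queries against 4 videos (illustrative only).
sim = [
    [0.9, 0.1, 0.3, 0.2],
    [0.2, 0.4, 0.8, 0.1],
    [0.1, 0.2, 0.7, 0.3],
    [0.6, 0.5, 0.4, 0.3],
]
ranks = ranks_from_similarity(sim)
print("ranks:", ranks)
print("R@1:", recall_at_k(ranks, 1))
print("R@5:", recall_at_k(ranks, 5))
print("Median Rank:", median_rank(ranks))
```

Higher R@K and lower Median Rank both indicate better retrieval, which is why the table reports them side by side.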