An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling

Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, Zicheng Liu

Abstract

Masked visual modeling (MVM) has recently proven effective for visual pre-training. While similar reconstructive objectives on video inputs (e.g., masked frame modeling) have been explored in video-language (VidL) pre-training, previous studies have not found an MVM strategy that substantially benefits downstream performance. In this work, we systematically examine the potential of MVM in the context of VidL learning. Specifically, we base our study on a fully end-to-end VIdeO-LanguagE Transformer (VIOLET), where the supervision from MVM training can be backpropagated to the video pixel space. In total, eight different reconstructive targets of MVM are explored, from low-level pixel values and oriented gradients to high-level depth maps, optical flow, discrete visual tokens, and latent visual features. We conduct comprehensive experiments and provide insights into the factors leading to effective MVM training, resulting in an enhanced model, VIOLETv2. Empirically, we show that VIOLETv2 pre-trained with the MVM objective achieves notable improvements on 13 VidL benchmarks, ranging from video question answering and video captioning to text-to-video retrieval.
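
To make the idea of a reconstructive MVM target concrete, the following is a minimal sketch of the simplest target named above, raw pixel values. It is not the authors' implementation: the helper names `sample_mask` and `masked_l1_loss`, the 25% mask ratio, the L1 loss, and the constant stand-in "prediction" are all assumptions chosen for illustration. The sketch only shows the general pattern of hiding some video patches and scoring the reconstruction on the hidden patches alone.

```python
import random


def sample_mask(num_patches, mask_ratio, rng):
    """Choose which patch indices to hide (hypothetical helper)."""
    num_masked = max(1, round(num_patches * mask_ratio))
    return sorted(rng.sample(range(num_patches), num_masked))


def masked_l1_loss(predicted, target, masked_indices):
    """Mean absolute error computed over the masked patches only."""
    total, count = 0.0, 0
    for i in masked_indices:
        for p, t in zip(predicted[i], target[i]):
            total += abs(p - t)
            count += 1
    return total / count


rng = random.Random(0)
# Toy "video": 8 patches, each flattened to 4 pixel values in [0, 1].
target = [[rng.random() for _ in range(4)] for _ in range(8)]
# Assumed mask ratio of 25%; the value is illustrative only.
masked = sample_mask(len(target), mask_ratio=0.25, rng=rng)
# Stand-in for a model output: predicts mid-gray for every pixel.
predicted = [[0.5] * 4 for _ in target]
loss = masked_l1_loss(predicted, target, masked)
print(f"masked patches: {masked}, L1 loss on masked patches: {loss:.3f}")
```

Swapping the pixel target for another of the targets listed in the abstract (for example depth maps or latent visual features) would change what `target` holds, while the mask-then-reconstruct structure stays the same.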

Benchmarks

| Benchmark | Methodology | Metrics |
| --- | --- | --- |
| video-captioning-on-msr-vtt-1 | VIOLETv2 | CIDEr: 58 |
| video-captioning-on-msvd-1 | VIOLETv2 | CIDEr: 139.2 |
| video-question-answering-on-lsmdc-mc | VIOLETv2 | Accuracy: 84.4 |
| video-question-answering-on-msrvtt-mc | VIOLETv2 | Accuracy: 97.6 |
| video-question-answering-on-msrvtt-qa | VIOLETv2 | Accuracy: 44.5 |
| video-retrieval-on-didemo | VIOLETv2 | text-to-video R@1: 47.9, R@5: 76.5, R@10: 84.1 |
| video-retrieval-on-lsmdc | VIOLETv2 | text-to-video R@1: 24, R@5: 43.5, R@10: 54.1 |
| video-retrieval-on-msr-vtt | VIOLETv2 | text-to-video R@1: 37.2, R@5: 64.8, R@10: 75.8 |
| visual-question-answering-on-msvd-qa-1 | VIOLETv2 | Accuracy: 0.547 |
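
For readers unfamiliar with the retrieval metric, text-to-video R@K is the percentage of text queries whose correct video appears among the top K ranked results. The sketch below shows one common way to compute it; the function name `recall_at_k`, the toy similarity scores, and the convention that query i's ground-truth video is video i are assumptions for illustration, not details taken from the paper or the benchmark code.

```python
def recall_at_k(similarity, k):
    """Percentage of queries whose ground-truth video ranks in the top k.

    similarity[i][j] is the score between text query i and video j.
    Assumption: the ground-truth video for query i is video i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank: how many videos score strictly above the ground truth.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy scores for 3 text queries against 3 videos (made-up numbers).
similarity = [
    [0.9, 0.1, 0.3],  # query 0: correct video ranked first
    [0.2, 0.4, 0.8],  # query 1: correct video ranked second
    [0.5, 0.6, 0.7],  # query 2: correct video ranked first
]
r1 = recall_at_k(similarity, 1)
r2 = recall_at_k(similarity, 2)
print(f"R@1 = {r1:.1f}, R@2 = {r2:.1f}")
```

In this toy case two of the three queries retrieve the correct video at rank one, and all three do within the top two.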
