UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation

Huaishao Luo, Lei Ji, Botian Shi, Haoyang Huang, Nan Duan, Tianrui Li, Jason Li, Taroon Bharti, Ming Zhou

Abstract

Following the recent success of pre-training for NLP and image-linguistic tasks, several video-linguistic pre-training approaches have been developed to improve video-text downstream tasks. However, most existing multimodal models are pre-trained only for understanding tasks, which leads to a pretrain-finetune discrepancy on generation tasks. This paper proposes UniVL, a Unified Video and Language pre-training model for both multimodal understanding and generation. It comprises four components: two single-modal encoders, a cross encoder, and a decoder with the Transformer backbone. Five objectives are designed to train these components: video-text joint, conditioned masked language model (CMLM), conditioned masked frame model (CMFM), video-text alignment, and language reconstruction. We further develop two pre-training strategies, stage-by-stage pre-training (StagedP) and enhanced video representation (EnhancedV), to make the training of UniVL more effective. Pre-training is carried out on HowTo100M, a large instructional video dataset. Experimental results demonstrate that UniVL learns strong video-text representations and achieves state-of-the-art results on five downstream tasks.
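
As a rough illustration of the masked language model idea named in the abstract, the sketch below shows only the text-corruption step: some caption tokens are replaced by a placeholder, and the originals are kept as prediction targets. The abstract does not give the masking rate, the placeholder token, or how the video conditioning is wired in, so `MASK_TOKEN`, `mask_prob`, and the function name `mask_tokens` are all assumptions made for this example (the default rate is borrowed from common BERT-style practice). In a conditioned setting the model would predict the masked tokens from both the remaining text and the video features; that part is not shown.

```python
import random

MASK_TOKEN = "[MASK]"  # assumed BERT-style placeholder, not taken from the abstract


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels) for a list of text tokens.

    labels[i] holds the original token where position i was masked and
    None elsewhere, so a loss would only be computed on masked positions.
    mask_prob=0.15 is an assumed rate, not a value stated in the abstract.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)  # hide the token from the model
            labels.append(tok)         # keep the original as the target
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


# Hypothetical caption from an instructional video.
caption = "add the chopped onions to the pan and stir".split()
masked_caption, caption_labels = mask_tokens(caption, mask_prob=0.3, seed=0)
```

The conditioned masked frame model (CMFM) can be read as the analogous operation on video frame features, with text as the conditioning signal; the abstract does not specify its details.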

Code Repositories

wqliu657/UniVL (PyTorch)
microsoft/UniVL (official, PyTorch)

Benchmarks

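| Benchmark | Method | Metric | Value |
| --- | --- | --- | --- |
| action-segmentation-on-coin | UniVL | Frame accuracy | 70.0 |
| video-captioning-on-youcook2 | UniVL | BLEU-3 | 23.87 |
| video-captioning-on-youcook2 | UniVL | BLEU-4 | 17.35 |
| video-captioning-on-youcook2 | UniVL | CIDEr | 1.81 |
| video-captioning-on-youcook2 | UniVL | METEOR | 22.35 |
| video-captioning-on-youcook2 | UniVL | ROUGE-L | 46.52 |
| video-retrieval-on-msr-vtt | UniVL | text-to-video R@1 | 21.2 |
| video-retrieval-on-msr-vtt | UniVL | text-to-video R@5 | 49.6 |
| video-retrieval-on-msr-vtt | UniVL | text-to-video R@10 | 63.1 |
| video-retrieval-on-msr-vtt | UniVL | text-to-video Median Rank | 6 |
| video-retrieval-on-youcook2 | UniVL | text-to-video R@1 | 28.9 |
| video-retrieval-on-youcook2 | UniVL | text-to-video R@5 | 57.6 |
| video-retrieval-on-youcook2 | UniVL | text-to-video R@10 | 70.0 |
| video-retrieval-on-youcook2 | UniVL | text-to-video Median Rank | 4 |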
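
For readers unfamiliar with the retrieval metrics listed above, the following is a minimal sketch of how R@K and Median Rank are commonly computed for text-to-video retrieval. It is not the evaluation code behind the reported numbers. It assumes a square similarity matrix in which the correct video for text query `i` is video `i`, reports R@K as a percentage, and breaks ties in favor of the ground truth; the function name `retrieval_metrics` and the toy scores are made up for illustration.

```python
from statistics import median


def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute R@K (in percent) and Median Rank from a similarity matrix.

    sim[i][j] is the score between text query i and video j.
    Assumption: the ground-truth video for query i is video i.
    """
    ranks = []
    for i, row in enumerate(sim):
        # Rank of the ground truth = 1 + number of videos scored strictly higher.
        rank = 1 + sum(1 for score in row if score > row[i])
        ranks.append(rank)
    n = len(ranks)
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / n for k in ks}
    metrics["Median Rank"] = median(ranks)
    return metrics


# Toy scores for 3 text queries against 3 videos (made-up numbers).
toy_sim = [
    [0.9, 0.1, 0.3],  # query 0: correct video ranked 1st
    [0.8, 0.2, 0.5],  # query 1: correct video ranked 3rd
    [0.1, 0.4, 0.7],  # query 2: correct video ranked 1st
]
toy_metrics = retrieval_metrics(toy_sim, ks=(1, 2))
```

Higher R@K is better, while a lower Median Rank is better, which is why the two are usually reported together.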
