6 months ago

Abstract

With the recent success of the pre-training technique for NLP and image-linguistic tasks, some video-linguistic pre-training works are gradually developed to improve video-text related downstream tasks. However, most of the existing multimodal models are pre-trained for understanding tasks, leading to a pretrain-finetune discrepancy for generation tasks. This paper proposes UniVL: a Unified Video and Language pre-training model for both multimodal understanding and generation. It comprises four components, including two single-modal encoders, a cross encoder, and a decoder with the Transformer backbone. Five objectives, including video-text joint, conditioned masked language model (CMLM), conditioned masked frame model (CMFM), video-text alignment, and language reconstruction, are designed to train each of the components. We further develop two pre-training strategies, stage by stage pre-training (StagedP) and enhanced video representation (EnhancedV), to make the training process of the UniVL more effective. The pre-train is carried out on a sizeable instructional video dataset HowTo100M. Experimental results demonstrate that the UniVL can learn strong video-text representation and achieves state-of-the-art results on five downstream tasks.

Source PDF

Build AI with AI

From idea to launch — accelerate your AI development with free AI co-coding, out-of-the-box environment and best price of GPUs.

AI Co-coding

Ready-to-use GPUs

Best Pricing

Get Started View Pricing

HyperAI Newsletters

Subscribe to our latest updates

We will deliver the latest updates of the week to your inbox at nine o'clock every Monday morning

HyperAI

6 months ago

Multimodal Representation

Huaishao Luo Lei Ji Botian Shi Haoyang Huang Nan Duan Tianrui Li Jason Li Taroon Bharti Ming Zhou

Abstract

Source PDF

Build AI with AI

From idea to launch — accelerate your AI development with free AI co-coding, out-of-the-box environment and best price of GPUs.

AI Co-coding

Ready-to-use GPUs

Best Pricing

Get Started View Pricing

HyperAI Newsletters

Subscribe to our latest updates

We will deliver the latest updates of the week to your inbox at nine o'clock every Monday morning

HyperAI

6 months ago

Multimodal Representation

Huaishao Luo Lei Ji Botian Shi Haoyang Huang Nan Duan Tianrui Li Jason Li Taroon Bharti Ming Zhou

Abstract

Source PDF

Build AI with AI

From idea to launch — accelerate your AI development with free AI co-coding, out-of-the-box environment and best price of GPUs.

AI Co-coding

Ready-to-use GPUs

Best Pricing

Get Started View Pricing

HyperAI Newsletters

Subscribe to our latest updates

We will deliver the latest updates of the week to your inbox at nine o'clock every Monday morning

UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation | Papers | HyperAI

Command Palette

UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation

Huaishao Luo Lei Ji Botian Shi Haoyang Huang Nan Duan Tianrui Li Jason Li Taroon Bharti Ming Zhou

Abstract

Build AI with AI

HyperAI Newsletters

Command Palette

UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation

Huaishao Luo Lei Ji Botian Shi Haoyang Huang Nan Duan Tianrui Li Jason Li Taroon Bharti Ming Zhou

Abstract

Build AI with AI

HyperAI Newsletters

Command Palette

UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation

Huaishao Luo Lei Ji Botian Shi Haoyang Huang Nan Duan Tianrui Li Jason Li Taroon Bharti Ming Zhou

Abstract

Build AI with AI

HyperAI Newsletters