Align and Prompt: Video-and-Language Pre-training with Entity Prompts

Dongxu Li Junnan Li Hongdong Li Juan Carlos Niebles Steven C.H. Hoi

Abstract

Video-and-language pre-training has shown promising improvements on various downstream tasks. Most previous methods capture cross-modal interactions with a transformer-based multimodal encoder, not fully addressing the misalignment between unimodal video and text features. Besides, learning fine-grained visual-language alignment usually requires off-the-shelf object detectors to provide object information, which is bottlenecked by the detector's limited vocabulary and expensive computation cost. We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment. First, we introduce a video-text contrastive (VTC) loss to align unimodal video-text features at the instance level, which eases the modeling of cross-modal interactions. Then, we propose a new visually-grounded pre-training task, prompting entity modeling (PEM), which aims to learn fine-grained region-entity alignment. To achieve this, we first introduce an entity prompter module, which is trained with VTC to produce the similarity between a video crop and text prompts instantiated with entity names. The PEM task then asks the model to predict the entity pseudo-labels (i.e., normalized similarity scores) for randomly selected video crops. The resulting pre-trained model achieves state-of-the-art performance on both text-video retrieval and videoQA, outperforming prior work by a substantial margin. Our code and pre-trained models are available at https://github.com/salesforce/ALPRO.
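
The two objectives named in the abstract can be illustrated with a minimal NumPy sketch. It is not the code from the official repository: the function names, the temperature value, the averaging of the two contrastive directions, the softmax normalization of the similarity scores, and the cross-entropy form of the PEM loss are all assumptions made for illustration, and the random arrays stand in for the outputs of the video and text encoders.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale vectors to unit length along the last axis.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits, axis=-1):
    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def vtc_loss(video_emb, text_emb, temperature=0.07):
    """Instance-level contrastive loss over B matched (video, text) pairs.

    Row i of video_emb is assumed to match row i of text_emb.
    The temperature and the 0.5 weighting are illustrative assumptions.
    """
    sim = l2_normalize(video_emb) @ l2_normalize(text_emb).T / temperature
    idx = np.arange(sim.shape[0])
    video_to_text = -log_softmax(sim, axis=1)[idx, idx].mean()
    text_to_video = -log_softmax(sim, axis=0)[idx, idx].mean()
    return 0.5 * (video_to_text + text_to_video)

def entity_pseudo_labels(crop_emb, prompt_emb, temperature=0.07):
    """Normalized similarity between one video crop (D,) and K entity prompts (K, D).

    Softmax is assumed here as the normalization.
    """
    sim = l2_normalize(prompt_emb) @ l2_normalize(crop_emb) / temperature
    return np.exp(log_softmax(sim))

def pem_loss(pred_logits, pseudo_labels):
    """Cross-entropy between soft pseudo-labels and the predicted entity distribution (assumed form)."""
    return -(pseudo_labels * log_softmax(pred_logits)).sum()

# Toy usage with random stand-in embeddings.
rng = np.random.default_rng(0)
video = rng.normal(size=(4, 16))
text = video + 0.01 * rng.normal(size=(4, 16))   # nearly aligned pairs
print("VTC loss (aligned pairs):", vtc_loss(video, text))
print("VTC loss (mismatched pairs):", vtc_loss(video, text[::-1]))

prompts = rng.normal(size=(5, 16))               # 5 hypothetical entity prompts
labels = entity_pseudo_labels(video[0], prompts)
print("Pseudo-labels:", labels)
print("PEM loss for a uniform prediction:", pem_loss(np.zeros(5), labels))
```

In this sketch the contrastive loss is low when matched pairs are more similar than mismatched ones, and the pseudo-labels form a probability distribution over the entity prompts that the model is then asked to reproduce for the selected crop.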

Code Repositories

salesforce/alpro (official, PyTorch, mentioned in GitHub)

Benchmarks

Benchmark                                 Methodology  Metrics
video-retrieval-on-didemo                 ALPRO        text-to-video R@1: 35.9; R@5: 67.5; R@10: 78.8; Median Rank: 3
visual-question-answering-on-msrvtt-qa-1  ALPRO        Accuracy: 0.421
visual-question-answering-on-msvd-qa-1    ALPRO        Accuracy: 0.459
zero-shot-video-retrieval-on-didemo       ALPRO        text-to-video R@1: 23.8; R@5: 47.3; R@10: 57.9; Median Rank: 6
zero-shot-video-retrieval-on-msr-vtt      ALPRO        text-to-video R@1: 24.1; R@5: 44.7; R@10: 55.4; Median Rank: 8
