HyperAI

Unified Vision-Language Pre-Training for Image Captioning and VQA

Luowei Zhou Hamid Palangi Lei Zhang Houdong Hu Jason J. Corso Jianfeng Gao

Abstract

This paper presents a unified Vision-Language Pre-training (VLP) model. The model is unified in that (1) it can be fine-tuned for either vision-language generation (e.g., image captioning) or understanding (e.g., visual question answering) tasks, and (2) it uses a shared multi-layer transformer network for both encoding and decoding, which differs from many existing methods where the encoder and decoder are implemented as separate models. The unified VLP model is pre-trained on a large number of image-text pairs using the unsupervised learning objectives of two tasks: bidirectional and sequence-to-sequence (seq2seq) masked vision-language prediction. The two tasks differ solely in what context the prediction conditions on, which is controlled by applying task-specific self-attention masks to the shared transformer network. To the best of our knowledge, VLP is the first reported model that achieves state-of-the-art results on both vision-language generation and understanding tasks, as disparate as image captioning and visual question answering, across three challenging benchmark datasets: COCO Captions, Flickr30k Captions, and VQA 2.0. The code and the pre-trained models are available at https://github.com/LuoweiZhou/VLP.
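The key mechanism described above, one shared transformer serving both objectives by swapping the self-attention mask, can be sketched as follows. This is a hypothetical illustration of the idea, not the authors' code; the function name and the exact treatment of image regions are assumptions.

```python
import numpy as np

def vlp_attention_mask(num_img: int, num_txt: int, mode: str) -> np.ndarray:
    """Build an (N, N) self-attention mask for a shared transformer over a
    sequence of `num_img` image-region tokens followed by `num_txt` text
    tokens. Entry (i, j) = 1 means position i may attend to position j.
    """
    n = num_img + num_txt
    mask = np.zeros((n, n), dtype=np.int64)
    if mode == "bidirectional":
        # Understanding objective: every position attends to every other.
        mask[:, :] = 1
    elif mode == "seq2seq":
        # Image regions attend only among themselves (never to future text).
        mask[:num_img, :num_img] = 1
        # Text tokens attend to all image regions...
        mask[num_img:, :num_img] = 1
        # ...but only to current and earlier text tokens (causal part),
        # which is what makes left-to-right caption generation possible.
        mask[num_img:, num_img:] = np.tril(
            np.ones((num_txt, num_txt), dtype=np.int64)
        )
    else:
        raise ValueError(f"unknown mode: {mode}")
    return mask
```

With this layout, pre-training can alternate between the two objectives on the same network weights by switching `mode`, so no separate encoder and decoder are needed.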

Code Repositories

- LuoweiZhou/VLP (official implementation, PyTorch)
- WebQnA/WebQA_Baseline (PyTorch, mentioned in GitHub)
- rmokady/clip_prefix_caption (PyTorch, mentioned in GitHub)

Benchmarks

Image captioning on COCO Captions (test), Unified VLP:
  BLEU-4: 36.5, CIDEr: 116.9, METEOR: 28.4, SPICE: 21.2

Image captioning on Flickr30k Captions (test), Unified VLP:
  BLEU-4: 30.1, CIDEr: 67.4, METEOR: 23.0, SPICE: 17.0

Visual question answering on VQA 2.0 (test-std), Unified VLP:
  Overall: 70.7
