ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data
Di Qi; Lin Su; Jia Song; Edward Cui; Taroon Bharti; Arun Sacheti

Abstract
In this paper, we introduce a new vision-language pre-trained model -- ImageBERT -- for image-text joint embedding. Our model is Transformer-based: it takes different modalities as input and models the relationships between them. The model is pre-trained on four tasks simultaneously: Masked Language Modeling (MLM), Masked Object Classification (MOC), Masked Region Feature Regression (MRFR), and Image-Text Matching (ITM). To further enhance the pre-training quality, we have collected a Large-scale weAk-supervised Image-Text (LAIT) dataset from the Web. We first pre-train the model on this dataset, then conduct a second-stage pre-training on Conceptual Captions and SBU Captions. Our experiments show that this multi-stage pre-training strategy outperforms single-stage pre-training. We also fine-tune and evaluate our pre-trained ImageBERT model on image retrieval and text retrieval tasks, and achieve new state-of-the-art results on both the MSCOCO and Flickr30k datasets.
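The sketch below illustrates how the four pre-training objectives described in the abstract could be combined into a single joint loss. It is a minimal, hypothetical PyTorch rendering, not the paper's implementation: the hidden size, vocabulary size, number of object classes, region-feature dimension, and the convention of marking unmasked positions with label -100 are all assumptions for illustration.

```python
# Hypothetical sketch of ImageBERT-style joint pre-training heads and loss.
# All dimensions below are placeholders, not the values used in the paper.
import torch
import torch.nn as nn

class PretrainHeads(nn.Module):
    def __init__(self, hidden=768, vocab_size=30522,
                 num_obj_classes=1601, region_feat_dim=2048):
        super().__init__()
        self.mlm_head = nn.Linear(hidden, vocab_size)        # Masked Language Modeling
        self.moc_head = nn.Linear(hidden, num_obj_classes)   # Masked Object Classification
        self.mrfr_head = nn.Linear(hidden, region_feat_dim)  # Masked Region Feature Regression
        self.itm_head = nn.Linear(hidden, 2)                 # Image-Text Matching (match / no match)

    def forward(self, text_states, region_states, cls_state):
        return (self.mlm_head(text_states),
                self.moc_head(region_states),
                self.mrfr_head(region_states),
                self.itm_head(cls_state))

def pretrain_loss(heads, text_states, region_states, cls_state,
                  mlm_labels, moc_labels, mrfr_targets, itm_labels):
    """Sum the four task losses; positions not masked carry label -100 (assumed convention)."""
    mlm_logits, moc_logits, mrfr_pred, itm_logits = heads(text_states, region_states, cls_state)
    ce = nn.CrossEntropyLoss(ignore_index=-100)
    loss_mlm = ce(mlm_logits.view(-1, mlm_logits.size(-1)), mlm_labels.view(-1))
    loss_moc = ce(moc_logits.view(-1, moc_logits.size(-1)), moc_labels.view(-1))
    # Regress region features only at masked regions (here inferred from the MOC labels).
    mask = (moc_labels != -100).unsqueeze(-1).float()
    loss_mrfr = ((mrfr_pred - mrfr_targets) ** 2 * mask).sum() / mask.sum().clamp(min=1.0)
    loss_itm = ce(itm_logits, itm_labels)
    return loss_mlm + loss_moc + loss_mrfr + loss_itm
```

In this reading, the text and region hidden states come from a shared Transformer over the concatenated text tokens and image-region features, and the [CLS]-style state feeds the ITM classifier; how the losses are weighted is not specified here and is left as an equal-weight sum.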
Benchmarks
| Benchmark | Methodology | Image-to-text R@1 | R@5 | R@10 | Text-to-image R@1 | R@5 | R@10 |
|---|---|---|---|---|---|---|---|
| zero-shot-cross-modal-retrieval-on-coco-2014 | ImageBERT | 44.0 | 71.2 | 80.4 | 32.3 | 59.0 | 70.2 |
| zero-shot-cross-modal-retrieval-on-flickr30k | ImageBERT | 70.7 | 90.2 | 94.0 | 54.3 | 79.6 | 87.5 |
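For reference, the Recall@K numbers above can be computed from a query-gallery similarity matrix as in the sketch below. This is an illustrative simplification that assumes one ground-truth match per query (COCO and Flickr30k actually pair each image with multiple captions), and the random similarity matrix merely stands in for model-produced scores.

```python
# Illustrative Recall@K for cross-modal retrieval (assumes one true match per query).
import torch

def recall_at_k(similarity: torch.Tensor, k: int) -> float:
    """similarity[i, j] = score of query i against gallery item j;
    the ground-truth match for query i is assumed to be gallery item i."""
    topk = similarity.topk(k, dim=1).indices                  # top-k gallery indices per query
    targets = torch.arange(similarity.size(0)).unsqueeze(1)
    hits = (topk == targets).any(dim=1).float()
    return hits.mean().item()

sim = torch.randn(1000, 1000)                                 # placeholder image-to-text scores
for k in (1, 5, 10):
    print(f"R@{k}: {recall_at_k(sim, k):.3f}")
```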