ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data

Di Qi, Lin Su, Jia Song, Edward Cui, Taroon Bharti, Arun Sacheti


Abstract

In this paper, we introduce a new vision-language pre-trained model -- ImageBERT -- for image-text joint embedding. Our model is a Transformer-based model that takes different modalities as input and models the relationship between them. The model is pre-trained on four tasks simultaneously: Masked Language Modeling (MLM), Masked Object Classification (MOC), Masked Region Feature Regression (MRFR), and Image-Text Matching (ITM). To further enhance pre-training quality, we have collected a Large-scale weAk-supervised Image-Text (LAIT) dataset from the Web. We first pre-train the model on this dataset, then conduct a second-stage pre-training on Conceptual Captions and SBU Captions. Our experiments show that the multi-stage pre-training strategy outperforms single-stage pre-training. We also fine-tune and evaluate our pre-trained ImageBERT model on image retrieval and text retrieval tasks, and achieve new state-of-the-art results on both the MSCOCO and Flickr30k datasets.
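The abstract describes four pre-training objectives optimized simultaneously on top of a shared Transformer. Below is a minimal sketch of how such a multi-task objective could be wired up, assuming a PyTorch-style implementation; the class and function names, the linear prediction heads, and the equal weighting of the four losses are illustrative assumptions, not the authors' released code.

```python
# Illustrative sketch (PyTorch) of combining the four ImageBERT-style pre-training losses.
# All names and the equal loss weighting are assumptions for exposition.
import torch
import torch.nn as nn

class PretrainingHeads(nn.Module):
    def __init__(self, hidden_size, vocab_size, num_object_classes, region_feat_dim):
        super().__init__()
        self.mlm_head = nn.Linear(hidden_size, vocab_size)          # Masked Language Modeling
        self.moc_head = nn.Linear(hidden_size, num_object_classes)  # Masked Object Classification
        self.mrfr_head = nn.Linear(hidden_size, region_feat_dim)    # Masked Region Feature Regression
        self.itm_head = nn.Linear(hidden_size, 2)                   # Image-Text Matching (match / no match)

    def forward(self, text_hidden, region_hidden, pooled):
        # text_hidden: [B, T, H], region_hidden: [B, R, H], pooled: [B, H]
        return (self.mlm_head(text_hidden),
                self.moc_head(region_hidden),
                self.mrfr_head(region_hidden),
                self.itm_head(pooled))

def pretraining_loss(mlm_logits, mlm_labels,
                     moc_logits, moc_labels,
                     mrfr_pred, region_targets, region_mask,
                     itm_logits, itm_labels):
    """Sum of the four task losses; label value -100 marks unmasked positions to skip."""
    ce = nn.CrossEntropyLoss(ignore_index=-100)
    mlm = ce(mlm_logits.view(-1, mlm_logits.size(-1)), mlm_labels.view(-1))
    moc = ce(moc_logits.view(-1, moc_logits.size(-1)), moc_labels.view(-1))
    # Regress masked region features with an L2 loss, averaged over masked regions only.
    mrfr = ((mrfr_pred - region_targets) ** 2 * region_mask.unsqueeze(-1)).sum() \
           / region_mask.sum().clamp(min=1)
    itm = ce(itm_logits, itm_labels)
    return mlm + moc + mrfr + itm
```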

Benchmarks

Benchmark: zero-shot-cross-modal-retrieval-on-coco-2014 (Methodology: ImageBERT)
  Image-to-text  R@1: 44.0 | R@5: 71.2 | R@10: 80.4
  Text-to-image  R@1: 32.3 | R@5: 59.0 | R@10: 70.2

Benchmark: zero-shot-cross-modal-retrieval-on-flickr30k (Methodology: ImageBERT)
  Image-to-text  R@1: 70.7 | R@5: 90.2 | R@10: 94.0
  Text-to-image  R@1: 54.3 | R@5: 79.6 | R@10: 87.5
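The R@K figures above are recall-at-K scores. As a point of reference, the sketch below shows how recall@K is commonly computed from an image-text similarity matrix; it assumes a single ground-truth match per query, whereas MSCOCO and Flickr30k pair each image with multiple captions (a hit then counts if any ground-truth caption lands in the top K). This is an illustrative example, not the evaluation code behind the reported numbers.

```python
# Illustrative recall@K computation for cross-modal retrieval from a similarity matrix.
import numpy as np

def recall_at_k(sim, k):
    """sim[i, j] = similarity of query i to candidate j; the correct candidate
    for query i is assumed to sit at index i (one ground-truth match per query)."""
    ranks = np.argsort(-sim, axis=1)  # candidates sorted by descending similarity
    hits = (ranks[:, :k] == np.arange(sim.shape[0])[:, None]).any(axis=1)
    return hits.mean()

# Image-to-text retrieval scores rows of sim (images as queries);
# text-to-image retrieval scores sim.T (captions as queries).
rng = np.random.default_rng(0)
sim = rng.standard_normal((1000, 1000))  # random similarities, for demonstration only
print(recall_at_k(sim, 1), recall_at_k(sim, 5), recall_at_k(sim, 10))
```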
