ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training

Bin Shan; Weichong Yin; Yu Sun; Hao Tian; Hua Wu; Haifeng Wang

Abstract

Recent Vision-Language Pre-trained (VLP) models based on dual encoders have attracted extensive attention from academia and industry due to their superior performance on various cross-modal tasks and their high computational efficiency. They attempt to learn cross-modal representations using contrastive learning on image-text pairs; however, the inter-modal correlations they build rely on only a single view of each modality. In fact, an image or a text contains many potential views, just as humans can capture a real-world scene through diverse descriptions or photos. In this paper, we propose ERNIE-ViL 2.0, a multi-view contrastive learning framework that builds intra-modal and inter-modal correlations between diverse views simultaneously, aiming to learn a more robust cross-modal representation. Specifically, we construct multiple views within each modality to learn intra-modal correlations and thereby enhance the single-modal representations. Besides the inherent visual/textual views, we construct sequences of object tags as a special textual view to narrow the cross-modal semantic gap on noisy image-text pairs. Pre-trained on 29M publicly available image-text pairs, ERNIE-ViL 2.0 achieves competitive results on English cross-modal retrieval. Additionally, to generalize our method to Chinese cross-modal tasks, we scale the pre-training data up to 1.5B Chinese image-text pairs, yielding significant improvements over previous SOTA results on Chinese cross-modal retrieval. We release our pre-trained models at https://github.com/PaddlePaddle/ERNIE.
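
To make the multi-view objective concrete, the sketch below shows one common way to combine intra-modal and inter-modal contrastive terms over two views per modality, using CLIP-style InfoNCE losses. This is only an illustration of the general idea, not the paper's actual implementation; the function and variable names (info_nce, multi_view_loss, img_v1, txt_v2, etc.) are hypothetical placeholders.

    import torch
    import torch.nn.functional as F

    def info_nce(a, b, temperature=0.07):
        # Symmetric InfoNCE between two batches of paired embeddings.
        # a, b: (N, D) tensors where row i of a is paired with row i of b.
        a = F.normalize(a, dim=-1)
        b = F.normalize(b, dim=-1)
        logits = a @ b.t() / temperature                 # (N, N) similarities
        targets = torch.arange(a.size(0), device=a.device)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    def multi_view_loss(img_v1, img_v2, txt_v1, txt_v2):
        # img_v1, img_v2: embeddings of two views of the same images
        # txt_v1, txt_v2: embeddings of two views of the same texts
        #   (e.g., the caption and a sequence of object tags, as in the paper)
        # Intra-modal terms tie together views within one modality.
        intra = info_nce(img_v1, img_v2) + info_nce(txt_v1, txt_v2)
        # Inter-modal terms tie together all image-view/text-view combinations.
        inter = (info_nce(img_v1, txt_v1) + info_nce(img_v1, txt_v2) +
                 info_nce(img_v2, txt_v1) + info_nce(img_v2, txt_v2))
        return intra + inter

The exact weighting of the terms and the construction of each view (image augmentations, caption variants, object-tag sequences) follow the paper; this sketch assumes equal weights for brevity.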

Code Repositories

PaddlePaddle/ERNIE (official; framework: PaddlePaddle)

Benchmarks

Benchmark: cross-modal-retrieval-on-coco-2014 (ERNIE-ViL 2.0)
  Image-to-text: R@1 77.4 | R@5 93.6 | R@10 97.1
  Text-to-image: R@1 59.5 | R@5 83.4 | R@10 90.1

Benchmark: cross-modal-retrieval-on-flickr30k (ERNIE-ViL 2.0)
  Image-to-text: R@1 97.2 | R@5 100.0 | R@10 100.0
  Text-to-image: R@1 93.3 | R@5 99.4 | R@10 99.8

Benchmark: image-retrieval-on-aic-icc (ERNIE-ViL 2.0)
  Recall: R@1 19.0 | R@5 35.3 | R@10 43.5

Benchmark: image-to-text-retrieval-on-aic-icc (ERNIE-ViL 2.0)
  Recall: R@1 33.7 | R@5 52.1 | R@10 60.0

Benchmark: image-to-text-retrieval-on-flickr30k (ERNIE-ViL 2.0)
  Recall: R@1 96.1 | R@5 99.9 | R@10 100.0

Benchmark: zero-shot-cross-modal-retrieval-on-coco-2014 (ERNIE-ViL 2.0)
  Image-to-text: R@1 63.1 | R@5 85.7 | R@10 91.4
  Text-to-image: R@1 46.0 | R@5 71.4 | R@10 80.4

Benchmark: zero-shot-cross-modal-retrieval-on-flickr30k (ERNIE-ViL 2.0)
  Image-to-text: R@1 91.2 | R@5 99.1 | R@10 99.8
  Text-to-image: R@1 77.4 | R@5 93.8 | R@10 96.4
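
For reference, the R@K (Recall at K) metric above is the fraction of queries whose ground-truth match appears among the top-K retrieved items. The sketch below shows how image-to-text R@K can be computed from embedding matrices; it is a hypothetical helper, not tied to the ERNIE codebase, and it assumes a simplified one-to-one pairing of images and captions (COCO and Flickr30K actually provide several captions per image).

    import torch
    import torch.nn.functional as F

    def recall_at_k(img_emb, txt_emb, k=1):
        # Image-to-text retrieval: for each image, rank all texts by cosine
        # similarity and check whether its paired text is in the top-k.
        # img_emb, txt_emb: (N, D) tensors; row i of each is a matched pair.
        img = F.normalize(img_emb, dim=-1)
        txt = F.normalize(txt_emb, dim=-1)
        sims = img @ txt.t()                        # (N, N) similarity matrix
        topk = sims.topk(k, dim=-1).indices         # top-k text ids per image
        targets = torch.arange(img.size(0), device=img.device).unsqueeze(-1)
        hits = (topk == targets).any(dim=-1)        # hit if paired text ranked
        return hits.float().mean().item()

Text-to-image R@K is computed the same way with the roles of the two embedding matrices swapped.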
