ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training
Bin Shan; Weichong Yin; Yu Sun; Hao Tian; Hua Wu; Haifeng Wang

Abstract
Recent Vision-Language Pre-trained (VLP) models based on the dual-encoder architecture have attracted extensive attention from academia and industry due to their superior performance on various cross-modal tasks and their high computational efficiency. They attempt to learn cross-modal representations using contrastive learning on image-text pairs; however, the inter-modal correlations they build rely on only a single view of each modality. In reality, an image or a text contains many potential views, just as humans can capture a real-world scene through diverse descriptions or photos. In this paper, we propose ERNIE-ViL 2.0, a Multi-View Contrastive learning framework that builds intra-modal and inter-modal correlations between diverse views simultaneously, aiming to learn a more robust cross-modal representation. Specifically, we construct multiple views within each modality to learn the intra-modal correlations that enhance the single-modal representations. Besides the inherent visual/textual views, we construct sequences of object tags as a special textual view to narrow the cross-modal semantic gap on noisy image-text pairs. Pre-trained on 29M publicly available image-text pairs, ERNIE-ViL 2.0 achieves competitive results on English cross-modal retrieval. Additionally, to generalize our method to Chinese cross-modal tasks, we train ERNIE-ViL 2.0 by scaling the pre-training data up to 1.5B Chinese image-text pairs, yielding significant improvements over previous SOTA results on Chinese cross-modal retrieval. We release our pre-trained models at https://github.com/PaddlePaddle/ERNIE.
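The sketch below illustrates the multi-view contrastive objective described in the abstract under stated assumptions: each modality contributes several views (e.g. two augmented image views; the caption and an object-tag sequence as textual views), and an InfoNCE-style loss is applied to every pair of views so that both intra-modal and inter-modal correlations are learned. Function and variable names here are illustrative assumptions, not the released ERNIE-ViL 2.0 API.

```python
# Minimal sketch of a multi-view contrastive loss, assuming each view has
# already been encoded into an L2-normalized (B, D) embedding tensor.
# This is an illustration of the idea, not the authors' implementation.
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss between two batches of normalized embeddings."""
    logits = a @ b.t() / temperature                   # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device) # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def multi_view_loss(image_views, text_views, temperature: float = 0.07):
    """Sum InfoNCE over all distinct pairs of views.

    image_views / text_views: lists of (B, D) embeddings, e.g.
    [image_view_1, image_view_2] and [caption, object_tags].
    Pairs within one modality give the intra-modal terms;
    image-text pairs give the inter-modal terms.
    """
    views = image_views + text_views
    loss = 0.0
    for i in range(len(views)):
        for j in range(i + 1, len(views)):
            loss = loss + info_nce(views[i], views[j], temperature)
    return loss

# Toy usage with random embeddings standing in for encoder outputs.
B, D = 8, 256
img_v1, img_v2 = (F.normalize(torch.randn(B, D), dim=-1) for _ in range(2))
caption, object_tags = (F.normalize(torch.randn(B, D), dim=-1) for _ in range(2))
print(multi_view_loss([img_v1, img_v2], [caption, object_tags]).item())
```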
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| cross-modal-retrieval-on-coco-2014 | ERNIE-ViL 2.0 | Image-to-text R@1: 77.4, R@5: 93.6, R@10: 97.1; Text-to-image R@1: 59.5, R@5: 83.4, R@10: 90.1 |
| cross-modal-retrieval-on-flickr30k | ERNIE-ViL 2.0 | Image-to-text R@1: 97.2, R@5: 100.0, R@10: 100.0; Text-to-image R@1: 93.3, R@5: 99.4, R@10: 99.8 |
| image-retrieval-on-aic-icc | ERNIE-ViL 2.0 | Recall@1: 19.0, Recall@5: 35.3, Recall@10: 43.5 |
| image-to-text-retrieval-on-aic-icc | ERNIE-ViL 2.0 | Recall@1: 33.7, Recall@5: 52.1, Recall@10: 60.0 |
| image-to-text-retrieval-on-flickr30k | ERNIE-ViL 2.0 | Recall@1: 96.1, Recall@5: 99.9, Recall@10: 100.0 |
| zero-shot-cross-modal-retrieval-on-coco-2014 | ERNIE-ViL 2.0 | Image-to-text R@1: 63.1, R@5: 85.7, R@10: 91.4; Text-to-image R@1: 46.0, R@5: 71.4, R@10: 80.4 |
| zero-shot-cross-modal-retrieval-on-flickr30k | ERNIE-ViL 2.0 | Image-to-text R@1: 91.2, R@5: 99.1, R@10: 99.8; Text-to-image R@1: 77.4, R@5: 93.8, R@10: 96.4 |
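The R@K figures in the table are recall@K: the fraction of queries whose ground-truth match is ranked among the top K candidates by embedding similarity. The snippet below is a minimal sketch of how such a metric can be computed for a dual encoder, assuming one aligned text per image; it is not the evaluation code used for ERNIE-ViL 2.0, and all names are illustrative.

```python
# Sketch of image-to-text recall@K from precomputed, L2-normalized embeddings.
import numpy as np

def recall_at_k(image_emb: np.ndarray, text_emb: np.ndarray, k: int) -> float:
    """Image-to-text recall@k for N aligned (image_i, text_i) pairs."""
    sims = image_emb @ text_emb.T                      # (N, N) cosine similarities
    topk = np.argsort(-sims, axis=1)[:, :k]            # indices of k best texts per image
    hits = (topk == np.arange(len(sims))[:, None]).any(axis=1)
    return float(hits.mean())

# Toy usage with random unit vectors standing in for encoder outputs.
rng = np.random.default_rng(0)
img = rng.normal(size=(100, 64))
img /= np.linalg.norm(img, axis=1, keepdims=True)
txt = img + 0.1 * rng.normal(size=img.shape)           # noisy "matching" texts
txt /= np.linalg.norm(txt, axis=1, keepdims=True)
for k in (1, 5, 10):
    print(f"Image-to-text R@{k}: {recall_at_k(img, txt, k):.3f}")
```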