ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training
Bin Shan; Weichong Yin; Yu Sun; Hao Tian; Hua Wu; Haifeng Wang

Abstract
Recent Vision-Language Pre-trained (VLP) models based on the dual-encoder architecture have attracted extensive attention from academia and industry due to their superior performance on various cross-modal tasks and their high computational efficiency. They attempt to learn cross-modal representations using contrastive learning on image-text pairs; however, the inter-modal correlations they build rely on only a single view of each modality. In reality, an image or a text contains many potential views, just as humans can capture a real-world scene through diverse descriptions or photos. In this paper, we propose ERNIE-ViL 2.0, a Multi-View Contrastive learning framework that builds intra-modal and inter-modal correlations between diverse views simultaneously, aiming to learn a more robust cross-modal representation. Specifically, we construct multiple views within each modality to learn the intra-modal correlations that enhance the single-modal representations. Besides the inherent visual/textual views, we construct sequences of object tags as a special textual view to narrow the cross-modal semantic gap on noisy image-text pairs. Pre-trained on 29M publicly available image-text pairs, ERNIE-ViL 2.0 achieves competitive results on English cross-modal retrieval. Additionally, to generalize our method to Chinese cross-modal tasks, we train ERNIE-ViL 2.0 by scaling the pre-training data up to 1.5B Chinese image-text pairs, yielding significant improvements over previous SOTA results on Chinese cross-modal retrieval. We release our pre-trained models at https://github.com/PaddlePaddle/ERNIE.
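The sketch below illustrates the multi-view contrastive objective described in the abstract under stated assumptions: each modality contributes several views (e.g. two augmented image views; the caption and an object-tag sequence as textual views), and an InfoNCE-style loss is applied to every pair of views so that both intra-modal and inter-modal correlations are learned. Function and variable names here are illustrative assumptions, not the released ERNIE-ViL 2.0 API.

```python
# Minimal sketch of a multi-view contrastive loss, assuming each view has
# already been encoded into an L2-normalized (B, D) embedding tensor.
# This is an illustration of the idea, not the authors' implementation.
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss between two batches of normalized embeddings."""
    logits = a @ b.t() / temperature                   # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device) # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def multi_view_loss(image_views, text_views, temperature: float = 0.07):
    """Sum InfoNCE over all distinct pairs of views.

    image_views / text_views: lists of (B, D) embeddings, e.g.
    [image_view_1, image_view_2] and [caption, object_tags].
    Pairs within one modality give the intra-modal terms;
    image-text pairs give the inter-modal terms.
    """
    views = image_views + text_views
    loss = 0.0
    for i in range(len(views)):
        for j in range(i + 1, len(views)):
            loss = loss + info_nce(views[i], views[j], temperature)
    return loss

# Toy usage with random embeddings standing in for encoder outputs.
B, D = 8, 256
img_v1, img_v2 = (F.normalize(torch.randn(B, D), dim=-1) for _ in range(2))
caption, object_tags = (F.normalize(torch.randn(B, D), dim=-1) for _ in range(2))
print(multi_view_loss([img_v1, img_v2], [caption, object_tags]).item())
```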
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| cross-modal-retrieval-on-coco-2014 | ERNIE-ViL 2.0 | Image-to-text R@1: 77.4, R@5: 93.6, R@10: 97.1; Text-to-image R@1: 59.5, R@5: 83.4, R@10: 90.1 |
| cross-modal-retrieval-on-flickr30k | ERNIE-ViL 2.0 | Image-to-text R@1: 97.2, R@5: 100.0, R@10: 100.0; Text-to-image R@1: 93.3, R@5: 99.4, R@10: 99.8 |
| image-retrieval-on-aic-icc | ERNIE-ViL 2.0 | Recall@1: 19.0, Recall@5: 35.3, Recall@10: 43.5 |
| image-to-text-retrieval-on-aic-icc | ERNIE-ViL 2.0 | Recall@1: 33.7, Recall@5: 52.1, Recall@10: 60.0 |
| image-to-text-retrieval-on-flickr30k | ERNIE-ViL 2.0 | Recall@1: 96.1, Recall@5: 99.9, Recall@10: 100.0 |
| zero-shot-cross-modal-retrieval-on-coco-2014 | ERNIE-ViL 2.0 | Image-to-text R@1: 63.1, R@5: 85.7, R@10: 91.4; Text-to-image R@1: 46.0, R@5: 71.4, R@10: 80.4 |
| zero-shot-cross-modal-retrieval-on-flickr30k | ERNIE-ViL 2.0 | Image-to-text R@1: 91.2, R@5: 99.1, R@10: 99.8; Text-to-image R@1: 77.4, R@5: 93.8, R@10: 96.4 |
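The R@K figures in the table are recall@K: the fraction of queries whose ground-truth match is ranked among the top K candidates by embedding similarity. The snippet below is a minimal sketch of how such a metric can be computed for a dual encoder, assuming one aligned text per image; it is not the evaluation code used for ERNIE-ViL 2.0, and all names are illustrative.

```python
# Sketch of image-to-text recall@K from precomputed, L2-normalized embeddings.
import numpy as np

def recall_at_k(image_emb: np.ndarray, text_emb: np.ndarray, k: int) -> float:
    """Image-to-text recall@k for N aligned (image_i, text_i) pairs."""
    sims = image_emb @ text_emb.T                      # (N, N) cosine similarities
    topk = np.argsort(-sims, axis=1)[:, :k]            # indices of k best texts per image
    hits = (topk == np.arange(len(sims))[:, None]).any(axis=1)
    return float(hits.mean())

# Toy usage with random unit vectors standing in for encoder outputs.
rng = np.random.default_rng(0)
img = rng.normal(size=(100, 64))
img /= np.linalg.norm(img, axis=1, keepdims=True)
txt = img + 0.1 * rng.normal(size=img.shape)           # noisy "matching" texts
txt /= np.linalg.norm(txt, axis=1, keepdims=True)
for k in (1, 5, 10):
    print(f"Image-to-text R@{k}: {recall_at_k(img, txt, k):.3f}")
```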