WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training

Abstract
Multi-modal pre-training models have been intensively explored in recent years to bridge vision and language. However, most of them explicitly model the cross-modal interaction between image-text pairs, assuming that a strong semantic correlation exists between the text and image modalities. Since this strong assumption is often invalid in real-world scenarios, we choose to implicitly model the cross-modal correlation for large-scale multi-modal pre-training, which is the focus of the Chinese project "WenLan" led by our team. Specifically, under the weak correlation assumption over image-text pairs, we propose a two-tower pre-training model called BriVL within the cross-modal contrastive learning framework. Unlike OpenAI CLIP, which adopts a simple contrastive learning method, we devise a more advanced algorithm by adapting the recent MoCo method to the cross-modal scenario. By building a large queue-based dictionary, BriVL can incorporate more negative samples with limited GPU resources. We further construct a large Chinese multi-source image-text dataset called RUC-CAS-WenLan for pre-training BriVL. Extensive experiments demonstrate that the pre-trained BriVL model outperforms both UNITER and OpenAI CLIP on various downstream tasks.
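The queue-based contrastive idea mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the BriVL implementation: the function names (`queue_info_nce`, `update_queue`, `momentum_update`), the temperature of 0.07, and the momentum of 0.99 are all assumptions chosen for illustration, and the random embeddings stand in for the outputs of real image and text encoders. It shows only the general MoCo-style pattern: each query embedding from one modality is scored against its paired key from the other modality plus a fixed-size queue of older keys that serve as negatives.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def queue_info_nce(query, positive_key, queue, temperature=0.07):
    """InfoNCE loss with queued negatives (hypothetical helper).

    query:        (B, D) embeddings from one modality, e.g. images
    positive_key: (B, D) paired embeddings from the other modality, e.g. texts
    queue:        (K, D) keys from earlier batches, used as negatives
    """
    q = l2_normalize(query)
    k = l2_normalize(positive_key)
    neg = l2_normalize(queue)
    pos_logit = np.sum(q * k, axis=1, keepdims=True)   # (B, 1)
    neg_logits = q @ neg.T                             # (B, K)
    logits = np.concatenate([pos_logit, neg_logits], axis=1) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    # The positive always sits in column 0, so the loss is -log softmax[:, 0].
    log_prob = logits[:, 0] - np.log(np.exp(logits).sum(axis=1))
    return float(-log_prob.mean())

def update_queue(queue, new_keys):
    # FIFO update: drop the oldest keys, append the newest, keep the size fixed.
    return np.concatenate([queue[len(new_keys):], new_keys], axis=0)

def momentum_update(key_params, query_params, m=0.99):
    # The key encoder slowly tracks the query encoder (assumed momentum value).
    return m * key_params + (1.0 - m) * query_params

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image_emb = rng.normal(size=(4, 16))   # stand-in for image-tower outputs
    text_emb = rng.normal(size=(4, 16))    # stand-in for text-tower outputs
    text_queue = rng.normal(size=(32, 16)) # stand-in for queued text keys
    print("loss:", queue_info_nce(image_emb, text_emb, text_queue))
    text_queue = update_queue(text_queue, text_emb)
    print("queue shape:", text_queue.shape)
```

Because the negatives come from the queue rather than from the current batch alone, the number of negatives (K) is decoupled from the batch size (B), which is the property the abstract refers to when it mentions incorporating more negative samples with limited GPU resources.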
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| image-captioning-on-aic-icc | CMCL | BLEU: 66.1, CIDEr: 220.7, METEOR: 41.1, ROUGE-L: 71.9 |
| image-retrieval-on-aic-icc | CMCL | Recall@1: 14.4, Recall@5: 39.1, Recall@10: 39.1 |
| image-retrieval-on-ruc-cas-wenlan | CMCL | Recall@1: 36, Recall@5: 55.4, Recall@10: 62.1 |
| image-to-text-retrieval-on-aic-icc | CMCL | Recall@1: 20.3, Recall@5: 37, Recall@10: 45.6 |
| image-to-text-retrieval-on-ruc-cas-wenlan | CMCL | Recall@1: 36.1, Recall@5: 55.5, Recall@10: 62.2 |