UNITER: UNiversal Image-TExt Representation Learning
Yen-Chun Chen; Linjie Li; Licheng Yu; Ahmed El Kholy; Faisal Ahmed; Zhe Gan; Yu Cheng; Jingjing Liu

Abstract
Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks, where multimodal inputs are processed simultaneously for joint visual and textual understanding. In this paper, we introduce UNITER, a UNiversal Image-TExt Representation, learned through large-scale pre-training over four image-text datasets (COCO, Visual Genome, Conceptual Captions, and SBU Captions), which can power heterogeneous downstream V+L tasks with joint multimodal embeddings. We design four pre-training tasks: Masked Language Modeling (MLM), Masked Region Modeling (MRM, with three variants), Image-Text Matching (ITM), and Word-Region Alignment (WRA). Unlike previous work, which applies joint random masking to both modalities, we use conditional masking for the pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of the image/text). In addition to ITM for global image-text alignment, we propose WRA, based on Optimal Transport (OT), to explicitly encourage fine-grained alignment between words and image regions during pre-training. Comprehensive analysis shows that both conditional masking and OT-based WRA contribute to better pre-training, and a thorough ablation study identifies an optimal combination of pre-training tasks. Extensive experiments show that UNITER achieves new state of the art across six V+L tasks (over nine datasets), including Visual Question Answering, Image-Text Retrieval, Referring Expression Comprehension, Visual Commonsense Reasoning, Visual Entailment, and NLVR$^2$. Code is available at https://github.com/ChenRocks/UNITER.
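The conditional masking and OT-based WRA objectives are easy to picture in code. The sketch below is illustrative only: the masking probability, the [MASK] token id, zero-replacement for masked regions, and the use of plain Sinkhorn iterations in place of the paper's IPOT solver are all assumptions for the example, not UNITER's actual implementation (see the linked repository for that).

```python
import torch
import torch.nn.functional as F

# Assumed constants for the sketch; UNITER builds on BERT's vocabulary and 15% masking.
MASK_TOKEN_ID = 103   # hypothetical [MASK] id
MASK_PROB = 0.15

def conditional_mask_text(token_ids: torch.Tensor):
    """MLM with conditional masking: mask ~15% of text tokens while every
    image region stays fully observed (regions are simply not touched here)."""
    mask = torch.rand(token_ids.shape) < MASK_PROB
    labels = torch.where(mask, token_ids, torch.full_like(token_ids, -100))
    masked_ids = torch.where(mask, torch.full_like(token_ids, MASK_TOKEN_ID), token_ids)
    return masked_ids, labels  # positions labelled -100 are ignored by the MLM loss

def conditional_mask_regions(region_feats: torch.Tensor):
    """MRM with conditional masking: zero out ~15% of region features while
    every text token stays fully observed."""
    mask = torch.rand(region_feats.size(0)) < MASK_PROB
    masked = region_feats.clone()
    masked[mask] = 0.0
    return masked, mask

def sinkhorn_wra_distance(word_emb, region_emb, eps=0.1, n_iters=50):
    """Optimal-transport distance between word and region embeddings, used as a
    WRA-style loss. The paper solves OT with IPOT; Sinkhorn iterations are used
    here only as a simpler stand-in."""
    w = F.normalize(word_emb, dim=-1)            # (T, d) word embeddings
    r = F.normalize(region_emb, dim=-1)          # (K, d) region embeddings
    cost = 1.0 - w @ r.t()                       # cosine distance matrix, (T, K)
    T_, K = cost.shape
    mu = torch.full((T_,), 1.0 / T_)             # uniform mass over words
    nu = torch.full((K,), 1.0 / K)               # uniform mass over regions
    Kmat = torch.exp(-cost / eps)
    u = torch.ones_like(mu)
    for _ in range(n_iters):
        v = nu / (Kmat.t() @ u + 1e-9)
        u = mu / (Kmat @ v + 1e-9)
    plan = torch.diag(u) @ Kmat @ torch.diag(v)  # transport plan between words and regions
    return (plan * cost).sum()                   # smaller = tighter word-region alignment
```

The point of conditioning each masking task on the full other modality is that a masked word can always be grounded in its (unmasked) image region, and vice versa, rather than both being hidden at once as in joint random masking.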
Code Repositories
ChenRocks/UNITER: https://github.com/ChenRocks/UNITER
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| visual-entailment-on-snli-ve-test | UNITER (Large) | Accuracy: 78.98 |
| visual-entailment-on-snli-ve-val | UNITER | Accuracy: 78.98 |
| visual-question-answering-on-vcr-q-a-test | UNITER (Large) | Accuracy: 77.3 |
| visual-question-answering-on-vcr-q-a-test | UNITER (Large, ensemble of 10 models) | Accuracy: 79.8 |
| visual-question-answering-on-vcr-q-ar-test | UNITER (Large) | Accuracy: 62.8 |
| visual-question-answering-on-vcr-qa-r-test | UNITER (Large, ensemble of 10 models) | Accuracy: 83.4 |
| visual-question-answering-on-vcr-qa-r-test | UNITER (Large) | Accuracy: 80.8 |
| visual-question-answering-on-vqa-v2-test-dev | UNITER (Large) | Accuracy: 73.24 |
| visual-question-answering-on-vqa-v2-test-std | UNITER (Large) | overall: 73.4 |
| visual-reasoning-on-nlvr2-test | UNITER (Large) | Accuracy: 79.5 |
| zero-shot-cross-modal-retrieval-on-flickr30k | UNITER | Image-to-text R@1: 80.7, R@5: 95.7, R@10: 98.0; Text-to-image R@1: 66.2, R@5: 88.4, R@10: 92.9 |