Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone

Abstract
Vision-language (VL) pre-training has recently received considerable attention. However, most existing end-to-end pre-training approaches either only aim to tackle VL tasks such as image-text retrieval, visual question answering (VQA) and image captioning that test high-level understanding of images, or only target region-level understanding for tasks such as phrase grounding and object detection. We present FIBER (Fusion-In-the-Backbone-based transformER), a new VL model architecture that can seamlessly handle both these types of tasks. Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model by inserting cross-attention into the image and text backbones, bringing gains in terms of memory and performance. In addition, unlike previous work that is either only pre-trained on image-text data or on fine-grained data with box-level annotations, we present a two-stage pre-training strategy that uses both these kinds of data efficiently: (i) coarse-grained pre-training based on image-text data; followed by (ii) fine-grained pre-training based on image-text-box data. We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection. Using deep multimodal fusion coupled with the two-stage pre-training, FIBER provides consistent performance improvements over strong baselines across all tasks, often outperforming methods using magnitudes more data. Code is available at https://github.com/microsoft/FIBER.
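To make the idea of "fusion in the backbone" from the abstract more concrete, the following is a minimal NumPy sketch of one backbone block that applies cross-attention to the other modality right after self-attention. It is an illustration only, not the released FIBER code: the function names, the token shapes, the omission of learned projections, layer normalization and feed-forward layers, and the scalar gate `alpha` on the cross-attention branch are all assumptions of this sketch.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(queries, keys, values):
    # Scaled dot-product attention without learned projections (simplification).
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    return softmax(scores) @ values


def fused_block(x, other, alpha):
    # Hypothetical backbone block: self-attention within one modality, then
    # cross-attention whose keys and values come from the other modality.
    # `alpha` is an assumed scalar gate; alpha = 0 recovers a uni-modal block.
    x = x + attention(x, x, x)
    x = x + alpha * attention(x, other, other)
    return x


rng = np.random.default_rng(0)
image_tokens = rng.normal(size=(49, 64))  # e.g. a 7x7 grid of patch features
text_tokens = rng.normal(size=(12, 64))   # e.g. 12 word-piece features

fused_image = fused_block(image_tokens, text_tokens, alpha=0.5)
fused_text = fused_block(text_tokens, image_tokens, alpha=0.5)
unimodal_image = fused_block(image_tokens, text_tokens, alpha=0.0)

print(fused_image.shape, fused_text.shape)
```

Because the cross-attention sits inside each backbone block, no separate stack of fusion layers is needed on top of the two encoders, which is the design choice the abstract contrasts with earlier models. Setting the gate to zero in this sketch reduces the block to plain self-attention, so the same backbone can in principle also run as a uni-modal encoder.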
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| described-object-detection-on-description | FIBER-B | Intra-scenario FULL mAP: 22.7; Intra-scenario PRES mAP: 21.5; Intra-scenario ABS mAP: 26.0 |
| object-detection-on-coco-o | FIBER-B (Swin-B) | Average mAP: 33.7; Effective Robustness: 11.43 |
| phrase-grounding-on-flickr30k-entities-dev | FIBER-B | R@1: 87.1; R@5: 96.1; R@10: 97.4 |
| phrase-grounding-on-flickr30k-entities-test | FIBER-B | R@1: 87.4; R@5: 96.4; R@10: 97.6 |