Yan Zeng, Xinsong Zhang, Hang Li, Jiawei Wang, Jipeng Zhang, Wangchunshu Zhou

Abstract
Vision-language pre-training aims to learn alignments between vision and language from a large amount of data. Most existing methods only learn image-text alignments, while some others utilize pre-trained object detectors to leverage vision-language alignments at the object level. In this paper, we propose to learn multi-grained vision-language alignments with a unified pre-training framework that performs multi-grained aligning and multi-grained localization simultaneously. Based on this framework, we present X$^2$-VLM, an all-in-one model with a flexible modular architecture, in which we further unify image-text pre-training and video-text pre-training in one model. X$^2$-VLM is able to learn unlimited visual concepts associated with diverse text descriptions. Experimental results show that X$^2$-VLM performs best at both base and large scale on both image-text and video-text tasks, striking a good trade-off between performance and model scale. Moreover, we show that the modular design of X$^2$-VLM makes it highly transferable to other languages and domains. For example, by simply replacing the text encoder with XLM-R, X$^2$-VLM outperforms state-of-the-art multilingual multi-modal pre-trained models without any multilingual pre-training. The code and pre-trained models are available at https://github.com/zengyan-97/X2-VLM.
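The modular design described above (a vision encoder, a text encoder, and a cross-modal fusion module that can be composed and swapped independently) can be illustrated with a minimal PyTorch sketch. The class and module names below are hypothetical and do not reflect the actual X$^2$-VLM code; only `AutoModel.from_pretrained("xlm-roberta-base")` refers to a real Hugging Face checkpoint, used here to show how the text encoder could be replaced with XLM-R for multilingual transfer.

```python
import torch
from torch import nn
from transformers import AutoModel  # used only to illustrate swapping in XLM-R

class ModularVLM(nn.Module):
    """Hypothetical sketch of a modular vision-language model:
    independent vision/text encoders plus a cross-modal fusion module."""

    def __init__(self, vision_encoder: nn.Module, text_encoder: nn.Module, fusion: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g. a ViT over image patches or video frames
        self.text_encoder = text_encoder      # e.g. BERT, or XLM-R for multilingual use
        self.fusion = fusion                  # e.g. cross-attention layers over both streams

    def forward(self, pixel_values, input_ids, attention_mask):
        vision_feats = self.vision_encoder(pixel_values)
        text_out = self.text_encoder(input_ids=input_ids, attention_mask=attention_mask)
        return self.fusion(text_out.last_hidden_state, vision_feats)

# Swapping the text encoder is a constructor-level change, e.g.:
# multilingual_text_encoder = AutoModel.from_pretrained("xlm-roberta-base")
# model = ModularVLM(my_vision_encoder, multilingual_text_encoder, my_fusion_module)
```

Because the fusion module only consumes encoder outputs, replacing one encoder does not require retraining the others from scratch, which is the property the abstract relies on for multilingual transfer.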
Code Repositories
https://github.com/zengyan-97/X2-VLM
Benchmarks
| Benchmark | Model | Metrics |
|---|---|---|
| cross-modal-retrieval-on-coco-2014 | X2-VLM (base) | Image-to-text R@1: 83.5, R@5: 96.3, R@10: 98.5; Text-to-image R@1: 66.2, R@5: 87.1, R@10: 92.2 |
| cross-modal-retrieval-on-coco-2014 | X2-VLM (large) | Image-to-text R@1: 84.4, R@5: 96.5, R@10: 98.5; Text-to-image R@1: 67.7, R@5: 87.5, R@10: 92.5 |
| cross-modal-retrieval-on-flickr30k | X2-VLM (base) | Image-to-text R@1: 98.5, R@5: 100, R@10: 100; Text-to-image R@1: 90.4, R@5: 98.2, R@10: 99.3 |
| cross-modal-retrieval-on-flickr30k | X2-VLM (large) | Image-to-text R@1: 98.8, R@5: 100, R@10: 100; Text-to-image R@1: 91.8, R@5: 98.6, R@10: 99.5 |
| video-retrieval-on-msr-vtt-1ka | X2-VLM (base) | Text-to-video R@1: 47.6, R@5: 74.1, R@10: 84.2 |
| video-retrieval-on-msr-vtt-1ka | X2-VLM (large) | Text-to-video R@1: 49.6, R@5: 76.7, R@10: 84.2 |
| visual-grounding-on-refcoco-val | X2-VLM (base) | Accuracy (%): 85.2 |
| visual-grounding-on-refcoco-val | X2-VLM (large) | Accuracy (%): 87.6 |
| visual-grounding-on-refcoco-testa | X2-VLM (base) | Accuracy (%): 90.3 |
| visual-grounding-on-refcoco-testa | X2-VLM (large) | Accuracy (%): 92.1 |
| visual-grounding-on-refcoco-test-b | X2-VLM (base) | Accuracy (%): 78.4 |
| visual-grounding-on-refcoco-test-b | X2-VLM (large) | Accuracy (%): 81.8 |
| visual-question-answering-on-msrvtt-qa-1 | X2-VLM (base) | Accuracy: 0.45 |
| visual-question-answering-on-msrvtt-qa-1 | X2-VLM (large) | Accuracy: 0.455 |
| visual-question-answering-on-msvd-qa-1 | X2-VLM (base) | Accuracy: 0.528 |
| visual-question-answering-on-msvd-qa-1 | X2-VLM (large) | Accuracy: 0.546 |
| visual-question-answering-on-vqa-v2-test-dev | X2-VLM (base) | Accuracy: 80.4 |
| visual-question-answering-on-vqa-v2-test-dev | X2-VLM (large) | Accuracy: 81.9 |
| visual-question-answering-on-vqa-v2-test-std | X2-VLM (base) | Overall: 80.2 |
| visual-question-answering-on-vqa-v2-test-std | X2-VLM (large) | Overall: 81.8 |
| visual-reasoning-on-nlvr2-dev | X2-VLM (base) | Accuracy: 86.2 |
| visual-reasoning-on-nlvr2-dev | X2-VLM (large) | Accuracy: 88.7 |
| visual-reasoning-on-nlvr2-test | X2-VLM (base) | Accuracy: 87.0 |
| visual-reasoning-on-nlvr2-test | X2-VLM (large) | Accuracy: 89.4 |