M2-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale Efficient Pretraining

Qingpei Guo; Furong Xu; Hanxiao Zhang; Wang Ren; Ziping Ma; Lin Ju; Jian Wang; Jingdong Chen; Ming Yang

Abstract

Vision-language foundation models like CLIP have revolutionized the field of artificial intelligence. Nevertheless, VLMs that support multiple languages, e.g., both Chinese and English, have lagged behind due to the relative scarcity of large-scale pretraining datasets. To this end, we introduce BM-6B, a comprehensive bilingual (Chinese-English) dataset of over 6 billion image-text pairs, aimed at enhancing multimodal foundation models so that they understand images well in both languages. To handle a dataset of this scale, we propose a novel grouped aggregation approach for computing the image-text contrastive loss, which significantly reduces communication overhead and GPU memory demands, yielding a 60% increase in training speed. On BM-6B we pretrain a series of bilingual image-text foundation models with enhanced fine-grained understanding ability; the resulting models, dubbed $M^2$-Encoders (pronounced "M-Square"), set new benchmarks in both languages for multimodal retrieval and classification tasks. Notably, our largest $M^2$-Encoder-10B model achieves top-1 accuracies of 88.5% on ImageNet and 80.7% on ImageNet-CN under a zero-shot classification setting, surpassing previously reported SoTA methods by 2.2% and 21.1%, respectively. The $M^2$-Encoder series represents one of the most comprehensive bilingual image-text foundation models to date, so we are making it available to the research community for further exploration and development.
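The grouped aggregation scheme itself is detailed in the full paper; as a rough illustration of the class of optimization involved, below is a minimal PyTorch sketch of the widely used "local rows against gathered columns" trick for distributed CLIP-style contrastive loss, which avoids materializing the full (world_size × batch)² logit matrix on every GPU. All function names and the distributed setup here are assumptions for illustration, not the authors' code.

```python
# Minimal sketch of a memory-saving distributed contrastive loss.
# Assumes torch.distributed is initialized (e.g., via torchrun) and
# each rank holds a local batch of image/text embeddings.
import torch
import torch.nn.functional as F
import torch.distributed as dist

def gather_features(t: torch.Tensor) -> torch.Tensor:
    """All-gather embeddings across ranks; only the local shard keeps grad."""
    shards = [torch.zeros_like(t) for _ in range(dist.get_world_size())]
    dist.all_gather(shards, t)           # no autograd through remote shards
    shards[dist.get_rank()] = t          # re-insert local tensor to keep grad
    return torch.cat(shards, dim=0)

def local_contrastive_loss(img: torch.Tensor, txt: torch.Tensor,
                           logit_scale: torch.Tensor) -> torch.Tensor:
    """Each rank scores only its local batch (rows) against the
    globally gathered batch (columns): B x (world*B) logits instead
    of the full (world*B) x (world*B) matrix per GPU."""
    img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
    all_img, all_txt = gather_features(img), gather_features(txt)

    logits_i2t = logit_scale * img @ all_txt.t()
    logits_t2i = logit_scale * txt @ all_img.t()

    b = img.size(0)
    targets = torch.arange(b, device=img.device) + dist.get_rank() * b
    return (F.cross_entropy(logits_i2t, targets) +
            F.cross_entropy(logits_t2i, targets)) / 2
```

This mirrors the local-loss option found in open-source CLIP implementations such as OpenCLIP; whether $M^2$-Encoder's grouped aggregation matches it exactly is not specified here.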

Benchmarks

Benchmark: zero-shot-cross-modal-retrieval-on-coco-2014
Methodology: $M^2$-Encoder
Metrics:
  Image-to-text R@1: 72.8 | R@5: 92.3 | R@10: 96.3
  Text-to-image R@1: 56.5 | R@5: 81.6 | R@10: 88.8

Benchmark: zero-shot-cross-modal-retrieval-on-flickr30k
Methodology: $M^2$-Encoder
Metrics:
  Image-to-text R@1: 91.2 | R@5: 99.2 | R@10: 99.6
  Text-to-image R@1: 92.2 | R@5: 99.5 | R@10: 99.7

Benchmark: zero-shot-learning-on-imagenet-cn
Methodology: $M^2$-Encoder
Metrics:
  Accuracy: 80.7

Benchmark: zero-shot-transfer-image-classification-on-1
Methodology: $M^2$-Encoder
Metrics:
  Accuracy (Private): 88.5
  Param: 10B
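For reference, the retrieval numbers above are Recall@K: the percentage of queries whose ground-truth match appears among the top-K retrieved candidates. Below is a minimal sketch under the simplifying assumption of one ground-truth candidate per query (on COCO and Flickr30k, image-to-text retrieval typically counts a hit if any of an image's reference captions is retrieved).

```python
import torch

def recall_at_k(sim: torch.Tensor, k: int) -> float:
    """Recall@K from a [num_queries, num_candidates] similarity matrix,
    assuming the ground-truth match for query i is candidate i."""
    topk = sim.topk(k, dim=1).indices              # [Q, K] retrieved ids
    gold = torch.arange(sim.size(0)).unsqueeze(1)  # [Q, 1] gold ids
    return (topk == gold).any(dim=1).float().mean().item() * 100  # percent

# e.g. image-to-text: sim = image_feats @ text_feats.t()  (L2-normalized)
```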
