

Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information

Abstract

To effectively exploit the potential of large-scale models, various pre-training strategies supported by massive data from different sources have been proposed, including supervised pre-training, weakly-supervised pre-training, and self-supervised pre-training. It has been shown that combining multiple pre-training strategies and data from various modalities/sources can greatly boost the training of large-scale models. However, current works adopt a multi-stage pre-training system, where the complex pipeline may increase the uncertainty and instability of the pre-training. It is thus desirable that these strategies can be integrated in a single-stage manner. In this paper, we first propose a general multi-modal mutual information formula as a unified optimization target and demonstrate that all existing approaches are special cases of our framework. Under this unified perspective, we propose an all-in-one single-stage pre-training approach, named Maximizing Multi-modal Mutual Information Pre-training (M3I Pre-training). Our approach achieves better performance than previous pre-training methods on various vision benchmarks, including ImageNet classification, COCO object detection, LVIS long-tailed object detection, and ADE20K semantic segmentation. Notably, we successfully pre-train a billion-level parameter image backbone and achieve state-of-the-art performance on various benchmarks. Code shall be released at https://github.com/OpenGVLab/M3I-Pretraining.
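The unified objective described above maximizes mutual information between representations of paired inputs (e.g., different modalities or augmented views). The abstract does not give the exact formula, but a common tractable surrogate for such objectives is the InfoNCE lower bound on mutual information; the sketch below is a generic illustration of that estimator, not the paper's specific formulation:

```python
import numpy as np

def infonce_lower_bound(z_a, z_b, temperature=0.1):
    """InfoNCE-style lower bound on the mutual information between
    paired representation batches z_a, z_b of shape (N, D).
    Row i of z_a is the positive for row i of z_b; all other rows
    in the batch serve as negatives."""
    # Cosine similarity: L2-normalize each representation.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature                 # (N, N) similarities
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    # Log-softmax over in-batch candidates for each anchor.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    n = z_a.shape[0]
    # Average log-probability of the true (diagonal) pairs;
    # maximizing this tightens the lower bound on I(A; B).
    return log_probs[np.arange(n), np.arange(n)].mean()
```

Correctly paired batches yield a higher bound than mismatched ones, which is what drives representations of corresponding inputs together during pre-training.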

Code Repositories

OpenGVLab/M3I-Pretraining
Official
Mentioned in GitHub

Benchmarks

Benchmark | Methodology | Metrics
ImageNet image classification | M3I Pre-training (InternImage-H) | Top-1 Accuracy: 89.6%
COCO object detection | M3I Pre-training (InternImage-H) | box mAP: 65.4
COCO minival object detection | M3I Pre-training (InternImage-H) | box AP: 65.0
LVIS v1.0 minival object detection | M3I Pre-training (InternImage-H, single-scale) | box AP: 65.8
ADE20K semantic segmentation | M3I Pre-training (InternImage-H) | Params (M): 1310; Validation mIoU: 62.9
