

Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information

Abstract

To effectively exploit the potential of large-scale models, various pre-training strategies supported by massive data from different sources have been proposed, including supervised pre-training, weakly-supervised pre-training, and self-supervised pre-training. It has been shown that combining multiple pre-training strategies and data from various modalities/sources can greatly boost the training of large-scale models. However, current works adopt a multi-stage pre-training system, where the complex pipeline may increase the uncertainty and instability of the pre-training. It is thus desirable that these strategies can be integrated in a single-stage manner. In this paper, we first propose a general multi-modal mutual information formula as a unified optimization target and demonstrate that all existing approaches are special cases of our framework. Under this unified perspective, we propose an all-in-one single-stage pre-training approach, named Maximizing Multi-modal Mutual Information Pre-training (M3I Pre-training). Our approach achieves better performance than previous pre-training methods on various vision benchmarks, including ImageNet classification, COCO object detection, LVIS long-tailed object detection, and ADE20K semantic segmentation. Notably, we successfully pre-train a billion-level parameter image backbone and achieve state-of-the-art performance on various benchmarks. Code shall be released at https://github.com/OpenGVLab/M3I-Pretraining.
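The unified objective described above maximizes mutual information between representations of paired inputs (e.g., different modalities or augmented views). The abstract does not give the exact formula, but a common tractable surrogate for such objectives is the InfoNCE lower bound on mutual information; the sketch below is a generic illustration of that estimator, not the paper's specific formulation:

```python
import numpy as np

def infonce_lower_bound(z_a, z_b, temperature=0.1):
    """InfoNCE-style lower bound on the mutual information between
    paired representation batches z_a, z_b of shape (N, D).
    Row i of z_a is the positive for row i of z_b; all other rows
    in the batch serve as negatives."""
    # Cosine similarity: L2-normalize each representation.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature                 # (N, N) similarities
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    # Log-softmax over in-batch candidates for each anchor.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    n = z_a.shape[0]
    # Average log-probability of the true (diagonal) pairs;
    # maximizing this tightens the lower bound on I(A; B).
    return log_probs[np.arange(n), np.arange(n)].mean()
```

Correctly paired batches yield a higher bound than mismatched ones, which is what drives representations of corresponding inputs together during pre-training.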

Code Repositories

OpenGVLab/M3I-Pretraining
Official
Mentioned in GitHub

Benchmarks

Benchmark | Methodology | Metrics
ImageNet image classification | M3I Pre-training (InternImage-H) | Top-1 Accuracy: 89.6%
COCO object detection | M3I Pre-training (InternImage-H) | box mAP: 65.4
COCO minival object detection | M3I Pre-training (InternImage-H) | box AP: 65.0
LVIS v1.0 minival object detection | M3I Pre-training (InternImage-H, single-scale) | box AP: 65.8
ADE20K semantic segmentation | M3I Pre-training (InternImage-H) | Params (M): 1310; Validation mIoU: 62.9
