3 months ago

Multimodal Autoregressive Pre-training of Large Vision Encoders

Abstract

We introduce a novel method for pre-training of large-scale vision encoders. Building on recent advancements in autoregressive pre-training of vision models, we extend this framework to a multimodal setting, i.e., images and text. In this paper, we present AIMV2, a family of generalist vision encoders characterized by a straightforward pre-training process, scalability, and remarkable performance across a range of downstream tasks. This is achieved by pairing the vision encoder with a multimodal decoder that autoregressively generates raw image patches and text tokens. Our encoders excel not only in multimodal evaluations but also in vision benchmarks such as localization, grounding, and classification. Notably, our AIMV2-3B encoder achieves 89.5% accuracy on ImageNet-1k with a frozen trunk. Furthermore, AIMV2 consistently outperforms state-of-the-art contrastive models (e.g., CLIP, SigLIP) in multimodal image understanding across diverse settings.
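
To make the idea of a decoder that "autoregressively generates raw image patches and text tokens" more concrete, here is a minimal sketch of what a combined next-element objective could look like. It is not taken from the official repository. The loss forms (mean squared error on patch pixels, cross-entropy on text tokens), the image-then-text ordering, the weight `alpha`, and all function names are assumptions made for illustration; the abstract itself does not specify them.

```python
import math

def patch_l2_loss(pred_patches, target_patches):
    """Mean squared error between predicted and target patch pixel vectors."""
    total, count = 0.0, 0
    for pred, target in zip(pred_patches, target_patches):
        for p, t in zip(pred, target):
            total += (p - t) ** 2
            count += 1
    return total / count

def text_cross_entropy(logits_per_step, target_ids):
    """Mean negative log-likelihood of each target id under softmax(logits)."""
    total = 0.0
    for logits, target in zip(logits_per_step, target_ids):
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[target]
    return total / len(target_ids)

def multimodal_ar_loss(pred_patches, patches, text_logits, text_ids, alpha=1.0):
    """Hypothetical combined loss over one sequence [patches..., text tokens...].

    pred_patches[i] is the decoder's prediction for patches[i + 1], so the
    first patch has no target. text_logits[j] scores text_ids[j].
    alpha (an assumed hyperparameter) weights the image term.
    """
    image_loss = patch_l2_loss(pred_patches, patches[1:])
    text_loss = text_cross_entropy(text_logits, text_ids)
    return text_loss + alpha * image_loss

# Toy example: three 2-pixel patches followed by one text token.
patches = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
pred_patches = [[1.0, 1.0], [2.0, 3.0]]  # predictions for patches 2 and 3
text_logits = [[0.0, 0.0]]               # uniform over a 2-word vocabulary
text_ids = [0]
loss = multimodal_ar_loss(pred_patches, patches, text_logits, text_ids)
```

In the toy example the image term is 0.25 (one squared error of 1 spread over four pixel values) and the text term is log 2, since the logits are uniform over two tokens.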

Code Repositories

apple/ml-aim (Official, jax, Mentioned in GitHub)

Benchmarks

| Benchmark | Methodology | Metrics |
| --- | --- | --- |
| image-classification-on-imagenet | AIMv2-3B | Top 1 Accuracy: 88.5% |
| image-classification-on-imagenet | AIMv2-2B | Number of params: 2700M |
| image-classification-on-imagenet | AIMv2-3B (448 res) | Top 1 Accuracy: 89.5% |
| image-classification-on-imagenet | AIMv2-L | Number of params: 300M; Top 1 Accuracy: 86.6% |
| image-classification-on-imagenet | AIMv2-1B | Number of params: 1200M; Top 1 Accuracy: 88.1% |
| image-classification-on-imagenet | AIMv2-H | Number of params: 600M; Top 1 Accuracy: 87.5% |
| image-classification-on-inaturalist | AIMv2-1B | Top 1 Accuracy: 79.7 |
| image-classification-on-inaturalist | AIMv2-H | Top 1 Accuracy: 77.9 |
| image-classification-on-inaturalist | AIMv2-3B | Top 1 Accuracy: 81.5 |
| image-classification-on-inaturalist | AIMv2-L | Top 1 Accuracy: 76 |
| image-classification-on-inaturalist | AIMv2-3B (448 res) | Top 1 Accuracy: 85.9 |
