Scaling Vision Transformers to 22 Billion Parameters

Mostafa Dehghani; Josip Djolonga; Basil Mustafa; Piotr Padlewski; Jonathan Heek; Justin Gilmer; Andreas Steiner; Mathilde Caron; Robert Geirhos; Ibrahim Alabdulmohsin; Rodolphe Jenatton; Lucas Beyer; Michael Tschannen; Anurag Arnab; Xiao Wang; Carlos Riquelme; Matthias Minderer; Joan Puigcerver; Utku Evci; Manoj Kumar; Sjoerd van Steenkiste; Gamaleldin F. Elsayed; Aravindh Mahendran; Fisher Yu; Avital Oliver; Fantine Huot; Jasmijn Bastings; Mark Patrick Collier; Alexey Gritsenko; Vighnesh Birodkar; Cristina Vasconcelos; Yi Tay; Thomas Mensink; Alexander Kolesnikov; Filip Pavetić; Dustin Tran; Thomas Kipf; Mario Lučić; Xiaohua Zhai; Daniel Keysers; Jeremiah Harmsen; Neil Houlsby

Abstract

The scaling of Transformers has driven breakthrough capabilities for language models. At present, the largest large language models (LLMs) contain upwards of 100B parameters. Vision Transformers (ViT) have introduced the same architecture to image and video modelling, but these have not yet been successfully scaled to nearly the same degree; the largest dense ViT contains 4B parameters (Chen et al., 2022). We present a recipe for highly efficient and stable training of a 22B-parameter ViT (ViT-22B) and perform a wide variety of experiments on the resulting model. When evaluated on downstream tasks (often with a lightweight linear model on frozen features), ViT-22B demonstrates increasing performance with scale. We further observe other interesting benefits of scale, including an improved tradeoff between fairness and performance, state-of-the-art alignment to human visual perception in terms of shape/texture bias, and improved robustness. ViT-22B demonstrates the potential for "LLM-like" scaling in vision, and provides key steps towards getting there.
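
As a concrete illustration of the "lightweight linear model on frozen features" evaluation mentioned in the abstract, the sketch below fits only a linear classifier on top of a frozen backbone. The backbone call (`vit_backbone`), feature shape, and optimizer settings are illustrative assumptions, not the paper's actual setup.

```python
# Hedged sketch: linear probing on frozen features.
# `vit_backbone` is assumed to map a batch of images to pooled features
# of shape (batch, feature_dim); this is not the paper's code.
import torch
import torch.nn as nn

def extract_features(vit_backbone, images):
    """Run the frozen backbone without gradients and return pooled features."""
    vit_backbone.eval()
    with torch.no_grad():
        return vit_backbone(images)  # assumed shape: (batch, feature_dim)

def train_linear_probe(vit_backbone, loader, feature_dim, num_classes, epochs=10):
    """Fit only a linear classifier on top of frozen ViT features."""
    probe = nn.Linear(feature_dim, num_classes)
    optimizer = torch.optim.SGD(probe.parameters(), lr=0.01, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in loader:
            feats = extract_features(vit_backbone, images)
            loss = loss_fn(probe(feats), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()  # only the probe's weights are updated
    return probe
```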

Code Repositories

lucidrains/flash-cosine-sim-attention (PyTorch)
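
The repository above provides a fused kernel for cosine-similarity attention. As a rough illustration of the underlying idea (L2-normalizing queries and keys so attention logits stay bounded, which helps training stability at large scale; ViT-22B itself applies LayerNorm to queries and keys), here is a plain-PyTorch sketch. The fixed `scale` value and tensor shapes are assumptions; this is neither the repository's API nor the paper's exact normalization.

```python
# Hedged sketch: cosine-similarity attention in plain PyTorch.
# Queries and keys are unit-normalized before the dot product, so the
# attention logits lie in [-scale, scale]. `scale=10.0` is illustrative.
import torch
import torch.nn.functional as F

def cosine_sim_attention(q, k, v, scale=10.0):
    """q, k, v: (batch, heads, seq, dim). Returns output of the same shape."""
    q = F.normalize(q, dim=-1)                   # unit-norm queries
    k = F.normalize(k, dim=-1)                   # unit-norm keys
    logits = scale * (q @ k.transpose(-2, -1))   # bounded similarity scores
    attn = logits.softmax(dim=-1)
    return attn @ v

# Example usage with random tensors:
q = torch.randn(2, 8, 16, 64)
k = torch.randn(2, 8, 16, 64)
v = torch.randn(2, 8, 16, 64)
out = cosine_sim_attention(q, k, v)  # shape (2, 8, 16, 64)
```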

Benchmarks

Benchmark | Methodology | Metrics
action-classification-on-kinetics-400 | ViT-22B | Acc@1: 88.0
image-classification-on-imagenet | ViT-B/16 | Number of params: 86M; Top 1 Accuracy: 88.6%
image-classification-on-imagenet | ViT-L/16 (384res, distilled from ViT-22B) | Number of params: 307M; Top 1 Accuracy: 89.6%
object-recognition-on-shape-bias | ViT-22B-384 | Shape bias: 86.4
object-recognition-on-shape-bias | ViT-22B-560 | Shape bias: 83.8
object-recognition-on-shape-bias | ViT-22B-224 | Shape bias: 78.0
zero-shot-transfer-image-classification-on-1 | LiT-22B | Accuracy (Private): 85.9
zero-shot-transfer-image-classification-on-3 | LiT-22B | Accuracy (Private): 80.9
zero-shot-transfer-image-classification-on-4 | LiT-22B | Accuracy: 96.0
zero-shot-transfer-image-classification-on-5 | LiT-22B | Accuracy (Private): 90.1
zero-shot-transfer-image-classification-on-6 | LiT-22B | Accuracy (Private): 87.6
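
The zero-shot transfer rows above come from pairing the vision backbone with a text encoder in the LiT (locked-image text tuning) style, where images are classified by embedding similarity against text prompts. A minimal sketch of that procedure follows; `image_encoder`, `text_encoder`, and the prompt template are hypothetical stand-ins, not the actual LiT-22B evaluation code.

```python
# Hedged sketch: zero-shot classification with an image-text embedding model.
# Both encoders and the prompt template are illustrative assumptions.
import torch
import torch.nn.functional as F

def zero_shot_classify(image_encoder, text_encoder, images, class_names):
    """Return a predicted class index per image via cosine similarity."""
    prompts = [f"a photo of a {name}" for name in class_names]
    with torch.no_grad():
        img_emb = F.normalize(image_encoder(images), dim=-1)   # (B, D)
        txt_emb = F.normalize(text_encoder(prompts), dim=-1)   # (C, D)
    logits = img_emb @ txt_emb.t()      # cosine similarities, shape (B, C)
    return logits.argmax(dim=-1)        # predicted class per image
```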
