When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations

Xiangning Chen Cho-Jui Hsieh Boqing Gong

Abstract

Vision Transformers (ViTs) and MLPs signal further efforts to replace hand-wired features and inductive biases with general-purpose neural architectures. Existing works empower these models with massive data, such as large-scale pre-training and/or repeated strong data augmentations, yet still report optimization-related problems (e.g., sensitivity to initialization and learning rates). This paper therefore investigates ViTs and MLP-Mixers through the lens of loss geometry, aiming to improve the models' data efficiency during training and generalization at inference. Visualizations and Hessian spectra reveal that converged models sit in extremely sharp local minima. By promoting smoothness with a recently proposed sharpness-aware optimizer (SAM), we substantially improve the accuracy and robustness of ViTs and MLP-Mixers on tasks spanning supervised, adversarial, contrastive, and transfer learning (e.g., +5.3% and +11.0% top-1 accuracy on ImageNet for ViT-B/16 and Mixer-B/16, respectively, with simple Inception-style preprocessing). We show that the improved smoothness is attributable to sparser active neurons in the first few layers. The resulting ViTs outperform ResNets of similar size and throughput when trained from scratch on ImageNet without large-scale pre-training or strong data augmentations. Model checkpoints are available at https://github.com/google-research/vision_transformer.
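The sharpness-aware optimizer the abstract refers to (SAM) performs a two-step update: it first perturbs the weights a distance rho along the normalized gradient to approximate the worst-case nearby point, then descends using the gradient evaluated at that perturbed point. The following is a minimal NumPy sketch on a toy quadratic loss; the quadratic, the step sizes, and the helper names are illustrative assumptions, not the paper's actual training setup:

```python
import numpy as np

# Toy ill-conditioned quadratic loss: L(w) = 0.5 * w^T A w, gradient A w.
A = np.diag([10.0, 1.0])

def loss(w):
    return 0.5 * w @ A @ w

def grad(w):
    return A @ w

def sam_step(w, lr=0.05, rho=0.05):
    """One SAM update: perturb the weights along the normalized gradient
    (approximate worst-case point), then descend using the gradient
    computed at that perturbed point."""
    g = grad(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # ascent perturbation
    g_sharp = grad(w + eps)                      # gradient at w + eps
    return w - lr * g_sharp

w = np.array([1.0, 1.0])
for _ in range(100):
    w = sam_step(w)  # the loss shrinks toward the minimum at the origin
```

In a real training loop, each `sam_step` costs two forward-backward passes (one at `w`, one at `w + eps`), which is the price the paper pays for the flatter minima.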

Code Repositories

google-research/vision_transformer (official, JAX)
ttt496/VisionTransformer (JAX)

Benchmarks

Domain Generalization on ImageNet-C (Top-1 Accuracy)
- Mixer-B/8-SAM: 48.9
- ResNet-152x2-SAM: 55
- ViT-B/16-SAM: 56.5

Domain Generalization on ImageNet-R (Top-1 Error Rate)
- Mixer-B/8-SAM: 76.5
- ResNet-152x2-SAM: 71.9
- ViT-B/16-SAM: 73.6

Fine-Grained Image Classification on Oxford-IIIT Pets (Accuracy)
- ResNet-50-SAM: 91.6
- Mixer-B/16-SAM: 92.5
- Mixer-S/16-SAM: 88.7
- ViT-B/16-SAM: 93.1
- ViT-S/16-SAM: 92.9
- ResNet-152-SAM: 93.3

Image Classification on CIFAR-10 (Percentage Correct)
- ResNet-50-SAM: 97.4
- Mixer-S/16-SAM: 96.1
- ViT-S/16-SAM: 98.2
- ViT-B/16-SAM: 98.6
- ResNet-152-SAM: 98.2
- Mixer-B/16-SAM: 97.8

Image Classification on CIFAR-100 (Percentage Correct)
- Mixer-B/16-SAM: 86.4
- ViT-B/16-SAM: 89.1
- ResNet-50-SAM: 85.2
- ViT-S/16-SAM: 87.6
- Mixer-S/16-SAM: 82.4

Image Classification on Flowers-102 (Accuracy)
- Mixer-S/16-SAM: 87.9
- ResNet-152-SAM: 91.1
- ViT-S/16-SAM: 91.5
- ViT-B/16-SAM: 91.8
- Mixer-B/16-SAM: 90
- ResNet-50-SAM: 90

Image Classification on ImageNet (Top-1 Accuracy, Params)
- ViT-B/16-SAM: 79.9% (87M params)
- ResNet-152x2-SAM: 81.1% (236M params)
- Mixer-B/8-SAM: 79% (64M params)

Image Classification on ImageNet ReaL (Accuracy)
- ResNet-152x2-SAM: 86.4%
- ViT-B/16-SAM: 85.2%
- Mixer-B/8-SAM: 84.4%

Image Classification on ImageNet V2 (Top-1 Accuracy)
- Mixer-B/8-SAM: 65.5
- ViT-B/16-SAM: 67.5
- ResNet-152x2-SAM: 69.6
