Pyramid Adversarial Training Improves ViT Performance

Charles Herrmann, Kyle Sargent, Lu Jiang, Ramin Zabih, Huiwen Chang, Ce Liu, Dilip Krishnan, Deqing Sun

Abstract

Aggressive data augmentation is a key component of the strong generalization capabilities of the Vision Transformer (ViT). One such data augmentation technique is adversarial training (AT); however, many prior works have shown that it often results in poor clean accuracy. In this work, we present pyramid adversarial training (PyramidAT), a simple and effective technique to improve ViT's overall performance. We pair it with a "matched" Dropout and stochastic depth regularization, which adopts the same Dropout and stochastic depth configuration for the clean and adversarial samples. Similar to the improvements on CNNs by AdvProp (which is not directly applicable to ViT), our pyramid adversarial training breaks the trade-off between in-distribution accuracy and out-of-distribution robustness for ViT and related architectures. It leads to a 1.82% absolute improvement in ImageNet clean accuracy for the ViT-B model when trained only on ImageNet-1K data, while simultaneously boosting performance on 7 ImageNet robustness metrics by absolute margins ranging from 1.76% to 15.68%. We set a new state of the art for ImageNet-C (41.42 mCE), ImageNet-R (53.92%), and ImageNet-Sketch (41.04%) without extra data, using only the ViT-B/16 backbone and our pyramid adversarial training. Our code is publicly available at pyramidat.github.io.
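To make the "pyramid" idea in the abstract more concrete, the following is a minimal NumPy sketch of how a multi-scale perturbation could be assembled: one perturbation tensor per scale, each upsampled to full resolution, weighted, summed, and added to the image. It is not the authors' implementation. The function names, the scales `(32, 16, 1)`, the multipliers `(20, 10, 1)`, the budget `eps`, and the use of nearest-neighbour upsampling are all illustrative assumptions that this page does not state. The per-scale values are sampled at random here, which corresponds to the "Random Pyramid" rows in the benchmark table; in the adversarial variant they would instead be updated by gradient ascent on the training loss.

```python
import numpy as np


def upsample_nearest(delta, factor):
    """Repeat each spatial entry `factor` times along height and width."""
    return np.repeat(np.repeat(delta, factor, axis=0), factor, axis=1)


def pyramid_perturbation(deltas, scales, multipliers):
    """Sum the per-scale perturbations after upsampling each to full size.

    deltas[i] has shape (H // scales[i], W // scales[i], C), so every
    upsampled term has shape (H, W, C).
    """
    total = 0.0
    for delta, scale, mult in zip(deltas, scales, multipliers):
        total = total + mult * upsample_nearest(delta, scale)
    return total


def random_pyramid_example(image,
                           scales=(32, 16, 1),            # assumed, not from this page
                           multipliers=(20.0, 10.0, 1.0),  # assumed, not from this page
                           eps=1.0 / 255.0,                # assumed per-scale budget
                           seed=0):
    """Perturb `image` (H, W, C, values in [0, 1]) with a random pyramid.

    H and W must be divisible by every entry of `scales`.
    """
    rng = np.random.default_rng(seed)
    h, w, c = image.shape
    # One independent perturbation per scale; coarse scales have few entries,
    # so each entry shifts a whole (scale x scale) block of pixels.
    deltas = [rng.uniform(-eps, eps, size=(h // s, w // s, c)) for s in scales]
    perturbed = image + pyramid_perturbation(deltas, scales, multipliers)
    # Keep the result a valid image.
    return np.clip(perturbed, 0.0, 1.0), deltas
```

Calling `random_pyramid_example(np.full((224, 224, 3), 0.5))` returns the perturbed image together with the three per-scale tensors. Because the coarse scales are shared across whole blocks and given larger weights, most of the perturbation energy in this sketch goes into structured, low-frequency changes rather than per-pixel noise.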

Code Repositories

google-research/scenic/tree/main/scenic/projects/adversarialtraining

Official

jax

Benchmarks

Benchmark	Methodology	Metrics
domain-generalization-on-imagenet-a	Pyramid Adversarial Training Improves ViT (384x384)	Top-1 accuracy %: 36.41
domain-generalization-on-imagenet-a	Pyramid Adversarial Training Improves ViT (Im21k)	Top-1 accuracy %: 62.44
domain-generalization-on-imagenet-c	Pyramid Adversarial Training Improves ViT	mean Corruption Error (mCE): 41.42
domain-generalization-on-imagenet-c	Pyramid Adversarial Training Improves ViT (Im21k)	Number of params: 87M; mean Corruption Error (mCE): 36.80
domain-generalization-on-imagenet-r	Pyramid Adversarial Training Improves ViT (Im21k)	Top-1 Error Rate: 42.16
domain-generalization-on-imagenet-r	Pyramid Adversarial Training Improves ViT	Top-1 Error Rate: 46.08
domain-generalization-on-imagenet-sketch	Pyramid Adversarial Training Improves ViT	Top-1 accuracy: 41.04
domain-generalization-on-imagenet-sketch	Pyramid Adversarial Training Improves ViT (Im21k)	Top-1 accuracy: 46.03
image-classification-on-objectnet	RegViT (RandAug)	Top-1 Accuracy: 29.3
image-classification-on-objectnet	MLP-Mixer + Pixel	Top-1 Accuracy: 24.75
image-classification-on-objectnet	Discrete ViT	Top-1 Accuracy: 29.95
image-classification-on-objectnet	RegViT (RandAug) + Adv Pixel	Top-1 Accuracy: 30.11
image-classification-on-objectnet	MLP-Mixer	Top-1 Accuracy: 25.9
image-classification-on-objectnet	RegViT (RandAug) + Random Pixel	Top-1 Accuracy: 28.72
image-classification-on-objectnet	RegViT (RandAug) + Adv Pyramid	Top-1 Accuracy: 32.92
image-classification-on-objectnet	RegViT on 384x384 + Random Pyramid	Top-1 Accuracy: 34.83
image-classification-on-objectnet	RegViT (RandAug) + Random Pyramid	Top-1 Accuracy: 29.41
image-classification-on-objectnet	Discrete ViT + Pixel	Top-1 Accuracy: 30.98
image-classification-on-objectnet	RegViT on 384x384 + Random Pixel	Top-1 Accuracy: 34.12
image-classification-on-objectnet	ViT	Top-1 Accuracy: 17.36
image-classification-on-objectnet	ViT + MixUp	Top-1 Accuracy: 25.65
image-classification-on-objectnet	ViT-B/16 (512x512) + Pyramid	Top-1 Accuracy: 49.39
image-classification-on-objectnet	MLP-Mixer + Pyramid	Top-1 Accuracy: 28.6
image-classification-on-objectnet	Discrete ViT + Pyramid	Top-1 Accuracy: 30.28
image-classification-on-objectnet	ViT-B/16 (512x512)	Top-1 Accuracy: 46.68
image-classification-on-objectnet	RegViT on 384x384 + Adv Pixel	Top-1 Accuracy: 37.41
image-classification-on-objectnet	RegViT on 384x384	Top-1 Accuracy: 35.59
image-classification-on-objectnet	ViT-B/16 (512x512) + Pixel	Top-1 Accuracy: 47.53
image-classification-on-objectnet	ViT + CutMix	Top-1 Accuracy: 21.61
image-classification-on-objectnet	RegViT on 384x384 + Adv Pyramid	Top-1 Accuracy: 39.79
