Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations

Youwei Liang, Chongjian Ge, Zhan Tong, Yibing Song, Jue Wang, Pengtao Xie


Abstract

Vision Transformers (ViTs) take all the image patches as tokens and construct multi-head self-attention (MHSA) among them. Fully leveraging these image tokens brings redundant computation, since not all tokens are attentive in MHSA: for example, tokens containing semantically meaningless or distractive image backgrounds do not contribute positively to ViT predictions. In this work, we propose to reorganize image tokens during the feed-forward process of ViT models, which is integrated into ViT during training. For each forward inference, we identify the attentive image tokens between the MHSA and FFN (i.e., feed-forward network) modules, guided by the corresponding class token attention. We then reorganize the image tokens by preserving the attentive ones and fusing the inattentive ones, which expedites subsequent MHSA and FFN computations. Our method, EViT, thus improves ViTs from two perspectives. First, under the same number of input image tokens, it reduces MHSA and FFN computation for efficient inference; for instance, the inference speed of DeiT-S is increased by 50% while its recognition accuracy drops by only 0.3% on ImageNet classification. Second, at the same computational cost, it empowers ViTs to take more image tokens as input, drawn from higher-resolution images, to improve recognition accuracy; for example, we improve the recognition accuracy of DeiT-S by 1% on ImageNet classification at the same computational cost as a vanilla DeiT-S. Meanwhile, our method introduces no additional parameters to ViTs. Experiments on standard benchmarks show the effectiveness of our method. The code is available at https://github.com/youweiliang/evit
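To make the reorganization step concrete, below is a minimal PyTorch sketch of the idea the abstract describes: between MHSA and FFN, keep the top-k image tokens ranked by class-token attention and fuse the remaining inattentive tokens into a single token, weighted by their attention. The function name `reorganize_tokens`, its signature, and the `keep_rate` parameter are illustrative assumptions, not the authors' exact implementation; see the linked repository for the official code.

```python
import torch

def reorganize_tokens(x, cls_attn, keep_rate=0.7):
    """Hypothetical sketch of EViT-style token reorganization.

    x:        (B, N+1, C) tokens, where x[:, 0] is the class token
    cls_attn: (B, N) class-token attention over the N image tokens
              (assumed averaged over heads)
    """
    B, N1, C = x.shape
    N = N1 - 1
    k = max(1, int(N * keep_rate))  # number of attentive tokens to keep

    # Indices of the top-k attentive image tokens, per sample
    topk_idx = cls_attn.topk(k, dim=1).indices              # (B, k)
    img_tokens = x[:, 1:]                                   # (B, N, C)
    keep = torch.gather(img_tokens, 1,
                        topk_idx.unsqueeze(-1).expand(B, k, C))

    # Fuse the remaining inattentive tokens into one token,
    # weighted by their class-token attention
    mask = torch.ones(B, N, dtype=torch.bool, device=x.device)
    mask.scatter_(1, topk_idx, False)                       # True = inattentive
    rest = img_tokens[mask].view(B, N - k, C)
    rest_attn = cls_attn[mask].view(B, N - k, 1)
    fused = (rest * rest_attn).sum(1, keepdim=True) \
            / rest_attn.sum(1, keepdim=True).clamp_min(1e-6)

    # New sequence: [class token, k attentive tokens, 1 fused token]
    return torch.cat([x[:, :1], keep, fused], dim=1)        # (B, k+2, C)
```

A quick usage example with DeiT-S-like shapes (196 patch tokens, embedding dim 384):

```python
x = torch.randn(2, 197, 384)                   # 1 class token + 196 patches
cls_attn = torch.rand(2, 196).softmax(dim=1)   # toy class-token attention
out = reorganize_tokens(x, cls_attn, keep_rate=0.7)
print(out.shape)  # torch.Size([2, 139, 384]): 1 cls + 137 kept + 1 fused
```

Because every downstream MHSA and FFN layer now sees k+2 instead of N+1 tokens, the quadratic attention cost and linear FFN cost shrink accordingly, which is where the reported speedups come from.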

Code Repositories

youweiliang/evit (official, PyTorch)
shiming-chen/zslvit (PyTorch)

Benchmarks

The percentage in each method name denotes EViT's token keep rate.

Benchmark | Method | GFLOPs | Top-1 Accuracy (%)
efficient-vits-on-imagenet-1k-with-deit-s | EViT (50%) | 2.3 | 78.5
efficient-vits-on-imagenet-1k-with-deit-s | EViT (60%) | 2.6 | 78.9
efficient-vits-on-imagenet-1k-with-deit-s | EViT (70%) | 3.0 | 79.5
efficient-vits-on-imagenet-1k-with-deit-s | EViT (80%) | 3.5 | 79.8
efficient-vits-on-imagenet-1k-with-deit-s | EViT (90%) | 4.0 | 79.8
efficient-vits-on-imagenet-1k-with-lv-vit-s | EViT (50%) | 3.9 | 82.5
efficient-vits-on-imagenet-1k-with-lv-vit-s | EViT (70%) | 4.7 | 83.0
