Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations

Youwei Liang, Chongjian Ge, Zhan Tong, Yibing Song, Jue Wang, Pengtao Xie


Abstract

Vision Transformers (ViTs) take all the image patches as tokens and construct multi-head self-attention (MHSA) among them. Fully leveraging these image tokens brings redundant computation, since not all tokens are attentive in MHSA: for example, tokens containing semantically meaningless or distractive image backgrounds do not contribute positively to ViT predictions. In this work, we propose to reorganize image tokens during the feed-forward process of ViT models, which is integrated into ViT during training. For each forward inference, we identify the attentive image tokens between the MHSA and FFN (i.e., feed-forward network) modules, guided by the corresponding class token attention. We then reorganize the image tokens by preserving the attentive ones and fusing the inattentive ones, which expedites subsequent MHSA and FFN computations. Our method, EViT, thus improves ViTs from two perspectives. First, under the same number of input image tokens, it reduces MHSA and FFN computation for efficient inference; for instance, the inference speed of DeiT-S is increased by 50% while its recognition accuracy drops by only 0.3% on ImageNet classification. Second, at the same computational cost, it empowers ViTs to take more image tokens as input, drawn from higher-resolution images, to improve recognition accuracy; for example, we improve the recognition accuracy of DeiT-S by 1% on ImageNet classification at the same computational cost as a vanilla DeiT-S. Meanwhile, our method introduces no additional parameters to ViTs. Experiments on standard benchmarks show the effectiveness of our method. The code is available at https://github.com/youweiliang/evit
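To make the reorganization step concrete, below is a minimal PyTorch sketch of the idea the abstract describes: between MHSA and FFN, keep the top-k image tokens ranked by class-token attention and fuse the remaining inattentive tokens into a single token, weighted by their attention. The function name `reorganize_tokens`, its signature, and the `keep_rate` parameter are illustrative assumptions, not the authors' exact implementation; see the linked repository for the official code.

```python
import torch

def reorganize_tokens(x, cls_attn, keep_rate=0.7):
    """Hypothetical sketch of EViT-style token reorganization.

    x:        (B, N+1, C) tokens, where x[:, 0] is the class token
    cls_attn: (B, N) class-token attention over the N image tokens
              (assumed averaged over heads)
    """
    B, N1, C = x.shape
    N = N1 - 1
    k = max(1, int(N * keep_rate))  # number of attentive tokens to keep

    # Indices of the top-k attentive image tokens, per sample
    topk_idx = cls_attn.topk(k, dim=1).indices              # (B, k)
    img_tokens = x[:, 1:]                                   # (B, N, C)
    keep = torch.gather(img_tokens, 1,
                        topk_idx.unsqueeze(-1).expand(B, k, C))

    # Fuse the remaining inattentive tokens into one token,
    # weighted by their class-token attention
    mask = torch.ones(B, N, dtype=torch.bool, device=x.device)
    mask.scatter_(1, topk_idx, False)                       # True = inattentive
    rest = img_tokens[mask].view(B, N - k, C)
    rest_attn = cls_attn[mask].view(B, N - k, 1)
    fused = (rest * rest_attn).sum(1, keepdim=True) \
            / rest_attn.sum(1, keepdim=True).clamp_min(1e-6)

    # New sequence: [class token, k attentive tokens, 1 fused token]
    return torch.cat([x[:, :1], keep, fused], dim=1)        # (B, k+2, C)
```

A quick usage example with DeiT-S-like shapes (196 patch tokens, embedding dim 384):

```python
x = torch.randn(2, 197, 384)                   # 1 class token + 196 patches
cls_attn = torch.rand(2, 196).softmax(dim=1)   # toy class-token attention
out = reorganize_tokens(x, cls_attn, keep_rate=0.7)
print(out.shape)  # torch.Size([2, 139, 384]): 1 cls + 137 kept + 1 fused
```

Because every downstream MHSA and FFN layer now sees k+2 instead of N+1 tokens, the quadratic attention cost and linear FFN cost shrink accordingly, which is where the reported speedups come from.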

Code Repositories

youweiliang/evit (official, PyTorch)
shiming-chen/zslvit (PyTorch)

Benchmarks

The percentage in each method name denotes EViT's token keep rate.

Benchmark | Method | GFLOPs | Top-1 Accuracy (%)
efficient-vits-on-imagenet-1k-with-deit-s | EViT (50%) | 2.3 | 78.5
efficient-vits-on-imagenet-1k-with-deit-s | EViT (60%) | 2.6 | 78.9
efficient-vits-on-imagenet-1k-with-deit-s | EViT (70%) | 3.0 | 79.5
efficient-vits-on-imagenet-1k-with-deit-s | EViT (80%) | 3.5 | 79.8
efficient-vits-on-imagenet-1k-with-deit-s | EViT (90%) | 4.0 | 79.8
efficient-vits-on-imagenet-1k-with-lv-vit-s | EViT (50%) | 3.9 | 82.5
efficient-vits-on-imagenet-1k-with-lv-vit-s | EViT (70%) | 4.7 | 83.0
