DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification
Rao Yongming ; Zhao Wenliang ; Liu Benlin ; Lu Jiwen ; Zhou Jie ; Hsieh Cho-Jui

Abstract
Attention is sparse in vision transformers. We observe that the final prediction in vision transformers is based only on a subset of the most informative tokens, which is sufficient for accurate image recognition. Based on this observation, we propose a dynamic token sparsification framework to prune redundant tokens progressively and dynamically based on the input. Specifically, we devise a lightweight prediction module to estimate the importance score of each token given the current features. The module is added to different layers to prune redundant tokens hierarchically. To optimize the prediction module in an end-to-end manner, we propose an attention masking strategy to differentiably prune a token by blocking its interactions with other tokens. Benefiting from the nature of self-attention, the unstructured sparse tokens are still hardware friendly, which makes it easy for our framework to achieve actual speed-up. By hierarchically pruning 66% of the input tokens, our method reduces FLOPs by 31%~37% and improves throughput by over 40%, while the drop in accuracy is within 0.5% for various vision transformers. Equipped with the dynamic token sparsification framework, DynamicViT models can achieve very competitive complexity/accuracy trade-offs compared to state-of-the-art CNNs and vision transformers on ImageNet. Code is available at https://github.com/raoyongming/DynamicViT
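The attention masking idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `masked_attention`, the plain-list representation, and the rule that a token always keeps attending to itself are assumptions made here for illustration. The point it shows is that a dropped token can be "pruned" without changing tensor shapes, simply by zeroing its contribution inside the softmax.

```python
import math

def masked_attention(scores, keep):
    """Row-wise softmax in which dropped tokens receive no attention.

    scores: n x n list of raw attention logits (query i, key j).
    keep:   length-n list of 0/1 decisions (1 = token kept).
    Assumption for this sketch: the diagonal is never masked, so each
    token still attends to itself and every row has a nonzero sum.
    """
    n = len(scores)
    out = []
    for i in range(n):
        row_max = max(scores[i])  # subtract the row max for numerical stability
        weights = []
        for j in range(n):
            gate = 1.0 if i == j else float(keep[j])
            weights.append(math.exp(scores[i][j] - row_max) * gate)
        total = sum(weights)
        out.append([w / total for w in weights])
    return out

# Three tokens; the middle one is marked as dropped.
scores = [[0.0, 1.0, 2.0],
          [1.0, 0.0, 1.0],
          [2.0, 1.0, 0.0]]
keep = [1, 0, 1]
attn = masked_attention(scores, keep)
```

In `attn`, the column for the dropped token is zero in every row except its own, so the kept tokens no longer see it, while each row still sums to one.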
Benchmarks
| Benchmark | Model | GFLOPs | Params | Top-1 Accuracy (%) |
|---|---|---|---|---|
| efficient-vits-on-imagenet-1k-with-deit-s | DynamicViT (70%) | 2.9 | | 79.3 |
| efficient-vits-on-imagenet-1k-with-deit-s | DynamicViT (90%) | 4.0 | | 79.8 |
| efficient-vits-on-imagenet-1k-with-deit-s | DynamicViT (80%) | 3.4 | | 79.8 |
| efficient-vits-on-imagenet-1k-with-lv-vit-s | DynamicViT (70%) | 4.6 | | 83.0 |
| efficient-vits-on-imagenet-1k-with-lv-vit-s | DynamicViT (80%) | 5.1 | | 83.2 |
| efficient-vits-on-imagenet-1k-with-lv-vit-s | DynamicViT (90%) | 5.8 | | 83.3 |
| image-classification-on-imagenet | DynamicViT-LV-M/0.8 | | 57.1M | 83.9 |
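
The percentages in the model names can be related to the "66% of the input tokens" figure in the abstract with a short calculation. The sketch below assumes that the percentage is a per-stage keep ratio applied at three pruning stages, that the input has 196 patch tokens (a 14 x 14 grid), and that counts are rounded down; the function name `tokens_per_stage` is hypothetical and none of these details are stated on this page.

```python
def tokens_per_stage(num_tokens, base_keep_ratio, num_stages=3):
    """Tokens remaining after each pruning stage.

    Assumption for this sketch: stage s keeps base_keep_ratio ** s of the
    original tokens, rounded down to a whole token.
    """
    counts = []
    for stage in range(1, num_stages + 1):
        counts.append(int(num_tokens * base_keep_ratio ** stage))
    return counts

remaining = tokens_per_stage(196, 0.7)            # [137, 96, 67]
pruned_fraction = 1 - remaining[-1] / 196         # about 0.66
```

Under these assumptions, a 70% keep ratio leaves roughly a third of the tokens after the last stage, which matches the 66% pruning figure quoted in the abstract.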