
DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification

Rao Yongming ; Zhao Wenliang ; Liu Benlin ; Lu Jiwen ; Zhou Jie ; Hsieh Cho-Jui

Abstract

Attention is sparse in vision transformers. We observe the final prediction in vision transformers is only based on a subset of most informative tokens, which is sufficient for accurate image recognition. Based on this observation, we propose a dynamic token sparsification framework to prune redundant tokens progressively and dynamically based on the input. Specifically, we devise a lightweight prediction module to estimate the importance score of each token given the current features. The module is added to different layers to prune redundant tokens hierarchically. To optimize the prediction module in an end-to-end manner, we propose an attention masking strategy to differentiably prune a token by blocking its interactions with other tokens. Benefiting from the nature of self-attention, the unstructured sparse tokens are still hardware friendly, which makes our framework easy to achieve actual speed-up. By hierarchically pruning 66% of the input tokens, our method greatly reduces 31%~37% FLOPs and improves the throughput by over 40% while the drop of accuracy is within 0.5% for various vision transformers. Equipped with the dynamic token sparsification framework, DynamicViT models can achieve very competitive complexity/accuracy trade-offs compared to state-of-the-art CNNs and vision transformers on ImageNet. Code is available at https://github.com/raoyongming/DynamicViT
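The attention masking idea in the abstract can be illustrated with a small sketch. This is not the official implementation. The function name `masked_attention`, the 0/1 `keep` vector, and the choice to let every token still attend to itself are assumptions made for illustration. The sketch shows how multiplying the exponentiated scores by a mask removes a pruned token's influence on the other tokens while each row remains a valid probability distribution.

```python
import math


def masked_attention(scores, keep):
    """Row-wise softmax over `scores` with pruned tokens masked out.

    scores: n x n list of raw attention logits (query i, key j).
    keep:   length-n list of 0/1 decisions; 0 marks a pruned token.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        # Subtract the row maximum for numerical stability.
        row_max = max(scores[i])
        exps = []
        for j in range(n):
            # Assumption: a token always attends to itself, so every row
            # keeps at least one unmasked entry even if that token is pruned.
            g = 1.0 if i == j else float(keep[j])
            exps.append(math.exp(scores[i][j] - row_max) * g)
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Hypothetical 3-token example in which the middle token is pruned.
scores = [[0.1, 2.0, 0.3],
          [0.5, 0.2, 0.1],
          [1.0, 1.0, 1.0]]
attn = masked_attention(scores, keep=[1, 0, 1])
for row in attn:
    print([round(w, 3) for w in row])
```

In this sketch the other tokens place zero weight on the pruned token, so it no longer contributes to their updated features, yet the tensor shape is unchanged. Applying the mask as a multiplication rather than deleting the token is what would allow gradients to reach the keep decisions during training, provided those decisions come from a differentiable relaxation.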

Code Repositories

vision-sjtu/quadmamba

pytorch

Mentioned in GitHub

raoyongming/DynamicViT

Official

pytorch

Mentioned in GitHub

Benchmarks

Benchmark	Methodology	Metrics
efficient-vits-on-imagenet-1k-with-deit-s	DynamicViT (70%)	GFLOPs: 2.9 Top 1 Accuracy: 79.3
efficient-vits-on-imagenet-1k-with-deit-s	DynamicViT (80%)	GFLOPs: 3.4 Top 1 Accuracy: 79.8
efficient-vits-on-imagenet-1k-with-deit-s	DynamicViT (90%)	GFLOPs: 4.0 Top 1 Accuracy: 79.8
efficient-vits-on-imagenet-1k-with-lv-vit-s	DynamicViT (70%)	GFLOPs: 4.6 Top 1 Accuracy: 83.0
efficient-vits-on-imagenet-1k-with-lv-vit-s	DynamicViT (80%)	GFLOPs: 5.1 Top 1 Accuracy: 83.2
efficient-vits-on-imagenet-1k-with-lv-vit-s	DynamicViT (90%)	GFLOPs: 5.8 Top 1 Accuracy: 83.3
image-classification-on-imagenet	DynamicViT-LV-M/0.8	Number of params: 57.1M Top 1 Accuracy: 83.9
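The percentage labels in the table can be related to the "66% of the input tokens" figure in the abstract with a short calculation. The sketch below rests on two assumptions that this page does not state: that each label is a per-stage keep ratio, and that pruning is applied at three stages. The helper name `pruned_fraction` is hypothetical.

```python
def pruned_fraction(keep_ratio, num_stages=3):
    """Fraction of the input tokens removed after all pruning stages."""
    # Each stage keeps `keep_ratio` of the tokens that survived the previous one.
    return 1.0 - keep_ratio ** num_stages


for ratio in (0.7, 0.8, 0.9):
    print(f"keep ratio {ratio:.1f}: {pruned_fraction(ratio):.1%} of tokens pruned")
```

Under these assumptions a keep ratio of 0.7 removes about 65.7% of the tokens, which is consistent with the 66% quoted in the abstract.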
