DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification

Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, Cho-Jui Hsieh


Abstract

Attention is sparse in vision transformers. We observe that the final prediction in vision transformers is based on only a subset of the most informative tokens, which is sufficient for accurate image recognition. Based on this observation, we propose a dynamic token sparsification framework to prune redundant tokens progressively and dynamically based on the input. Specifically, we devise a lightweight prediction module to estimate the importance score of each token given the current features. The module is added to different layers to prune redundant tokens hierarchically. To optimize the prediction module in an end-to-end manner, we propose an attention masking strategy to differentiably prune a token by blocking its interactions with other tokens. Benefiting from the nature of self-attention, the unstructured sparse tokens are still hardware-friendly, which makes it easy for our framework to achieve an actual speed-up. By hierarchically pruning 66% of the input tokens, our method reduces FLOPs by 31%-37% and improves throughput by over 40%, while the drop in accuracy is within 0.5% for various vision transformers. Equipped with the dynamic token sparsification framework, DynamicViT models achieve very competitive complexity/accuracy trade-offs compared to state-of-the-art CNNs and vision transformers on ImageNet. Code is available at https://github.com/raoyongming/DynamicViT
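The attention-masking strategy described in the abstract can be sketched in a few lines. The following is a minimal NumPy illustration of the inference-time effect only, not the authors' PyTorch implementation (which additionally keeps the keep/prune decisions differentiable during training): a pruned token is blocked as an attention key, so it contributes nothing to any remaining token's output, mimicking its removal without changing tensor shapes.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention(q, k, v, keep_mask):
    """Self-attention where pruned tokens are blocked as keys.

    keep_mask: shape (N,), 1.0 = token kept, 0.0 = token pruned.
    A pruned token receives a -1e9 logit as a key, so after softmax
    no other token attends to it.
    """
    d = q.shape[-1]
    logits = (q @ k.T) / np.sqrt(d)
    # block interactions with pruned tokens before the softmax
    logits = np.where(keep_mask[None, :] > 0.5, logits, -1e9)
    attn = softmax(logits, axis=-1)
    return attn @ v, attn

# toy example: 4 tokens of dimension 8, prune the last token
rng = np.random.default_rng(0)
n, d = 4, 8
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
keep = np.array([1.0, 1.0, 1.0, 0.0])
out, attn = masked_attention(q, k, v, keep)
```

Because the mask acts inside the softmax rather than by physically dropping rows, the operation stays dense and hardware-friendly, which matches the abstract's point that the unstructured sparse tokens are still easy to accelerate.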

Code Repositories

raoyongming/DynamicViT (official, PyTorch)
vision-sjtu/quadmamba (PyTorch)

Benchmarks

Efficient ViTs on ImageNet-1K (with DeiT-S):
  DynamicViT (70%): GFLOPs 2.9, Top-1 Accuracy 79.3
  DynamicViT (80%): GFLOPs 3.4, Top-1 Accuracy 79.8
  DynamicViT (90%): GFLOPs 4.0, Top-1 Accuracy 79.8

Efficient ViTs on ImageNet-1K (with LV-ViT-S):
  DynamicViT (70%): GFLOPs 4.6, Top-1 Accuracy 83.0
  DynamicViT (80%): GFLOPs 5.1, Top-1 Accuracy 83.2
  DynamicViT (90%): GFLOPs 5.8, Top-1 Accuracy 83.3

Image Classification on ImageNet:
  DynamicViT-LV-M/0.8: Params 57.1M, Top-1 Accuracy 83.9
