HyperAI
Chasing Sparsity in Vision Transformers: An End-to-End Exploration

Tianlong Chen, Yu Cheng, Zhe Gan, Lu Yuan, Lei Zhang, Zhangyang Wang


Abstract

Vision transformers (ViTs) have recently received explosive popularity, but their enormous model sizes and training costs remain daunting. Conventional post-training pruning often incurs higher training budgets. In contrast, this paper aims to trim down both the training memory overhead and the inference complexity, without sacrificing the achievable accuracy. We carry out the first-of-its-kind comprehensive exploration, on taking a unified approach of integrating sparsity in ViTs "from end to end". Specifically, instead of training full ViTs, we dynamically extract and train sparse subnetworks, while sticking to a fixed small parameter budget. Our approach jointly optimizes model parameters and explores connectivity throughout training, ending up with one sparse network as the final output. The approach is seamlessly extended from unstructured to structured sparsity, the latter by considering to guide the prune-and-grow of self-attention heads inside ViTs. We further co-explore data and architecture sparsity for additional efficiency gains by plugging in a novel learnable token selector to adaptively determine the currently most vital patches. Extensive results on ImageNet with diverse ViT backbones validate the effectiveness of our proposals, which obtain significantly reduced computational cost and almost unimpaired generalization. Perhaps most surprisingly, we find that the proposed sparse (co-)training can sometimes improve the ViT accuracy rather than compromising it, making sparsity a tantalizing "free lunch". For example, our sparsified DeiT-Small at (5%, 50%) sparsity for (data, architecture) improves top-1 accuracy by 0.28%, while enjoying 49.32% FLOPs and 4.40% running time savings. Our codes are available at https://github.com/VITA-Group/SViTE.
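The connectivity exploration described in the abstract — training a sparse subnetwork at a fixed parameter budget while periodically pruning low-magnitude weights and regrowing new connections — can be sketched roughly as follows. This is a minimal illustration of a generic magnitude-prune / gradient-grow update (not the authors' implementation; the function name `prune_and_grow` and the NumPy setting are our own):

```python
import numpy as np

def prune_and_grow(weights, mask, grad, update_frac=0.1):
    """One connectivity update at a fixed parameter budget:
    deactivate the smallest-magnitude active weights, then
    activate the same number of currently inactive connections
    with the largest gradient magnitude."""
    active = np.flatnonzero(mask)
    inactive = np.flatnonzero(mask == 0)
    k = max(1, int(update_frac * active.size))
    k = min(k, inactive.size)  # cannot grow more than what is inactive

    # Prune: drop the k active weights with the smallest magnitude.
    drop = active[np.argsort(np.abs(weights.flat[active]))[:k]]
    mask.flat[drop] = 0
    weights.flat[drop] = 0.0

    # Grow: re-activate the k inactive positions with the largest
    # gradient magnitude; their weights start from zero.
    grow = inactive[np.argsort(-np.abs(grad.flat[inactive]))[:k]]
    mask.flat[grow] = 1
    return weights, mask
```

Because exactly k connections are dropped and k are grown from disjoint index sets, the number of active parameters — the budget — stays constant across updates, which is what distinguishes this dynamic sparse training from post-training pruning.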

Code Repositories

VITA-Group/SViTE (official, PyTorch)

Benchmarks

Benchmark: efficient-vits-on-imagenet-1k-with-deit-s
Methodology: S$^2$ViTE
Metrics: GFLOPs 3.2, Top-1 Accuracy 79.2

Benchmark: efficient-vits-on-imagenet-1k-with-deit-t
Methodology: S$^2$ViTE
Metrics: GFLOPs 0.9, Top-1 Accuracy 70.1
