8 months ago

Abstract

Although vision transformers (ViTs) have recently shown promising results in various computer vision tasks, their high computational cost limits their practical applications. Previous approaches that prune redundant tokens have demonstrated a good trade-off between performance and computation cost. Nevertheless, errors caused by pruning strategies can lead to significant information loss. Our quantitative experiments reveal that the impact of pruned tokens on performance is noticeable. To address this issue, we propose a novel joint Token Pruning & Squeezing module (TPS) for compressing vision transformers with higher efficiency. First, TPS adopts pruning to obtain the reserved and pruned token subsets. Second, TPS squeezes the information of the pruned tokens into some of the reserved tokens via unidirectional nearest-neighbor matching and similarity-based fusing steps. Our approach outperforms state-of-the-art methods under all token pruning intensities. In particular, when the computational budgets of DeiT-tiny and DeiT-small are shrunk to 35%, it improves accuracy by 1%-6% over the baselines on ImageNet classification. The proposed method can accelerate the throughput of DeiT-small beyond that of DeiT-tiny, while its accuracy surpasses DeiT-tiny by 4.78%. Experiments on various transformers demonstrate the effectiveness of our method, and analysis experiments show that it is more robust to errors in the token pruning policy. Code is available at https://github.com/megvii-research/TPS-CVPR2023.
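
The two steps described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation (see the repository linked above for that). The function name `prune_and_squeeze`, the `scores` and `keep` parameters, the use of cosine similarity for matching, and the exp(similarity) fusion weights are all assumptions made for illustration only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def prune_and_squeeze(tokens, scores, keep):
    """Hypothetical helper: keep the `keep` highest-scoring tokens and fold
    every pruned token into its most similar reserved token."""
    if not 1 <= keep <= len(tokens):
        raise ValueError("keep must be between 1 and len(tokens)")

    # Step 1 (pruning): split token indices into reserved and pruned subsets.
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    reserved_idx = sorted(order[:keep])
    pruned_idx = order[keep:]

    # Each reserved token starts with its own weighted vector.
    # Assumption: a token's similarity to itself is 1, so its weight is exp(1).
    acc = {j: ([math.e * x for x in tokens[j]], math.e) for j in reserved_idx}

    # Step 2 (squeezing): unidirectional matching, pruned -> reserved only.
    for i in pruned_idx:
        host = max(reserved_idx, key=lambda j: cosine(tokens[i], tokens[j]))
        w = math.exp(cosine(tokens[i], tokens[host]))  # assumed fusion weight
        vec, total = acc[host]
        acc[host] = ([v + w * x for v, x in zip(vec, tokens[i])], total + w)

    # Similarity-weighted average per reserved token.
    fused = []
    for j in reserved_idx:
        vec, total = acc[j]
        fused.append([v / total for v in vec])
    return fused


tokens = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.0, 2.0]]
scores = [0.9, 0.8, 0.1, 0.2]  # hypothetical importance scores
fused = prune_and_squeeze(tokens, scores, keep=2)
```

In this toy input, the two low-scoring tokens are not discarded: each is merged into whichever reserved token it most resembles, so the output has `keep` tokens that still carry some of the pruned information. A reserved token that no pruned token selects is returned unchanged, which is one way to read "partial reserved tokens" in the abstract.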
