Multi-criteria Token Fusion with One-step-ahead Attention for Efficient Vision Transformers

Lee Sanghyeok ; Choi Joonmyung ; Kim Hyunwoo J.

Abstract

Vision Transformer (ViT) has emerged as a prominent backbone for computer vision. For more efficient ViTs, recent works lessen the quadratic cost of the self-attention layer by pruning or fusing the redundant tokens. However, these works faced the speed-accuracy trade-off caused by the loss of information. Here, we argue that token fusion needs to consider diverse relations between tokens to minimize information loss. In this paper, we propose a Multi-criteria Token Fusion (MCTF), that gradually fuses the tokens based on multi-criteria (e.g., similarity, informativeness, and size of fused tokens). Further, we utilize the one-step-ahead attention, which is the improved approach to capture the informativeness of the tokens. By training the model equipped with MCTF using a token reduction consistency, we achieve the best speed-accuracy trade-off in the image classification (ImageNet1K). Experimental results prove that MCTF consistently surpasses the previous reduction methods with and without training. Specifically, DeiT-T and DeiT-S with MCTF reduce FLOPs by about 44% while improving the performance (+0.5%, and +0.3%) over the base model, respectively. We also demonstrate the applicability of MCTF in various Vision Transformers (e.g., T2T-ViT, LV-ViT), achieving at least 31% speedup without performance degradation. Code is available at https://github.com/mlvlab/MCTF.
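
To make the idea of fusing tokens under several criteria concrete, here is a minimal pure-Python sketch. It is not the authors' implementation (see the official repository for that). The scoring formula, the greedy all-pairs search, and the names `pair_score`, `fuse_once`, and `fuse` are all assumptions chosen for illustration. The informativeness scores are taken as given inputs in [0, 1]; how they are derived from attention is outside this sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-12)

def pair_score(tokens, info, size, i, j):
    """Hypothetical combined score: higher means 'better to fuse'.

    Assumed criteria (illustrative, not the paper's exact formula):
      - similarity: similar tokens are preferred
      - informativeness: uninformative tokens are preferred
      - size: tokens that already absorbed many others are penalised
    """
    sim = 0.5 * (cosine(tokens[i], tokens[j]) + 1.0)      # mapped to [0, 1]
    uninformative = (1.0 - info[i]) * (1.0 - info[j])     # info in [0, 1]
    small = 1.0 / (size[i] * size[j])
    return sim * uninformative * small

def fuse_once(tokens, info, size):
    """Fuse the single best-scoring pair with a size-weighted average."""
    n = len(tokens)
    pairs = ((i, j) for i in range(n) for j in range(i + 1, n))
    i, j = max(pairs, key=lambda p: pair_score(tokens, info, size, *p))
    total = size[i] + size[j]
    merged = [(size[i] * a + size[j] * b) / total
              for a, b in zip(tokens[i], tokens[j])]
    merged_info = (size[i] * info[i] + size[j] * info[j]) / total
    keep = [k for k in range(n) if k not in (i, j)]
    return ([tokens[k] for k in keep] + [merged],
            [info[k] for k in keep] + [merged_info],
            [size[k] for k in keep] + [total])

def fuse(tokens, info, size, r):
    """Reduce the token count by up to r, one pair at a time."""
    for _ in range(r):
        if len(tokens) < 2:
            break
        tokens, info, size = fuse_once(tokens, info, size)
    return tokens, info, size

# Two near-duplicate, low-information tokens and two identical,
# high-information tokens; all start with size 1.
tokens = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.0, 1.0]]
info = [0.1, 0.1, 0.9, 0.9]
size = [1, 1, 1, 1]
fused_tokens, fused_info, fused_size = fuse(tokens, info, size, r=1)
```

In this toy input the two identical tokens are the most similar pair, but the informativeness term steers the fusion toward the near-duplicate low-information pair instead, which is the kind of interaction between criteria the abstract argues for.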

Code Repositories

mlvlab/mctf (official, PyTorch)

Benchmarks

Benchmark                                    Methodology      GFLOPs   Top 1 Accuracy (%)
Efficient ViTs on ImageNet-1K with DeiT-S    MCTF ($r=18$)    2.4      79.9
Efficient ViTs on ImageNet-1K with DeiT-S    MCTF ($r=20$)    2.2      79.5
Efficient ViTs on ImageNet-1K with DeiT-S    MCTF ($r=16$)    2.6      80.1
Efficient ViTs on ImageNet-1K with DeiT-T    MCTF ($r=20$)    0.6      71.4
Efficient ViTs on ImageNet-1K with DeiT-T    MCTF ($r=8$)     1.0      72.9
Efficient ViTs on ImageNet-1K with DeiT-T    MCTF ($r=16$)    0.7      72.7
Efficient ViTs on ImageNet-1K with LV-ViT-S  MCTF ($r=16$)    3.6      82.3
Efficient ViTs on ImageNet-1K with LV-ViT-S  MCTF ($r=8$)     4.9      83.5
Efficient ViTs on ImageNet-1K with LV-ViT-S  MCTF ($r=12$)    4.2      83.4