Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, Judy Hoffman

Abstract
We introduce Token Merging (ToMe), a simple method to increase the throughput of existing ViT models without needing to train. ToMe gradually combines similar tokens in a transformer using a general and light-weight matching algorithm that is as fast as pruning while being more accurate. Off-the-shelf, ToMe can 2x the throughput of state-of-the-art ViT-L @ 512 and ViT-H @ 518 models on images and 2.2x the throughput of ViT-L on video with only a 0.2-0.3% accuracy drop in each case. ToMe can also easily be applied during training, improving in practice training speed up to 2x for MAE fine-tuning on video. Training with ToMe further minimizes accuracy drop, leading to 2x the throughput of ViT-B on audio for only a 0.4% mAP drop. Qualitatively, we find that ToMe merges object parts into one token, even over multiple frames of video. Overall, ToMe's accuracy and speed are competitive with state-of-the-art on images, video, and audio.
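The abstract describes merging similar tokens with a lightweight matching step. The core idea can be sketched as a bipartite matching: split the tokens into two sets, score pairs by similarity, and average together the r most similar pairs. The sketch below is an illustrative NumPy version under simple assumptions (unweighted averaging, cosine similarity on the token features themselves), not the authors' implementation.

```python
import numpy as np

def bipartite_soft_matching(tokens: np.ndarray, r: int) -> np.ndarray:
    """Merge the r most similar token pairs; returns N - r tokens.

    tokens: (N, C) array of token features.
    Illustrative sketch only: merges use a plain average rather than a
    size-weighted one, and similarity is computed on the raw features.
    """
    # Alternately split the tokens into two sets, A and B.
    a, b = tokens[0::2].copy(), tokens[1::2].copy()

    # Cosine similarity between every token in A and every token in B.
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    scores = a_n @ b_n.T  # shape (|A|, |B|)

    # For each A token, find its best match in B.
    best_b = scores.argmax(axis=1)
    best_score = scores.max(axis=1)

    # Merge the r A-tokens whose best match is most similar.
    merged = np.argsort(-best_score)[:r]
    kept = np.setdiff1d(np.arange(len(a)), merged)
    for i in merged:
        b[best_b[i]] = (b[best_b[i]] + a[i]) / 2  # average into the match

    return np.concatenate([a[kept], b], axis=0)
```

Because each layer removes only r tokens, merging is gradual across the network, which is why (per the abstract) the method stays fast like pruning while losing little accuracy.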
Benchmarks
| Benchmark | Method | GFLOPs | Top-1 Accuracy (%) |
|---|---|---|---|
| efficient-vits-on-imagenet-1k-with-deit-s | ToMe ($r=8$) | 3.4 | 79.7 |
| efficient-vits-on-imagenet-1k-with-deit-s | ToMe ($r=13$) | 2.7 | 79.4 |
| efficient-vits-on-imagenet-1k-with-deit-s | ToMe ($r=16$) | 2.3 | 79.1 |
| efficient-vits-on-imagenet-1k-with-deit-t | ToMe ($r=8$) | 0.9 | 71.7 |
| efficient-vits-on-imagenet-1k-with-deit-t | ToMe ($r=12$) | 0.8 | 71.4 |
| efficient-vits-on-imagenet-1k-with-deit-t | ToMe ($r=16$) | 0.6 | 70.7 |