
All Tokens Matter: Token Labeling for Training Better Vision Transformers

Zihang Jiang, Qibin Hou, Li Yuan, Daquan Zhou, Yujun Shi, Xiaojie Jin, Anran Wang, Jiashi Feng

Abstract

In this paper, we present token labeling -- a new training objective for training high-performance vision transformers (ViTs). Different from the standard training objective of ViTs that computes the classification loss on an additional trainable class token, our proposed one takes advantage of all the image patch tokens to compute the training loss in a dense manner. Specifically, token labeling reformulates the image classification problem into multiple token-level recognition problems and assigns each patch token an individual location-specific supervision generated by a machine annotator. Experiments show that token labeling can clearly and consistently improve the performance of various ViT models across a wide spectrum. For a vision transformer with 26M learnable parameters serving as an example, with token labeling, the model can achieve 84.4% Top-1 accuracy on ImageNet. The result can be further increased to 86.4% by slightly scaling the model size up to 150M, delivering the minimal-sized model among previous models (250M+) reaching 86%. We also show that token labeling can clearly improve the generalization capability of the pre-trained models on downstream tasks with dense prediction, such as semantic segmentation. Our code and all the training details will be made publicly available at https://github.com/zihangJiang/TokenLabeling.
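The objective described in the abstract -- the usual classification loss on the class token plus a dense, per-patch loss against location-specific soft labels from a machine annotator -- can be sketched as follows. This is a minimal illustration, not the authors' exact implementation; the weighting factor `beta` and the shape of the annotator targets are assumptions.

```python
import torch
import torch.nn.functional as F

def token_labeling_loss(cls_logits, patch_logits, cls_target, token_targets, beta=0.5):
    """Sketch of a token-labeling objective.

    cls_logits:    (B, C) logits from the trainable class token.
    patch_logits:  (B, N, C) logits from the N image patch tokens.
    cls_target:    (B,) ground-truth image-level labels.
    token_targets: (B, N, C) soft, location-specific labels produced by a
                   machine annotator (assumed shape for this sketch).
    beta:          weight of the auxiliary token-level term (hypothetical value).
    """
    # Standard image-level classification loss on the class token.
    cls_loss = F.cross_entropy(cls_logits, cls_target)

    # Dense token-level recognition loss: soft cross-entropy per patch token,
    # averaged over all tokens and the batch -- "all tokens matter".
    log_probs = F.log_softmax(patch_logits, dim=-1)
    token_loss = -(token_targets * log_probs).sum(dim=-1).mean()

    return cls_loss + beta * token_loss
```

In training, `token_targets` would come from a pre-trained annotator network evaluated on the same image crops, so each patch receives supervision tied to its spatial location rather than a single global label.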

Code Repositories

- zihangJiang/TokenLabeling (official, PyTorch)
- sail-sg/dualformer (PyTorch)
- naver-ai/vidt (PyTorch)
- zhoudaquan/Refiner_ViT (PyTorch)
- catalpaaa/demansia (PyTorch)
- flytocc/TokenLabeling-paddle (Paddle)

Benchmarks

| Benchmark | Methodology | GFLOPs | Params | Top-1 Accuracy | Validation mIoU |
|---|---|---|---|---|---|
| efficient-vits-on-imagenet-1k-with-lv-vit-s | Base (LV-ViT-S) | 6.6 | - | 83.3% | - |
| image-classification-on-imagenet | LV-ViT-S | 6.6 | 26M | 83.3% | - |
| image-classification-on-imagenet | LV-ViT-M | 16 | 56M | 84.1% | - |
| image-classification-on-imagenet | LV-ViT-L | 214.8 | 151M | 86.4% | - |
| semantic-segmentation-on-ade20k | LV-ViT-L (UperNet, MS) | - | 209M | - | 51.8 |
