Training data-efficient image transformers & distillation through attention
Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou

Abstract
Recently, neural networks purely based on attention were shown to address image understanding tasks such as image classification. However, these visual transformers are pre-trained with hundreds of millions of images using an expensive infrastructure, thereby limiting their adoption. In this work, we produce a competitive convolution-free transformer by training on ImageNet only. We train them on a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop evaluation) on ImageNet with no external data. More importantly, we introduce a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention. We show the interest of this token-based distillation, especially when using a convnet as a teacher. This leads us to report results competitive with convnets for both ImageNet (where we obtain up to 85.2% accuracy) and when transferring to other tasks. We share our code and models.
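The teacher-student strategy above supervises two output tokens: the class token against the ground-truth label and the distillation token against the teacher's prediction. As a minimal sketch of the hard-label variant of this objective, here is a NumPy illustration; the function names and the toy logits are ours, not the authors' implementation.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the last axis
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, labels):
    # mean negative log-likelihood of the target class
    p = softmax(logits)
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))

def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, labels):
    """Hard-label distillation as described in the abstract (our sketch):
    the class token is supervised by the ground-truth label, the
    distillation token by the teacher's argmax prediction, and the two
    cross-entropy terms are averaged with equal weight."""
    teacher_labels = teacher_logits.argmax(axis=-1)
    return 0.5 * cross_entropy(cls_logits, labels) \
         + 0.5 * cross_entropy(dist_logits, teacher_labels)

# Toy example: two samples, two classes, student and teacher in agreement.
cls_logits = np.array([[10.0, 0.0], [0.0, 10.0]])
dist_logits = cls_logits.copy()
teacher_logits = cls_logits.copy()
labels = np.array([0, 1])
loss = hard_distillation_loss(cls_logits, dist_logits, teacher_logits, labels)
```

When the teacher disagrees with the ground truth, the two terms pull the student toward different targets, which is the mechanism that lets a convnet teacher transfer its inductive biases to the transformer student.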
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| document-image-classification-on-rvl-cdip | DeiT-B | Accuracy: 90.32% Parameters: 87M |
| document-layout-analysis-on-publaynet-val | DeiT-B | Figure: 0.957 List: 0.921 Overall: 0.932 Table: 0.972 Text: 0.934 Title: 0.874 |
| efficient-vits-on-imagenet-1k-with-deit-s | Base (DeiT-S) | GFLOPs: 4.6 Top 1 Accuracy: 79.8 |
| efficient-vits-on-imagenet-1k-with-deit-t | Base (DeiT-T) | GFLOPs: 1.2 Top 1 Accuracy: 72.2 |
| fine-grained-image-classification-on-oxford | DeiT-B | Accuracy: 98.8% Params: 86M |
| fine-grained-image-classification-on-stanford | DeiT-B | Accuracy: 93.3% Params: 86M |
| image-classification-on-cifar-10 | DeiT-B | Percentage correct: 99.1 |
| image-classification-on-cifar-100 | DeiT-B | Params: 86M Percentage correct: 90.8 |
| image-classification-on-flowers-102 | DeiT-B | Accuracy: 98.8% Params: 86M |
| image-classification-on-imagenet | DeiT-B | Number of params: 86M Top 1 Accuracy: 84.2% |
| image-classification-on-imagenet | DeiT-B 384 | Number of params: 87M Top 1 Accuracy: 85.2% |
| image-classification-on-imagenet | DeiT-Ti | Number of params: 5M Top 1 Accuracy: 76.6% |
| image-classification-on-imagenet | DeiT-S | Number of params: 22M Top 1 Accuracy: 82.6% |
| image-classification-on-imagenet-real | DeiT-Ti | Accuracy: 82.1% Params: 5M |
| image-classification-on-imagenet-real | DeiT-B | Accuracy: 88.7% Params: 86M |
| image-classification-on-imagenet-real | DeiT-S | Accuracy: 86.8% Params: 22M |
| image-classification-on-imagenet-real | DeiT-B-384 | Accuracy: 89.3% Params: 86M |
| image-classification-on-inaturalist-2018 | DeiT-B | Top-1 Accuracy: 79.5% |