Training data-efficient image transformers & distillation through attention
Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou
Abstract
Recently, neural networks purely based on attention were shown to address image understanding tasks such as image classification. However, these visual transformers are pre-trained with hundreds of millions of images using an expensive infrastructure, thereby limiting their adoption. In this work, we produce a competitive convolution-free transformer by training on ImageNet only. We train them on a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop evaluation) on ImageNet with no external data. More importantly, we introduce a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention. We show the interest of this token-based distillation, especially when using a convnet as a teacher. This leads us to report results competitive with convnets for both ImageNet (where we obtain up to 85.2% accuracy) and when transferring to other tasks. We share our code and models.
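The abstract does not spell out the training objective, so the following is only a minimal sketch of how a token-based teacher-student loss could look. It assumes a hard-label variant: the class-token head is trained on the ground-truth label, and the distillation-token head is trained on the teacher's predicted label. The function names, the equal 0.5 weighting, and the use of plain Python lists for logits are illustrative assumptions, not details taken from the text above.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Cross-entropy of one sample against an integer class index."""
    return -math.log(softmax(logits)[target])


def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, label):
    """Hypothetical hard-label distillation objective (an assumption).

    cls_logits:     student output read from the class token
    dist_logits:    student output read from the distillation token
    teacher_logits: output of the (frozen) teacher, e.g. a convnet
    label:          ground-truth class index
    """
    # The teacher's hard decision is its highest-scoring class.
    teacher_label = max(range(len(teacher_logits)), key=teacher_logits.__getitem__)
    loss_cls = cross_entropy(cls_logits, label)           # supervised by the true label
    loss_dist = cross_entropy(dist_logits, teacher_label)  # supervised by the teacher
    # Equal weighting of the two terms is an illustrative choice.
    return 0.5 * loss_cls + 0.5 * loss_dist


if __name__ == "__main__":
    loss = hard_distillation_loss(
        cls_logits=[2.0, 0.5, 0.1],
        dist_logits=[0.2, 0.3, 1.5],
        teacher_logits=[0.1, 0.4, 3.0],
        label=0,
    )
    print(f"combined loss: {loss:.4f}")
```

In this sketch the two heads receive different targets, which is what lets the distillation token carry the teacher's signal separately from the class token; in a real transformer both tokens would interact with the patch tokens through the attention layers before reaching their heads.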