Xiaohua Zhai, Joan Puigcerver, Alexander Kolesnikov, Pierre Ruyssen, Carlos Riquelme, Mario Lucic, Josip Djolonga, Andre Susano Pinto, Maxim Neumann, Alexey Dosovitskiy, Lucas Beyer, Olivier Bachem, Michael Tschannen, Marcin Michalski, Olivier Bousquet, Sylvain Gelly, Neil Houlsby

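Abstract
Representation learning promises to bring deep learning to the long tail of vision tasks without requiring expensive labelled datasets. However, the lack of a unified evaluation standard for general visual representations severely hinders progress in the field. Popular evaluation protocols are often too constrained (e.g., linear classification), limited in diversity (e.g., relying only on datasets such as ImageNet, CIFAR, and Pascal-VOC), or only weakly related to representation quality (e.g., ELBO, reconstruction error). To address this, we present the Visual Task Adaptation Benchmark (VTAB), whose core idea is that good representations should adapt quickly to diverse, unseen tasks with few examples. With VTAB, we conduct a large-scale, systematic study of many widely used, publicly available representation learning algorithms, carefully controlling for confounders such as model architecture and tuning budget. Using the benchmark, we examine several key questions: How well do ImageNet-pretrained representations perform beyond standard natural image datasets? How do representations learned by generative and discriminative models differ? To what extent can self-supervised learning replace human labels? And how far are we from general visual representations?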
Code repositories
- google-research/task_adaptation (official; TensorFlow; mentioned in GitHub)
- facebookresearch/vissl (PyTorch; mentioned in GitHub)
Benchmarks
| Benchmark | Method | Metric |
|---|---|---|
| image-classification-on-vtab-1k-1 | SelfSup-RelativePatchLoc-ResNet50 | Top-1 Accuracy: 50.8 |
| image-classification-on-vtab-1k-1 | BigBiGAN-ResNet50 | Top-1 Accuracy: 59.1 |
| image-classification-on-vtab-1k-1 | ResNet50-LargeHyperSweep | Top-1 Accuracy: 59.2 |
| image-classification-on-vtab-1k-1 | SelfSup-Rotation-ResNet50 | Top-1 Accuracy: 59.5 |
| image-classification-on-vtab-1k-1 | Conditional-BigGAN | Top-1 Accuracy: 35.3 |
| image-classification-on-vtab-1k-1 | SelfSup-Jigsaw-ResNet50 | Top-1 Accuracy: 51.1 |
| image-classification-on-vtab-1k-1 | ImageNet-ResNet50-LargeHyperSweep | Top-1 Accuracy: 71.2 |
| image-classification-on-vtab-1k-1 | ResNet50 | Top-1 Accuracy: 42.1 |
| image-classification-on-vtab-1k-1 | S4L-10%-Exemplar-ResNet50 | Top-1 Accuracy: 63.9 |
| image-classification-on-vtab-1k-1 | SelfSup-Exemplar-ResNet50 | Top-1 Accuracy: 57.5 |
| image-classification-on-vtab-1k-1 | VAE | Top-1 Accuracy: 37.5 |
| image-classification-on-vtab-1k-1 | ImageNet-10%-ResNet50 | Top-1 Accuracy: 61.6 |
| image-classification-on-vtab-1k-1 | S4L-Rotation-ResNet50-LargeHyperSweep | Top-1 Accuracy: 71.5 |
| image-classification-on-vtab-1k-1 | WAE-UKL | Top-1 Accuracy: 31.0 |
| image-classification-on-vtab-1k-1 | WAE-GAN | Top-1 Accuracy: 32.0 |
| image-classification-on-vtab-1k-1 | ImageNet-ResNet50 | Top-1 Accuracy: 65.6 |
| image-classification-on-vtab-1k-1 | S4L-Exemplar-ResNet50 | Top-1 Accuracy: 67.0 |
| image-classification-on-vtab-1k-1 | WAE-MMD | Top-1 Accuracy: 37.3 |
| image-classification-on-vtab-1k-1 | S4L-Exemplar-ResNet50-LargeHyperSweep | Top-1 Accuracy: 72.7 |
| image-classification-on-vtab-1k-1 | Unconditional-BigGAN-ResNet50 | Top-1 Accuracy: 44.0 |
| image-classification-on-vtab-1k-1 | S4L-10%-Rotation-ResNet50 | Top-1 Accuracy: 64.8 |
| image-classification-on-vtab-1k-1 | S4L-Rotation-ResNet50 | Top-1 Accuracy: 67.5 |