
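Abstract
Foundation vision-language models have enabled remarkable zero-shot transfer of pre-trained representations to a wide range of downstream tasks. However, to solve a new task, zero-shot transfer still requires human guidance to define the visual categories that appear in the data. Here, we show that fully unsupervised transfer emerges when searching for the labeling of a dataset that induces maximal-margin classifiers in the representation spaces of different foundation models. We present TURTLE, a fully unsupervised method that effectively employs this guiding principle to uncover the underlying labeling of a downstream dataset without any supervision or task-specific representation learning. We evaluate TURTLE on a diverse benchmark suite of 26 datasets and show that it achieves new state-of-the-art unsupervised performance. Furthermore, although TURTLE is fully unsupervised, it outperforms zero-shot transfer baselines on several datasets. In particular, when using the same representation space, across a range of architectures and model sizes, TURTLE matches the average performance of CLIP zero-shot transfer on the 26 datasets. By using the representation spaces of two foundation models to guide the search for the underlying labeling, TURTLE surpasses both zero-shot transfer and unsupervised prompt-tuning baselines, demonstrating the surprising power and effectiveness of unsupervised transfer.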
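The guiding principle in the abstract, searching for the labeling that induces a maximal-margin classifier, can be illustrated with a deliberately tiny toy. The sketch below is not TURTLE's algorithm (this summary does not describe its optimization procedure); it assumes hypothetical 1-D features and two classes, where every linearly separable labeling is a threshold split, so the search reduces to finding the widest gap between sorted points. The function name `max_margin_labeling_1d` and the example values are made up for illustration.

```python
def max_margin_labeling_1d(xs):
    """Toy search over binary labelings of 1-D points (illustration only).

    In one dimension every linearly separable labeling is a threshold split
    of the sorted points, and the margin of the best separator for a split
    is half the gap between the two points adjacent to the threshold.
    """
    if len(xs) < 2:
        raise ValueError("need at least two points")

    # Indices of the points in ascending order of feature value.
    order = sorted(range(len(xs)), key=lambda i: xs[i])

    best_margin, best_split = -1.0, None
    for k in range(1, len(order)):
        left, right = xs[order[k - 1]], xs[order[k]]
        margin = (right - left) / 2.0
        if margin > best_margin:
            best_margin, best_split = margin, k

    # Points on the right of the chosen split get label 1, the rest label 0.
    labels = [0] * len(xs)
    for i in order[best_split:]:
        labels[i] = 1

    threshold = (xs[order[best_split - 1]] + xs[order[best_split]]) / 2.0
    return labels, threshold, best_margin


# Hypothetical 1-D "features" forming two well-separated groups.
features = [0.0, 0.5, 0.25, 4.0, 5.0, 4.5]
labels, threshold, margin = max_margin_labeling_1d(features)
print(labels, threshold, margin)
```

With real foundation-model features the space is high-dimensional and there are many classes, so exhaustive enumeration like this is not feasible; the toy only shows why a labeling that admits a wide-margin separator tends to coincide with natural groupings in the representation.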
Code repository
mlbio-epfl/turtle (official, PyTorch)
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| image-clustering-on-birdsnap | TURTLE (CLIP + DINOv2) | Accuracy: 68.1 |
| image-clustering-on-caltech-101 | TURTLE (CLIP + DINOv2) | Accuracy: 89.8 |
| image-clustering-on-cifar-10 | TURTLE (CLIP + DINOv2) | ARI: 98.9 Accuracy: 99.5 NMI: 98.5 |
| image-clustering-on-cifar-100 | TURTLE (CLIP + DINOv2) | ARI: 83.4 Accuracy: 89.8 NMI: 91.5 |
| image-clustering-on-clevr-counts | TURTLE (CLIP + DINOv2) | Accuracy: 24.0 |
| image-clustering-on-country211 | TURTLE (CLIP + DINOv2) | Accuracy: 11.1 |
| image-clustering-on-dtd | TURTLE (CLIP + DINOv2) | Accuracy: 57.3 |
| image-clustering-on-eurosat | TURTLE (CLIP + DINOv2) | Accuracy: 96.6 |
| image-clustering-on-fer2013 | TURTLE (CLIP + DINOv2) | Accuracy: 36.2 |
| image-clustering-on-fgvc-aircraft | TURTLE (CLIP + DINOv2) | Accuracy: 36.5 |
| image-clustering-on-flowers-102 | TURTLE (CLIP + DINOv2) | Accuracy: 99.6 |
| image-clustering-on-food-101 | TURTLE (CLIP + DINOv2) | Accuracy: 92.2 |
| image-clustering-on-gtsrb | TURTLE (CLIP + DINOv2) | Accuracy: 48.4 |
| image-clustering-on-hateful-memes | TURTLE (CLIP + DINOv2) | Accuracy: 54.2 |
| image-clustering-on-imagenet | TURTLE (CLIP + DINOv2) | ARI: 62.5 Accuracy: 72.9 NMI: 88.2 |
| image-clustering-on-kinetics-700 | TURTLE (CLIP + DINOv2) | Accuracy: 43.0 |
| image-clustering-on-kitti | TURTLE (CLIP + DINOv2) | Accuracy: 39.4 |
| image-clustering-on-mnist | TURTLE (CLIP + DINOv2) | Accuracy: 97.8 |
| image-clustering-on-oxford-iiit-pets | TURTLE (CLIP + DINOv2) | Accuracy: 92.3 |
| image-clustering-on-pcam | TURTLE (CLIP + DINOv2) | Accuracy: 52.0 |
| image-clustering-on-rendered-sst2 | TURTLE (CLIP + DINOv2) | Accuracy: 51.6 |
| image-clustering-on-resisc45 | TURTLE (CLIP + DINOv2) | Accuracy: 89.6 |
| image-clustering-on-stanford-cars | TURTLE (CLIP + DINOv2) | Accuracy: 64.6 |
| image-clustering-on-stl-10 | TURTLE (CLIP + DINOv2) | ARI: 99.4 Accuracy: 99.7 NMI: 99.3 |
| image-clustering-on-sun397 | TURTLE (CLIP + DINOv2) | Accuracy: 67.9 |
| image-clustering-on-ucf101 | TURTLE (CLIP + DINOv2) | Accuracy: 82.3 |
| unsupervised-image-classification-on-cifar-10 | TURTLE (CLIP + DINOv2) | Accuracy: 99.5 |
| unsupervised-image-classification-on-imagenet | TURTLE (CLIP + DINOv2) | ARI: 62.5 Accuracy (%): 72.9 |
| unsupervised-image-classification-on-mnist | TURTLE (CLIP + DINOv2) | Accuracy: 97.8 |
| unsupervised-image-classification-on-stl-10 | TURTLE (CLIP + DINOv2) | Accuracy: 99.7 |
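
The accuracy figures in this table are clustering accuracies. Because an unsupervised method outputs arbitrary cluster IDs, accuracy is typically computed after finding the best one-to-one matching between clusters and ground-truth classes. The sketch below is an assumption about that common convention, not the repository's evaluation code. It brute-forces all matchings with the standard library, which is only practical for a handful of classes; real evaluations generally solve the matching with the Hungarian algorithm, for example `scipy.optimize.linear_sum_assignment`. The function name `clustering_accuracy` and the example labels are hypothetical.

```python
from collections import Counter
from itertools import permutations


def clustering_accuracy(true_labels, cluster_ids):
    """Fraction of samples correct under the best one-to-one
    cluster-to-class matching (brute force, small class counts only)."""
    if len(true_labels) != len(cluster_ids) or not true_labels:
        raise ValueError("inputs must be non-empty and of equal length")

    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))

    # counts[(cluster, cls)] = number of samples in that cluster with that class.
    counts = Counter(zip(cluster_ids, true_labels))

    # Pad the shorter side with None so a one-to-one matching always exists;
    # pairs involving None never occur in the data and contribute zero.
    k = max(len(classes), len(clusters))
    classes += [None] * (k - len(classes))
    clusters += [None] * (k - len(clusters))

    best = 0
    for perm in permutations(classes):
        matched = sum(counts[(c, y)] for c, y in zip(clusters, perm))
        best = max(best, matched)
    return best / len(true_labels)


# Hypothetical example: cluster IDs are a relabeling of the classes,
# with one sample of class 2 placed in the wrong cluster.
true_labels = [0, 0, 0, 1, 1, 2, 2, 2]
cluster_ids = [2, 2, 2, 0, 0, 1, 1, 0]
acc = clustering_accuracy(true_labels, cluster_ids)
print(acc)
```

The function returns a fraction in [0, 1]; multiply by 100 to express it as a percentage. ARI and NMI, also listed in the table, do not need this matching step because they are invariant to how clusters are numbered.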