
Abstract
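Despite recent progress by self-supervised methods in representation learning with residual networks (ResNets), they still underperform supervised learning on the ImageNet classification benchmark, which limits their use in performance-critical settings. Building on prior theoretical insights from ReLIC [Mitrovic et al., 2021], we add further inductive biases to self-supervised learning. We propose a new self-supervised representation learning method, ReLICv2, which combines an explicit invariance loss with a contrastive objective over a varied set of appropriately constructed data views, in order to avoid learning spurious correlations and obtain more informative representations. Under linear evaluation with a ResNet50, ReLICv2 achieves 77.1% top-1 accuracy on ImageNet, an absolute improvement of +1.5% over the previous state of the art; on larger ResNet models, ReLICv2 reaches up to 80.6%, outperforming previous self-supervised approaches by margins of up to +2.3%. Most notably, ReLICv2 is the first unsupervised representation learning method to consistently outperform the supervised baseline in a like-for-like comparison across a range of ResNet architectures. The representations learned with ReLICv2 are also more robust and transferable than those of previous work, and generalize better out-of-distribution on both image classification and semantic segmentation. Finally, we show that despite using ResNet encoders, ReLICv2 is comparable to state-of-the-art self-supervised vision transformers.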
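To make the idea of "an explicit invariance loss combined with a contrastive objective" concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes two views of the same batch, cosine similarity with a temperature, an InfoNCE-style contrastive term, and a KL divergence between the two views' similarity distributions as the invariance term. The function names and the `alpha`, `beta`, and `temperature` parameters are hypothetical and chosen only for illustration; how the views are constructed is left out entirely.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_distributions(anchors, targets, temperature=0.1):
    """Row i is a softmax over similarities of anchors[i] to every target."""
    return [
        softmax([cosine(a, t) / temperature for t in targets])
        for a in anchors
    ]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def contrastive_plus_invariance(view_a, view_b, alpha=1.0, beta=1.0,
                                temperature=0.1):
    """Illustrative combined loss (an assumption, not the published objective).

    view_a[i] and view_b[i] are embeddings of two views of the same image.
    Returns (total, contrastive_term, invariance_term).
    """
    p_ab = similarity_distributions(view_a, view_b, temperature)
    p_ba = similarity_distributions(view_b, view_a, temperature)
    n = len(view_a)
    # Contrastive term: the matching view (index i) should get high probability.
    contrastive = -sum(math.log(p_ab[i][i]) for i in range(n)) / n
    # Invariance term: both views should induce the same similarity distribution.
    invariance = sum(kl_divergence(p_ab[i], p_ba[i]) for i in range(n)) / n
    return alpha * contrastive + beta * invariance, contrastive, invariance
```

When both views yield identical embeddings, the invariance term vanishes and only the contrastive term remains; setting `beta=0` reduces the sketch to a purely contrastive loss. A real training setup would compute this on encoder outputs with automatic differentiation, for example in JAX.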
Code repositories
google-deepmind/relicv2
Official
jax
Mentioned in GitHub
Benchmarks
| Benchmark | Method | Metric |
|---|---|---|
| image-classification-on-objectnet | SimCLR | Top-1 Accuracy: 14.6 |
| image-classification-on-objectnet | ReLICv2 | Top-1 Accuracy: 25.9 |
| image-classification-on-objectnet | ReLIC | Top-1 Accuracy: 23.8 |
| image-classification-on-objectnet | BYOL | Top-1 Accuracy: 23 |
| self-supervised-image-classification-on | ReLICv2 (ResNet101) | Number of Params: 44M Top 1 Accuracy: 78.7% |
| self-supervised-image-classification-on | ReLICv2 (ResNet-200 x2) | Number of Params: 250M Top 1 Accuracy: 80.6% |
| self-supervised-image-classification-on | ReLICv2 (ResNet-50) | Number of Params: 25M Top 1 Accuracy: 77.1% |
| self-supervised-image-classification-on | ReLICv2 (ResNet200) | Number of Params: 63M Top 1 Accuracy: 79.8% |
| self-supervised-image-classification-on | ReLICv2 (ResNet-50 4x) | Number of Params: 375M Top 1 Accuracy: 79.4% |
| self-supervised-image-classification-on | ReLICv2 (ResNet152) | Number of Params: 58M Top 1 Accuracy: 79.3% |
| self-supervised-image-classification-on | ReLICv2 (ResNet-50 x2) | Number of Params: 94M Top 1 Accuracy: 79% |
| semantic-segmentation-on-cityscapes-val | BYOL | mIoU: 74.6 |
| semantic-segmentation-on-cityscapes-val | ReLICv2 | mIoU: 75.2 |
| semantic-segmentation-on-pascal-voc-2012-val | ReLICv2 | mIoU: 77.9% |
| semantic-segmentation-on-pascal-voc-2012-val | BYOL | mIoU: 75.7% |
| semantic-segmentation-on-pascal-voc-2012-val | DetCon | mIoU: 77.3% |
| semi-supervised-image-classification-on-1 | ReLICv2 | Top 1 Accuracy: 58.1% Top 5 Accuracy: 81.3% |
| semi-supervised-image-classification-on-2 | ReLICv2 (ResNet-50) | Top 1 Accuracy: 72.4% Top 5 Accuracy: 91.2% |
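
Several rows above report Top-1 and Top-5 accuracy. As a reminder of what those metrics measure, here is a minimal sketch in plain Python; the function name `top_k_accuracy` and the toy scores are hypothetical and are not taken from the paper or its code.

```python
def top_k_accuracy(scores, labels, k=1):
    """Fraction of examples whose true label is among the k highest scores.

    scores: list of per-class score lists, one per example.
    labels: list of integer class indices, one per example.
    """
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Toy example with three classes and three examples.
scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
]
labels = [1, 1, 0]
top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
```

Top-k accuracy can only stay the same or grow as k grows, which is why the Top-5 figures in the table are higher than the corresponding Top-1 figures.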