
Abstract
Masked autoencoders have become a popular training paradigm for self-supervised visual representation learning. These models learn by randomly masking a portion of the input and reconstructing the masked region according to target representations. In this paper, we first show that a careful design of the target representation is unnecessary for learning good representations, since different targets tend to induce similarly behaved models. Driven by this observation, we propose a multi-stage masked distillation pipeline that uses a randomly initialized model as the teacher, which allows us to effectively train high-capacity models without any effort to carefully design target representations. Interestingly, we further explore using teachers of larger capacity, obtaining student models with remarkable transfer ability. On various tasks including image classification, transfer learning, object detection, and semantic segmentation, the proposed method of masked knowledge distillation with bootstrapped teachers (dBOT) outperforms previous self-supervised methods by nontrivial margins. We hope our findings, as well as the proposed method, will motivate researchers to rethink the role of target representations in pre-training masked autoencoders. Code and pre-trained models are publicly available at https://github.com/liuxingbin/dbot.
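To make the bootstrapped-teacher pipeline concrete, below is a minimal PyTorch sketch (the official repository is PyTorch-based). All names here (`MaskedEncoder`, `distill_stage`), the toy backbone, the loss, and the stage schedule are illustrative assumptions rather than the dBOT codebase's actual API; the sketch only mirrors the paper's high-level structure: distill from a frozen teacher on masked patches, then promote the student to be the next stage's teacher.

```python
# A minimal, self-contained sketch of masked knowledge distillation with a
# bootstrapped teacher, in the spirit of dBOT. All names and hyperparameters
# are illustrative assumptions, not the official repo's API.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedEncoder(nn.Module):
    """Toy patch encoder standing in for a ViT backbone."""
    def __init__(self, patch_dim=768, dim=256, depth=4):
        super().__init__()
        self.embed = nn.Linear(patch_dim, dim)              # patch pixels -> tokens
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.blocks = nn.Sequential(
            *[nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
              for _ in range(depth)])

    def forward(self, patches, mask=None):
        x = self.embed(patches)                             # (B, N, dim)
        if mask is not None:                                # swap in mask tokens
            x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        return self.blocks(x)

def distill_stage(student, teacher, loader, mask_ratio=0.75, lr=1e-4):
    """One stage: the student predicts the frozen teacher's features
    at the masked positions."""
    teacher.eval()
    opt = torch.optim.AdamW(student.parameters(), lr=lr)
    for patches in loader:                                  # (B, N, patch_dim)
        B, N, _ = patches.shape
        mask = torch.rand(B, N) < mask_ratio                # random patch mask
        with torch.no_grad():
            target = teacher(patches)                       # full-view targets
        pred = student(patches, mask=mask)
        loss = F.smooth_l1_loss(pred[mask], target[mask])   # masked positions only
        opt.zero_grad(); loss.backward(); opt.step()
    return student

# Multi-stage bootstrapping: start from a *randomly initialized* teacher
# (no hand-designed target representation), then promote each stage's
# distilled student to be the next stage's teacher.
teacher, student = MaskedEncoder(), MaskedEncoder()
loader = [torch.randn(8, 196, 768) for _ in range(10)]     # dummy patch batches
for stage in range(3):
    student = distill_stage(student, teacher, loader)
    teacher = copy.deepcopy(student)                        # bootstrap the teacher
    student = MaskedEncoder()                               # fresh student per stage
```

The key design point this sketch illustrates is that the first teacher carries no curated target representation at all; useful targets emerge through the bootstrap loop itself. The paper's actual masking strategy, loss, and stage count follow its published recipe and may differ from the simplifications above.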
Code Repositories
liuxingbin/dbot (official, PyTorch; mentioned in GitHub)
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| image-classification-on-imagenet | dBOT ViT-B (CLIP as Teacher) | Top 1 Accuracy: 85.7% |
| image-classification-on-imagenet | dBOT ViT-H (CLIP as Teacher) | Top 1 Accuracy: 88.2% |
| image-classification-on-imagenet | dBOT ViT-L (CLIP as Teacher) | Top 1 Accuracy: 87.8% |
| instance-segmentation-on-coco | dBOT ViT-B (CLIP) | mask AP: 46.2 |
| instance-segmentation-on-coco | dBOT ViT-L (CLIP) | mask AP: 48.8 |
| instance-segmentation-on-coco | dBOT ViT-L | mask AP: 48.3 |
| instance-segmentation-on-coco | dBOT ViT-B | mask AP: 46.3 |
| object-detection-on-coco | dBOT ViT-B (CLIP) | box mAP: 53.6 |
| object-detection-on-coco | dBOT ViT-L (CLIP) | box mAP: 56.8 |
| object-detection-on-coco | dBOT ViT-B | box mAP: 53.5 |
| object-detection-on-coco | dBOT ViT-L | box mAP: 56.1 |
| self-supervised-image-classification-on-1 | dBOT (ViT-H/14) | Number of Params: 632M; Top 1 Accuracy: 88.0% |
| semantic-segmentation-on-ade20k | dBOT ViT-B | Validation mIoU: 50.8 |
| semantic-segmentation-on-ade20k | dBOT ViT-L (CLIP) | Validation mIoU: 56.2 |
| semantic-segmentation-on-ade20k | dBOT ViT-L | Validation mIoU: 55.2 |
| semantic-segmentation-on-ade20k | dBOT ViT-B (CLIP) | Validation mIoU: 52.9 |