
Abstract
The success of language Transformers is primarily attributed to the pretext task of masked language modeling (MLM), where text is first tokenized into semantically meaningful pieces. In this work, we study masked image modeling (MIM) and indicate the advantages and challenges of using a semantically meaningful visual tokenizer. We present iBOT, a self-supervised framework that performs masked prediction with an online tokenizer. Specifically, we perform self-distillation on masked patch tokens, taking the teacher network as the online tokenizer, along with self-distillation on the class token to acquire visual semantics. The online tokenizer is jointly learned with the MIM objective, dispensing with the multi-stage training pipeline in which the tokenizer must be pre-trained beforehand. Evaluated on ImageNet-1K, iBOT achieves 82.3% linear probing accuracy and 87.8% fine-tuning accuracy. Beyond these state-of-the-art image classification results, we highlight emerging local semantic patterns, which help the model obtain strong robustness against common corruptions and achieve leading results on dense downstream tasks such as object detection, instance segmentation, and semantic segmentation.
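As a rough illustration of the objective described above, the sketch below combines iBOT's two self-distillation terms: a MIM loss between teacher and student token distributions over the masked patches, and a DINO-style cross-view loss on the class token. All names, shapes, and hyperparameters here (`student`, `teacher`, `tau_s`, `tau_t`, the `(cls, patches)` return convention) are illustrative assumptions, not the official bytedance/ibot API; the teacher's centering/sharpening, loss symmetrization over views, and the EMA update are omitted for brevity.

```python
# Minimal sketch of iBOT's masked-prediction self-distillation,
# under assumed interfaces (not the official bytedance/ibot code).
import torch
import torch.nn.functional as F

def ibot_losses(student, teacher, view1, view2, mask,
                tau_s=0.1, tau_t=0.04):
    """view1, view2: two augmentations of a batch, (B, C, H, W).
    mask: (B, N) boolean, True where patch tokens of view1 are masked.
    Both networks are assumed to return (cls_logits, patch_logits)
    with shapes (B, D) and (B, N, D) after their projection heads."""
    # Student encodes the masked view; the teacher (acting as the
    # online tokenizer) encodes the same view without masking.
    s_cls1, s_patch1 = student(view1, mask=mask)
    with torch.no_grad():  # no gradients flow into the teacher
        t_cls1, t_patch1 = teacher(view1)
        t_cls2, _ = teacher(view2)

    # MIM term: cross-entropy between teacher and student
    # distributions, computed only at the masked positions.
    s_logp = F.log_softmax(s_patch1 / tau_s, dim=-1)
    t_prob = F.softmax(t_patch1 / tau_t, dim=-1)
    loss_mim = -(t_prob * s_logp).sum(-1)[mask].mean()

    # [CLS] term: DINO-style cross-view distillation that supplies
    # the visual semantics mentioned in the abstract.
    loss_cls = -(F.softmax(t_cls2 / tau_t, dim=-1)
                 * F.log_softmax(s_cls1 / tau_s, dim=-1)).sum(-1).mean()
    return loss_mim + loss_cls
```

In the paper, the teacher is maintained as an exponential moving average of the student, which is what makes the tokenizer "online": it evolves jointly with the MIM objective rather than being trained in a separate preliminary stage.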
Code Repositories
bytedance/ibot (official; PyTorch; mentioned in GitHub)
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| instance-segmentation-on-coco | iBOT (ViT-B/16) | mask AP: 44.2 |
| instance-segmentation-on-coco | iBOT (ViT-S/16) | mask AP: 42.6 |
| object-detection-on-coco | iBOT (ViT-B/16) | box mAP: 51.2 |
| object-detection-on-coco | iBOT (ViT-S/16) | box mAP: 49.4 |
| self-supervised-image-classification-on | iBOT (ViT-L/16) (IN22k) | Params: 307M; Top-1 Accuracy: 82.3% |
| self-supervised-image-classification-on | iBOT (ViT-L/16) | Params: 307M; Top-1 Accuracy: 81.3% |
| self-supervised-image-classification-on-1 | iBOT (ViT-L/16) | Params: 307M; Top-1 Accuracy: 84.8% |
| self-supervised-image-classification-on-1 | iBOT (ViT-L/16) (IN22k) | Params: 307M; Top-1 Accuracy: 86.6% |
| self-supervised-image-classification-on-1 | iBOT (ViT-B/16) | Params: 85M; Top-1 Accuracy: 84.0% |
| self-supervised-image-classification-on-1 | iBOT (ViT-L/16, 512) (IN22k) | Params: 307M; Top-1 Accuracy: 87.8% |
| self-supervised-image-classification-on-1 | iBOT (ViT-B/16) (IN22k) | Params: 85M; Top-1 Accuracy: 84.4% |
| semantic-segmentation-on-ade20k | iBOT (ViT-S/16) | Validation mIoU: 45.4 |
| semantic-segmentation-on-ade20k | iBOT (ViT-B/16) (linear head) | Validation mIoU: 38.3 |
| semantic-segmentation-on-ade20k | iBOT (ViT-B/16) | Validation mIoU: 50.0 |
| semi-supervised-image-classification-on-1 | iBOT (ViT-S/16) | Top-1 Accuracy: 61.9% |
| unsupervised-image-classification-on-imagenet | iBOT (ViT-S/16) | ARI: 32.8; Accuracy: 43.4% |