
Abstract
Vision-language models (VLMs) trained with contrastive loss have achieved significant advances in various vision and language tasks. However, the global nature of the contrastive loss makes VLMs focus predominantly on foreground objects, neglecting other crucial information in the image, which limits their effectiveness in downstream tasks. To address these challenges, we propose COSMOS: CrOSs-MOdality Self-distillation for vision-language pre-training, which integrates a novel text-cropping strategy and a cross-attention module into a self-supervised learning framework. We create global and local views of images and texts (i.e., multi-modal augmentations), which are essential for self-distillation in VLMs. We further introduce a cross-attention module that enables COSMOS to learn comprehensive cross-modal representations optimized via a cross-modality self-distillation loss. COSMOS consistently outperforms previous strong baselines on a variety of zero-shot downstream tasks, including retrieval, classification, and semantic segmentation. Moreover, it surpasses CLIP-based models trained on larger datasets in visual perception and contextual understanding tasks. Code is available at https://github.com/ExplainableML/cosmos.
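To make the two ingredients named above more concrete, below is a minimal PyTorch sketch of (a) a text-cropping augmentation that produces local views and (b) a DINO-style self-distillation loss with an EMA teacher. Everything here (`Encoder`, `crop_text`, the temperatures, and the momentum value) is an illustrative assumption, not the official COSMOS implementation; see the linked repository for that.

```python
# Minimal sketch of text cropping + self-distillation (DINO-style).
# All names and hyper-parameters are illustrative assumptions, NOT the
# official COSMOS code (see https://github.com/ExplainableML/cosmos).
import copy
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

def crop_text(tokens, ratio=0.5):
    """Hypothetical text crop: keep a random contiguous span covering `ratio` of the tokens."""
    n = max(1, int(len(tokens) * ratio))
    start = random.randint(0, len(tokens) - n)
    return tokens[start:start + n]

class Encoder(nn.Module):
    """Stand-in for an image or text encoder (a real model would use a ViT / transformer)."""
    def __init__(self, dim_in=128, dim_out=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim_in, dim_in), nn.GELU(),
                                 nn.Linear(dim_in, dim_out))

    def forward(self, x):
        return self.net(x)

def self_distillation_loss(student_logits, teacher_logits, t_student=0.1, t_teacher=0.04):
    """Cross-entropy between the sharpened teacher distribution and the student."""
    teacher_probs = F.softmax(teacher_logits / t_teacher, dim=-1).detach()
    return -(teacher_probs * F.log_softmax(student_logits / t_student, dim=-1)).sum(-1).mean()

# Student and teacher share an architecture; the teacher is updated by EMA only.
student = Encoder()
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)

# Toy global/local views of one batch (random features stand in for image crops).
global_view = torch.randn(8, 128)
local_view = global_view + 0.1 * torch.randn(8, 128)

# Teacher sees the global view, student the local view.
loss = self_distillation_loss(student(local_view), teacher(global_view))
loss.backward()

# EMA teacher update (run after each optimizer step).
momentum = 0.996
with torch.no_grad():
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1 - momentum)

print(crop_text("a photo of a dog playing in the park".split()))  # a local text view
print(f"self-distillation loss: {loss.item():.4f}")
```

In the actual method, representations additionally pass through the cross-attention module so that each modality attends to the other before the distillation loss is applied; the sketch above only shows the single-modality distillation skeleton.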
Code Repository
ExplainableML/cosmos (official, PyTorch)
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| unsupervised-semantic-segmentation-with-10 | COSMOS ViT-B/16 | mIoU: 31.3 |
| unsupervised-semantic-segmentation-with-3 | COSMOS ViT-B/16 | mIoU: 34.7 |
| unsupervised-semantic-segmentation-with-4 | COSMOS ViT-B/16 | Mean IoU (val): 17.7 |
| unsupervised-semantic-segmentation-with-7 | COSMOS ViT-B/16 | mIoU: 77.7 |
| unsupervised-semantic-segmentation-with-8 | COSMOS ViT-B/16 | mIoU: 33.7 |
| unsupervised-semantic-segmentation-with-9 | COSMOS ViT-B/16 | mIoU: 23.2 |
| zero-shot-cross-modal-retrieval-on-coco-2014 | COSMOS ViT-B/32 | Image-to-text R@1: 64.3, R@5: 86.5, R@10: 92.0; Text-to-image R@1: 48.4, R@5: 74.2, R@10: 82.6 |
| zero-shot-cross-modal-retrieval-on-coco-2014 | COSMOS ViT-B/16 | Image-to-text R@1: 68.0, R@5: 87.8, R@10: 92.5; Text-to-image R@1: 52.5, R@5: 77.2, R@10: 84.9 |
| zero-shot-cross-modal-retrieval-on-flickr30k | COSMOS ViT-B/32 | Image-to-text R@1: 89.9, R@5: 98.8, R@10: 99.3; Text-to-image R@1: 76.1, R@5: 92.8, R@10: 96.2 |
| zero-shot-cross-modal-retrieval-on-flickr30k | COSMOS ViT-B/16 | Image-to-text R@1: 92.9, R@5: 99.4, R@10: 99.9; Text-to-image R@1: 80.3, R@5: 95.3, R@10: 97.6 |
| zero-shot-segmentation-on-ade20k-training | COSMOS ViT-B/16 | mIoU: 17.7 |
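For reference, the Recall@K numbers in the retrieval rows are typically computed by ranking all candidate captions for each image by cosine similarity and checking whether the ground-truth caption lands in the top K. The sketch below shows that standard computation on random stand-in embeddings; `recall_at_k` and its inputs are assumptions for illustration, not COSMOS outputs.

```python
# Standard Recall@K for cross-modal retrieval, on random stand-in embeddings.
import torch
import torch.nn.functional as F

def recall_at_k(image_emb, text_emb, ks=(1, 5, 10)):
    """image_emb[i] and text_emb[i] are assumed to be a matched pair."""
    sim = F.normalize(image_emb, dim=-1) @ F.normalize(text_emb, dim=-1).T
    ranks = sim.argsort(dim=-1, descending=True)      # ranked text indices per image
    targets = torch.arange(sim.size(0)).unsqueeze(1)  # ground-truth text index per image
    return {k: (ranks[:, :k] == targets).any(dim=-1).float().mean().item() for k in ks}

image_emb, text_emb = torch.randn(100, 64), torch.randn(100, 64)
print(recall_at_k(image_emb, text_emb))  # random embeddings give R@K near K/100
```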