
Abstract
The tremendous success of CLIP (Radford et al., 2021) has promoted the research and application of contrastive vision-language learning for pretraining. In this work, we construct a large-scale dataset of Chinese image-text pairs, where most of the data come from publicly available datasets, and we pretrain Chinese CLIP models on this new dataset. We develop five Chinese CLIP models of different sizes, spanning from 77 million to 958 million parameters. Furthermore, we propose a two-stage pretraining method, in which the model is first trained with the image encoder frozen and then trained with all parameters jointly optimized, to achieve enhanced performance. Our comprehensive experiments demonstrate that Chinese CLIP achieves state-of-the-art performance on MUGE, Flickr30K-CN, and COCO-CN in both zero-shot learning and fine-tuning setups, and that it achieves competitive performance in zero-shot image classification on the ELEVATER benchmark (Li et al., 2022). Our code, models, and demos have been released at https://github.com/OFA-Sys/Chinese-CLIP
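The models described above are trained with the symmetric contrastive objective introduced by CLIP: matched image-text pairs are pulled together and all other in-batch pairs are pushed apart. A minimal NumPy sketch of this loss (function name and temperature value are illustrative, not taken from the paper):

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of N matched image-text pairs."""
    # L2-normalize embeddings so the dot product is cosine similarity
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature  # (N, N) similarity matrix
    labels = np.arange(len(logits))  # matched pairs lie on the diagonal

    def cross_entropy(l):
        # numerically stable log-softmax per row
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average the image-to-text and text-to-image directions
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```

In the paper's two-stage schedule, this same objective is optimized first with the image encoder's parameters frozen (only the text tower updates), and then with all parameters unfrozen.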
Code Repositories
ofa-sys/chinese-clip (official, PyTorch, mentioned in GitHub)
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| image-retrieval-on-coco-cn | CN-CLIP (ViT-B/16) | R@1: 77.0 R@5: 97.1 R@10: 99.0 |
| image-retrieval-on-coco-cn | CN-CLIP (ViT-H/14) | R@1: 81.5 R@5: 96.9 R@10: 99.1 |
| image-retrieval-on-coco-cn | CN-CLIP (ViT-L/14@336px) | R@1: 80.1 R@5: 96.7 R@10: 99.2 |
| image-retrieval-on-coco-cn | CN-CLIP (ViT-L/14) | R@1: 78.9 R@5: 96.3 R@10: 99.0 |
| image-retrieval-on-coco-cn | CN-CLIP (RN50) | R@1: 66.8 R@5: 91.1 R@10: 97.0 |
| image-retrieval-on-flickr30k-cn | CN-CLIP (RN50) | R@1: 66.7 R@5: 89.4 R@10: 94.1 |
| image-retrieval-on-flickr30k-cn | CN-CLIP (ViT-L/14@336px) | R@1: 84.4 R@5: 97.1 R@10: 98.7 |
| image-retrieval-on-flickr30k-cn | CN-CLIP (ViT-H/14) | R@1: 83.8 R@5: 96.9 R@10: 98.6 |
| image-retrieval-on-flickr30k-cn | CN-CLIP (ViT-B/16) | R@1: 79.1 R@5: 94.8 R@10: 97.4 |
| image-retrieval-on-flickr30k-cn | CN-CLIP (ViT-L/14) | R@1: 82.7 R@5: 96.7 R@10: 98.6 |
| image-retrieval-on-muge-retrieval | CN-CLIP (ViT-H/14) | R@1: 68.9 R@5: 88.7 R@10: 93.1 Mean Recall: 83.6 |
| image-retrieval-on-muge-retrieval | CN-CLIP (RN50) | R@1: 48.6 R@5: 75.1 R@10: 84.0 Mean Recall: 69.2 |
| image-retrieval-on-muge-retrieval | CN-CLIP (ViT-B/16) | R@1: 58.4 R@5: 83.6 R@10: 90.0 Mean Recall: 77.4 |
| image-retrieval-on-muge-retrieval | CN-CLIP (ViT-L/14) | R@1: 63.3 R@5: 85.6 R@10: 91.3 Mean Recall: 80.1 |
| image-retrieval-on-muge-retrieval | CN-CLIP (ViT-L/14@336px) | R@1: 65.3 R@5: 86.7 R@10: 92.1 Mean Recall: 81.3 |
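The R@K figures in the table above measure how often the ground-truth image appears among a query caption's top-K most similar candidates. A minimal NumPy sketch of the metric, assuming the model's embeddings have already been reduced to a query-by-candidate similarity matrix (function name is illustrative):

```python
import numpy as np

def recall_at_k(sim, k):
    """Recall@K given sim[i, j] = similarity of query i to candidate j.

    Assumes the ground-truth match for query i is candidate i
    (as in standard paired retrieval benchmarks).
    """
    ranks = np.argsort(-sim, axis=1)  # candidates sorted by descending similarity
    hits = (ranks[:, :k] == np.arange(len(sim))[:, None]).any(axis=1)
    return hits.mean()
```

"Mean Recall" on MUGE is then simply the average of R@1, R@5, and R@10.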