
Abstract
Scaling up contrastive language-image pretraining (CLIP) is critical for empowering both vision and multimodal models. We present EVA-CLIP-18B, the largest and most powerful open-source CLIP model to date, with 18 billion parameters. With only 6 billion training samples seen, EVA-CLIP-18B achieves an exceptional 80.7% zero-shot top-1 accuracy averaged across 27 widely recognized image classification benchmarks, outperforming its forerunner EVA-CLIP (5 billion parameters) and other open-source CLIP models by a large margin. Remarkably, we observe consistent performance improvement as the EVA-CLIP model size scales up, despite maintaining a fixed training dataset of 2 billion image-text pairs from LAION-2B and COYO-700M. This dataset is openly available and much smaller than the in-house datasets (e.g., DFN-5B, WebLI-10B) employed by other state-of-the-art CLIP models. EVA-CLIP-18B demonstrates the potential of EVA-style weak-to-strong scaling of visual models. We publicly release the model weights in the hope of facilitating future research in vision and multimodal foundation models.
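The zero-shot top-1 accuracy reported above follows the standard CLIP evaluation protocol: embed the image and one text prompt per class (e.g. "a photo of a {class}"), L2-normalize both sides, and rank classes by cosine similarity. Below is a minimal sketch of that protocol; the random vectors are placeholders standing in for actual EVA-CLIP-18B image and text encoder outputs, and the temperature value is illustrative, not the model's learned scale.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder embeddings standing in for EVA-CLIP-18B encoder outputs:
# one image embedding, and one text embedding per class prompt.
num_classes, dim = 5, 64
image_emb = rng.normal(size=dim)
text_embs = rng.normal(size=(num_classes, dim))

# CLIP zero-shot protocol: L2-normalize both modalities, then score
# each class by cosine similarity scaled by a temperature.
image_emb /= np.linalg.norm(image_emb)
text_embs /= np.linalg.norm(text_embs, axis=1, keepdims=True)

logits = 100.0 * (text_embs @ image_emb)  # 100.0 is an illustrative temperature

# Softmax over class logits; the argmax is the zero-shot prediction.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
pred = int(np.argmax(probs))
print(pred, float(probs[pred]))
```

Top-1 accuracy on a benchmark is then simply the fraction of test images whose `pred` matches the ground-truth label.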
Code Repository
baaivision/eva
Official
PyTorch
Benchmarks
| Benchmark | Method | Metric |
|---|---|---|
| zero-shot-transfer-image-classification-on-1 | EVA-CLIP-18B | Accuracy (Private): 83.8 |
| zero-shot-transfer-image-classification-on-17 | EVA-CLIP-18B | Top 1 Accuracy: 95.8 |
| zero-shot-transfer-image-classification-on-2 | EVA-CLIP-18B | Accuracy: 77.7 |
| zero-shot-transfer-image-classification-on-3 | EVA-CLIP-18B | Accuracy (Private): 77.9 |
| zero-shot-transfer-image-classification-on-4 | EVA-CLIP-18B | Accuracy: 95.7 |
| zero-shot-transfer-image-classification-on-5 | EVA-CLIP-18B | Accuracy (Private): 87.3 |
| zero-shot-transfer-image-classification-on-6 | EVA-CLIP-18B | Accuracy (Private): 82.2 |
| zero-shot-transfer-image-classification-on-8 | EVA-CLIP-18B | Accuracy (Private): 74.7 |