Alec Radford; Jong Wook Kim; Chris Hallacy; Aditya Ramesh; Gabriel Goh; Sandhini Agarwal; Girish Sastry; Amanda Askell; Pamela Mishkin; Jack Clark; Gretchen Krueger; Ilya Sutskever

Abstract
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability, since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative that leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn state-of-the-art (SOTA) image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones), enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 existing computer vision datasets, spanning tasks such as optical character recognition (OCR), action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with fully supervised baselines without requiring any dataset-specific training. For instance, it matches the accuracy of the original ResNet-50 on ImageNet zero-shot, without using any of the 1.28 million training examples that model was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.
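As a rough illustration of the zero-shot transfer described above, the following minimal sketch classifies an image by scoring natural-language prompts against it with the openai/CLIP package (listed among the repositories below). The class names and the image path are placeholders, not values from the paper.

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Class names become natural-language prompts; "example.jpg" is a placeholder image.
class_names = ["a dog", "a cat", "a car"]
text = clip.tokenize([f"a photo of {c}" for c in class_names]).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    # Encode both modalities into the shared embedding space.
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    # Cosine similarity (normalized dot products) scores each caption against the image.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print({name: float(p) for name, p in zip(class_names, probs[0])})
```

The prompt wording (e.g., "a photo of ...") is one of the prompt-engineering choices benchmarked in the table further below.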
Code Repositories
- sberbank-ai/ru-clip (pytorch) · Mentioned in GitHub
- sincerass/mvlpt (pytorch) · Mentioned in GitHub
- nopperl/clip_arxiv_pmc · Mentioned in GitHub
- mainaksingha01/applenet (pytorch) · Mentioned in GitHub
- FreddeFrallan/Multilingual-CLIP (pytorch) · Mentioned in GitHub
- AndresPMD/Clip_CMR (pytorch) · Mentioned in GitHub
- facebookresearch/brainmagick (pytorch) · Mentioned in GitHub
- mlfoundations/open_clip (pytorch) · Mentioned in GitHub
- jhaprince/multibully (pytorch) · Mentioned in GitHub
- a736875071/clip-vit-large-patch14 · Mentioned in GitHub
- baskargroup/biotrove (pytorch) · Mentioned in GitHub
- iejMac/ScriptWriter (pytorch) · Mentioned in GitHub
- salesforce/pb-ovd (pytorch) · Mentioned in GitHub
- klemens-floege/oneprot (pytorch) · Mentioned in GitHub
- sajjjadayobi/CLIPfa (pytorch) · Mentioned in GitHub
- michi-3000/eyeclip (pytorch) · Mentioned in GitHub
- ZackPashkin/text2cartoon-pytorch-CLIP (pytorch) · Mentioned in GitHub
- prabhupad26/100daysofML (pytorch) · Mentioned in GitHub
- YvanG/VQGAN-CLIP (pytorch) · Mentioned in GitHub
- SforAiDl/CountCLIP (pytorch) · Mentioned in GitHub
- dhansmair/flamingo-mini (pytorch) · Mentioned in GitHub
- eify/open_clip (pytorch) · Mentioned in GitHub
- facebookresearch/clip-rocket (pytorch) · Mentioned in GitHub
- pwc-1/Paper-8/tree/main/clip (mindspore)
- zhangxu0963/npc (pytorch) · Mentioned in GitHub
- ylqi/count-anything (pytorch) · Mentioned in GitHub
- bespontaneous/proteus-pytorch (pytorch) · Mentioned in GitHub
- buyeah1109/finc (pytorch) · Mentioned in GitHub
- minhanh151/respro (pytorch) · Mentioned in GitHub
- minhanh151/pre (pytorch) · Mentioned in GitHub
- ericyinyzy/vlattack (tf) · Mentioned in GitHub
- ramanakshay/clip (pytorch) · Mentioned in GitHub
- facebookresearch/vissl (pytorch) · Mentioned in GitHub
- kynkaat/role-of-imagenet-classes-in-fid (pytorch) · Mentioned in GitHub
- Kaushalya/medclip (jax) · Mentioned in GitHub
- baskargroup/Arboretum (pytorch) · Mentioned in GitHub
- mertyg/post-hoc-cbm (pytorch) · Mentioned in GitHub
- madrylab/pretraining-distribution-shift-robustness (pytorch) · Mentioned in GitHub
- IMvision12/keras-vision-models (pytorch) · Mentioned in GitHub
- mlbio-epfl/turtle (pytorch) · Mentioned in GitHub
- buyeah1109/KEN (pytorch) · Mentioned in GitHub
- NYU-DICE-Lab/open_clip (pytorch) · Mentioned in GitHub
- ml-jku/cloob (pytorch) · Mentioned in GitHub
- armaank/archlectures (pytorch) · Mentioned in GitHub
- eps696/aphantasia (pytorch) · Mentioned in GitHub
- fastscience-ai/medflamingo (pytorch) · Mentioned in GitHub
- Gahyeonkim09/AAPL (pytorch) · Mentioned in GitHub
- brown-palm/ObjectPrompt (pytorch) · Mentioned in GitHub
- clip-italian/clip-italian (jax) · Mentioned in GitHub
- mainaksingha01/odg-clip (pytorch) · Mentioned in GitHub
- nahidalam/open_clip (pytorch) · Mentioned in GitHub
- fiabdu/Commonly-Interesting-Images · Mentioned in GitHub
- sithu31296/simple-object-tracking (pytorch) · Mentioned in GitHub
- bruthyu/bpt-vlm (pytorch) · Mentioned in GitHub
- linjieli222/hero_video_feature_extractor (pytorch) · Mentioned in GitHub
- rinnakk/japanese-clip (pytorch) · Mentioned in GitHub
- leolee99/CLIP_ITM (pytorch) · Mentioned in GitHub
- filipbasara0/simple-clip (pytorch) · Mentioned in GitHub
- moein-shariatnia/OpenAI-CLIP (pytorch) · Mentioned in GitHub
- lunaproject22/rpa (pytorch) · Mentioned in GitHub
- shunk031/simple-aesthetics-predictor (pytorch) · Mentioned in GitHub
- s-a-malik/multi-few (pytorch) · Mentioned in GitHub
- pseulki/rococo (pytorch) · Mentioned in GitHub
- shivammehta25/clip (pytorch) · Mentioned in GitHub
- openai/CLIP (pytorch) · Official · Mentioned in GitHub
- redcaps-dataset/redcaps-downloader (pytorch) · Mentioned in GitHub
- towhee-io/towhee (pytorch)
- azshue/TPT (pytorch) · Mentioned in GitHub
- giantseaweed/decree (pytorch) · Mentioned in GitHub
- ajayjain/vectorascent (pytorch) · Mentioned in GitHub
- yuuun/clip_pytorch (pytorch) · Mentioned in GitHub
- ai-forever/ru-clip (pytorch) · Mentioned in GitHub
- muzairkhattak/multimodal-prompt-learning (pytorch) · Mentioned in GitHub
- borisdayma/clip-jax (jax) · Mentioned in GitHub
- taited/clip-score (pytorch) · Mentioned in GitHub
- apple/ml-mobileclip (pytorch) · Mentioned in GitHub
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| action-recognition-on-rareact | CLIP | mWAP: 40.7 |
| few-shot-image-classification-on-imagenet-0 | CLIP (ViT B/32) | Accuracy: 63.2% |
| few-shot-image-classification-on-imagenet-0 | CLIP (ResNet50) | Accuracy: 59.6% |
| hateful-meme-classification-on-harm-p | CLIP | Accuracy: 80.6 F1: 80.3 |
| hateful-meme-classification-on-pridemm | CLIP (fine-tuned) | Accuracy: 72.4 F1: 72.3 |
| image-classification-on-objectnet | CLIP | Top-1 Accuracy: 72.3 |
| image-classification-on-omnibenchmark | CLIP-RN50 | Average Top-1 Accuracy: 42.1 |
| image-to-text-retrieval-on-coco | CLIP (zero-shot) | Recall@1: 58.4 Recall@10: 88.1 Recall@5: 81.5 |
| long-tail-learning-on-coco-mlt | CLIP(ViT-B/16) | Average mAP: 60.17 |
| long-tail-learning-on-coco-mlt | CLIP(ResNet-50) | Average mAP: 56.19 |
| long-tail-learning-on-voc-mlt | CLIP(ViT-B/16) | Average mAP: 85.77 |
| long-tail-learning-on-voc-mlt | CLIP(ResNet-50) | Average mAP: 84.30 |
| meme-classification-on-hateful-memes | CLIP (zero-shot) | ROC-AUC: 0.661 |
| meme-classification-on-multioff | CLIP | Accuracy: 62.4 F1: 48.1 |
| object-categorization-on-grit | CLIP | Categorization (ablation): 48.1 |
| object-recognition-on-shape-bias | CLIP (ViT-B) | shape bias: 79.9 |
| open-vocabulary-attribute-detection-on-ovad-1 | CLIP VIT-B16 | mean average precision: 16.6 |
| prompt-engineering-on-caltech-101 | CLIP | Harmonic mean: 95.40 |
| prompt-engineering-on-dtd | CLIP | Harmonic mean: 56.37 |
| prompt-engineering-on-eurosat | CLIP | Harmonic mean: 60.03 |
| prompt-engineering-on-fgvc-aircraft | CLIP | Harmonic mean: 31.09 |
| prompt-engineering-on-imagenet | CLIP | Harmonic mean: 70.22 |
| prompt-engineering-on-imagenet-a | CLIP | Top-1 accuracy %: 47.77 |
| prompt-engineering-on-imagenet-r | CLIP | Top-1 accuracy %: 73.96 |
| prompt-engineering-on-imagenet-s | CLIP | Top-1 accuracy %: 46.15 |
| prompt-engineering-on-imagenet-v2 | CLIP | Top-1 accuracy %: 60.83 |
| prompt-engineering-on-oxford-102-flower | CLIP | Harmonic mean: 74.83 |
| prompt-engineering-on-oxford-iiit-pet-dataset | CLIP | Harmonic mean: 94.12 |
| prompt-engineering-on-stanford-cars-1 | CLIP | Harmonic mean: 68.65 |
| prompt-engineering-on-sun397 | CLIP | Harmonic mean: 72.23 |
| prompt-engineering-on-ucf101 | CLIP | Harmonic mean: 73.85 |
| semi-supervised-image-classification-on-16 | CLIP (ResNet-50) | ImageNet Top-1 Accuracy: 40% |
| text-based-person-retrieval-with-noisy | CLIP-C | Rank 10: 90.89 Rank-1: 66.41 Rank-5: 85.15 mAP: 59.36 mINP: 43.02 |
| text-based-person-retrieval-with-noisy-1 | CLIP-C | Rank 1: 55.25 Rank-10: 81.32 Rank-5: 74.76 mAP: 31.09 mINP: 4.94 |
| text-based-person-retrieval-with-noisy-2 | CLIP-C | Rank 1: 54.45 Rank 10: 86.70 Rank 5: 77.80 mAP: 42.58 mINP: 21.38 |
| zero-shot-cross-modal-retrieval-on-coco-2014 | CLIP | Image-to-text R@1: 58.4 Image-to-text R@10: 88.1 Image-to-text R@5: 81.5 Text-to-image R@1: 37.8 Text-to-image R@10: 72.2 Text-to-image R@5: 62.4 |
| zero-shot-cross-modal-retrieval-on-flickr30k | CLIP | Image-to-text R@1: 88.0 Image-to-text R@10: 99.4 Image-to-text R@5: 98.7 Text-to-image R@1: 68.7 Text-to-image R@10: 95.2 Text-to-image R@5: 90.6 |
| zero-shot-learning-on-coco-mlt | ViT-B/16 | Average mAP: 60.17 |
| zero-shot-learning-on-coco-mlt | ResNet-50 | Average mAP: 56.19 |
| zero-shot-learning-on-voc-mlt | CLIP(ViT-B/16) | Average mAP: 85.77 |
| zero-shot-learning-on-voc-mlt | CLIP(ResNet-50) | Average mAP: 84.30 |
| zero-shot-transfer-image-classification-on | CLIP | Accuracy: 98.4 |
| zero-shot-transfer-image-classification-on-1 | CLIP(ViT-L/14-336px) | Accuracy (Private): 76.2 |
| zero-shot-transfer-image-classification-on-1 | CLIP | Accuracy (Public): 31.3 |
| zero-shot-transfer-image-classification-on-1 | CLIP (ResNet50) | Accuracy (Private): 59.6 |
| zero-shot-transfer-image-classification-on-2 | CLIP | Accuracy: 58.5 |
| zero-shot-transfer-image-classification-on-3 | CLIP | Accuracy (Private): 70.1 Accuracy (Public): - |
| zero-shot-transfer-image-classification-on-4 | CLIP | Accuracy: 88.9 |
| zero-shot-transfer-image-classification-on-5 | CLIP | Accuracy (Private): 77.2 Accuracy (Public): - |
| zero-shot-transfer-image-classification-on-6 | CLIP | Accuracy (Private): 72.3 Accuracy (Public): - |
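The retrieval rows above (e.g., zero-shot-cross-modal-retrieval-on-coco-2014) report Recall@K over CLIP's image-text similarity matrix. The sketch below shows one simple way such a number can be computed from precomputed embeddings; the random tensors and the assumption of exactly one matching caption per image are illustrative only and do not reproduce the exact COCO/Flickr30k evaluation protocol behind the table.

```python
import torch

def image_to_text_recall_at_k(image_emb: torch.Tensor, text_emb: torch.Tensor, k: int) -> float:
    """Recall@K, assuming text_emb[i] is the single caption paired with image_emb[i]."""
    # Normalize so dot products equal cosine similarities, as in CLIP.
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    sims = image_emb @ text_emb.T                      # (num_images, num_texts)
    topk = sims.topk(k, dim=-1).indices                # k highest-scoring captions per image
    targets = torch.arange(image_emb.size(0)).unsqueeze(-1)
    hits = (topk == targets).any(dim=-1).float()       # did the true caption make the top k?
    return hits.mean().item()

# Illustrative random embeddings standing in for CLIP image/text encoder outputs.
img = torch.randn(100, 512)
txt = torch.randn(100, 512)
print(image_to_text_recall_at_k(img, txt, k=5))
```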