
Learning Transferable Visual Models From Natural Language Supervision


Abstract

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability, since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative that leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn state-of-the-art (SOTA) image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones), enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 existing computer vision datasets, spanning tasks such as optical character recognition (OCR), action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with fully supervised baselines without the need for any dataset-specific training. For instance, on zero-shot transfer to ImageNet it matches the accuracy of the original ResNet-50 without using any of the 1.28 million training examples that model was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.
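The pre-training task described in the abstract, predicting which caption goes with which image, reduces to a contrastive objective over a batch of aligned (image, text) pairs. Below is a minimal PyTorch sketch of that symmetric loss; it follows the pseudocode style of the paper, but the fixed temperature and the assumption of pre-computed features are simplifications, not the exact training code.

```python
# Minimal sketch of CLIP's symmetric contrastive loss. Assumes a batch of
# N aligned (image, text) pairs already encoded into d-dim feature tensors.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features: torch.Tensor,
                          text_features: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    # L2-normalize so dot products are cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # [N, N] similarity logits; entry (i, j) scores image i against text j.
    logits = image_features @ text_features.t() / temperature

    # The matching pair for row/column i sits on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy over both retrieval directions
    # (image-to-text rows and text-to-image columns), averaged.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```

In the paper the temperature is a learned parameter rather than the constant used here.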

Code Repositories

sberbank-ai/ru-clip (pytorch) · Mentioned in GitHub
sincerass/mvlpt (pytorch) · Mentioned in GitHub
nopperl/clip_arxiv_pmc · Mentioned in GitHub
mainaksingha01/applenet (pytorch) · Mentioned in GitHub
FreddeFrallan/Multilingual-CLIP (pytorch) · Mentioned in GitHub
AndresPMD/Clip_CMR (pytorch) · Mentioned in GitHub
facebookresearch/brainmagick (pytorch) · Mentioned in GitHub
mlfoundations/open_clip (pytorch) · Mentioned in GitHub
jhaprince/multibully (pytorch) · Mentioned in GitHub
baskargroup/biotrove (pytorch) · Mentioned in GitHub
iejMac/ScriptWriter (pytorch) · Mentioned in GitHub
salesforce/pb-ovd (pytorch) · Mentioned in GitHub
klemens-floege/oneprot (pytorch) · Mentioned in GitHub
sajjjadayobi/CLIPfa (pytorch) · Mentioned in GitHub
michi-3000/eyeclip (pytorch) · Mentioned in GitHub
prabhupad26/100daysofML (pytorch) · Mentioned in GitHub
YvanG/VQGAN-CLIP (pytorch) · Mentioned in GitHub
SforAiDl/CountCLIP (pytorch) · Mentioned in GitHub
dhansmair/flamingo-mini (pytorch) · Mentioned in GitHub
eify/open_clip (pytorch) · Mentioned in GitHub
facebookresearch/clip-rocket (pytorch) · Mentioned in GitHub
zhangxu0963/npc (pytorch) · Mentioned in GitHub
ylqi/count-anything (pytorch) · Mentioned in GitHub
bespontaneous/proteus-pytorch (pytorch) · Mentioned in GitHub
buyeah1109/finc (pytorch) · Mentioned in GitHub
minhanh151/respro (pytorch) · Mentioned in GitHub
minhanh151/pre (pytorch) · Mentioned in GitHub
ericyinyzy/vlattack (tf) · Mentioned in GitHub
ramanakshay/clip (pytorch) · Mentioned in GitHub
facebookresearch/vissl (pytorch) · Mentioned in GitHub
Kaushalya/medclip (jax) · Mentioned in GitHub
baskargroup/Arboretum (pytorch) · Mentioned in GitHub
mertyg/post-hoc-cbm (pytorch) · Mentioned in GitHub
IMvision12/keras-vision-models (pytorch) · Mentioned in GitHub
mlbio-epfl/turtle (pytorch) · Mentioned in GitHub
buyeah1109/KEN (pytorch) · Mentioned in GitHub
NYU-DICE-Lab/open_clip (pytorch) · Mentioned in GitHub
ml-jku/cloob (pytorch) · Mentioned in GitHub
armaank/archlectures (pytorch) · Mentioned in GitHub
eps696/aphantasia (pytorch) · Mentioned in GitHub
fastscience-ai/medflamingo (pytorch) · Mentioned in GitHub
Gahyeonkim09/AAPL (pytorch) · Mentioned in GitHub
brown-palm/ObjectPrompt (pytorch) · Mentioned in GitHub
clip-italian/clip-italian (jax) · Mentioned in GitHub
mainaksingha01/odg-clip (pytorch) · Mentioned in GitHub
nahidalam/open_clip (pytorch) · Mentioned in GitHub
sithu31296/simple-object-tracking (pytorch) · Mentioned in GitHub
bruthyu/bpt-vlm (pytorch) · Mentioned in GitHub
rinnakk/japanese-clip (pytorch) · Mentioned in GitHub
leolee99/CLIP_ITM (pytorch) · Mentioned in GitHub
filipbasara0/simple-clip (pytorch) · Mentioned in GitHub
moein-shariatnia/OpenAI-CLIP (pytorch) · Mentioned in GitHub
lunaproject22/rpa (pytorch) · Mentioned in GitHub
s-a-malik/multi-few (pytorch) · Mentioned in GitHub
pseulki/rococo (pytorch) · Mentioned in GitHub
shivammehta25/clip (pytorch) · Mentioned in GitHub
openai/CLIP (Official, pytorch) · Mentioned in GitHub; see the usage sketch after this list
redcaps-dataset/redcaps-downloader (pytorch) · Mentioned in GitHub
azshue/TPT (pytorch) · Mentioned in GitHub
giantseaweed/decree (pytorch) · Mentioned in GitHub
ajayjain/vectorascent (pytorch) · Mentioned in GitHub
yuuun/clip_pytorch (pytorch) · Mentioned in GitHub
ai-forever/ru-clip (pytorch) · Mentioned in GitHub
borisdayma/clip-jax (jax) · Mentioned in GitHub
taited/clip-score (pytorch) · Mentioned in GitHub
apple/ml-mobileclip (pytorch) · Mentioned in GitHub

Benchmarks

| Benchmark | Method | Metrics |
|---|---|---|
| action-recognition-on-rareact | CLIP | mWAP: 40.7 |
| few-shot-image-classification-on-imagenet-0 | CLIP (ViT B/32) | Accuracy: 63.2% |
| few-shot-image-classification-on-imagenet-0 | CLIP (ResNet50) | Accuracy: 59.6% |
| hateful-meme-classification-on-harm-p | CLIP | Accuracy: 80.6, F1: 80.3 |
| hateful-meme-classification-on-pridemm | CLIP (fine-tuned) | Accuracy: 72.4, F1: 72.3 |
| image-classification-on-objectnet | CLIP | Top-1 Accuracy: 72.3 |
| image-classification-on-omnibenchmark | CLIP-RN50 | Average Top-1 Accuracy: 42.1 |
| image-to-text-retrieval-on-coco | CLIP (zero-shot) | Recall@1: 58.4, Recall@5: 81.5, Recall@10: 88.1 |
| long-tail-learning-on-coco-mlt | CLIP (ViT-B/16) | Average mAP: 60.17 |
| long-tail-learning-on-coco-mlt | CLIP (ResNet-50) | Average mAP: 56.19 |
| long-tail-learning-on-voc-mlt | CLIP (ViT-B/16) | Average mAP: 85.77 |
| long-tail-learning-on-voc-mlt | CLIP (ResNet-50) | Average mAP: 84.30 |
| meme-classification-on-hateful-memes | CLIP (zero-shot) | ROC-AUC: 0.661 |
| meme-classification-on-multioff | CLIP | Accuracy: 62.4, F1: 48.1 |
| object-categorization-on-grit | CLIP | Categorization (ablation): 48.1 |
| object-recognition-on-shape-bias | CLIP (ViT-B) | Shape bias: 79.9 |
| open-vocabulary-attribute-detection-on-ovad-1 | CLIP ViT-B16 | Mean average precision: 16.6 |
| prompt-engineering-on-caltech-101 | CLIP | Harmonic mean: 95.40 |
| prompt-engineering-on-dtd | CLIP | Harmonic mean: 56.37 |
| prompt-engineering-on-eurosat | CLIP | Harmonic mean: 60.03 |
| prompt-engineering-on-fgvc-aircraft | CLIP | Harmonic mean: 31.09 |
| prompt-engineering-on-imagenet | CLIP | Harmonic mean: 70.22 |
| prompt-engineering-on-imagenet-a | CLIP | Top-1 accuracy: 47.77% |
| prompt-engineering-on-imagenet-r | CLIP | Top-1 accuracy: 73.96% |
| prompt-engineering-on-imagenet-s | CLIP | Top-1 accuracy: 46.15% |
| prompt-engineering-on-imagenet-v2 | CLIP | Top-1 accuracy: 60.83% |
| prompt-engineering-on-oxford-102-flower | CLIP | Harmonic mean: 74.83 |
| prompt-engineering-on-oxford-iiit-pet-dataset | CLIP | Harmonic mean: 94.12 |
| prompt-engineering-on-stanford-cars-1 | CLIP | Harmonic mean: 68.65 |
| prompt-engineering-on-sun397 | CLIP | Harmonic mean: 72.23 |
| prompt-engineering-on-ucf101 | CLIP | Harmonic mean: 73.85 |
| semi-supervised-image-classification-on-16 | CLIP (ResNet-50) | ImageNet Top-1 Accuracy: 40% |
| text-based-person-retrieval-with-noisy | CLIP-C | Rank-1: 66.41, Rank-5: 85.15, Rank-10: 90.89, mAP: 59.36, mINP: 43.02 |
| text-based-person-retrieval-with-noisy-1 | CLIP-C | Rank-1: 55.25, Rank-5: 74.76, Rank-10: 81.32, mAP: 31.09, mINP: 4.94 |
| text-based-person-retrieval-with-noisy-2 | CLIP-C | Rank-1: 54.45, Rank-5: 77.80, Rank-10: 86.70, mAP: 42.58, mINP: 21.38 |
| zero-shot-cross-modal-retrieval-on-coco-2014 | CLIP | Image-to-text R@1: 58.4, R@5: 81.5, R@10: 88.1; Text-to-image R@1: 37.8, R@5: 62.4, R@10: 72.2 |
| zero-shot-cross-modal-retrieval-on-flickr30k | CLIP | Image-to-text R@1: 88.0, R@5: 98.7, R@10: 99.4; Text-to-image R@1: 68.7, R@5: 90.6, R@10: 95.2 |
| zero-shot-learning-on-coco-mlt | ViT-B/16 | Average mAP: 60.17 |
| zero-shot-learning-on-coco-mlt | ResNet-50 | Average mAP: 56.19 |
| zero-shot-learning-on-voc-mlt | CLIP (ViT-B/16) | Average mAP: 85.77 |
| zero-shot-learning-on-voc-mlt | CLIP (ResNet-50) | Average mAP: 84.30 |
| zero-shot-transfer-image-classification-on | CLIP | Accuracy: 98.4 |
| zero-shot-transfer-image-classification-on-1 | CLIP (ViT-L/14-336px) | Accuracy (Private): 76.2 |
| zero-shot-transfer-image-classification-on-1 | CLIP | Accuracy (Public): 31.3 |
| zero-shot-transfer-image-classification-on-1 | CLIP (ResNet50) | Accuracy (Private): 59.6 |
| zero-shot-transfer-image-classification-on-2 | CLIP | Accuracy: 58.5 |
| zero-shot-transfer-image-classification-on-3 | CLIP | Accuracy (Private): 70.1, Accuracy (Public): - |
| zero-shot-transfer-image-classification-on-4 | CLIP | Accuracy: 88.9 |
| zero-shot-transfer-image-classification-on-5 | CLIP | Accuracy (Private): 77.2, Accuracy (Public): - |
| zero-shot-transfer-image-classification-on-6 | CLIP | Accuracy (Private): 72.3, Accuracy (Public): - |
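Several rows above report zero-shot retrieval Recall@K. As an illustration of how such numbers are derived from CLIP embeddings, here is a sketch; recall_at_k is a hypothetical helper written for this page, not taken from any repository listed above.

```python
# Hypothetical Recall@K helper: a query retrieves correctly if any of its
# ground-truth gallery items appears among its K nearest neighbors by
# cosine similarity (features assumed L2-normalized).
import torch

def recall_at_k(query_feats: torch.Tensor,    # [N, d], e.g. image features
                gallery_feats: torch.Tensor,  # [M, d], e.g. caption features
                ground_truth: list,           # ground_truth[i]: set of gallery
                                              # indices matching query i
                k: int) -> float:
    sims = query_feats @ gallery_feats.t()    # [N, M] cosine similarities
    topk = sims.topk(k, dim=-1).indices       # top-K gallery items per query
    hits = [bool(set(topk[i].tolist()) & ground_truth[i])
            for i in range(len(ground_truth))]
    return sum(hits) / len(hits)
```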
