Learning Transferable Visual Models From Natural Language Supervision

Abstract

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones), enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset-specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.
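The pre-training task described in the abstract reduces to a symmetric contrastive objective: within a batch of N (image, text) pairs, each image must pick out its own caption among the N candidates, and each caption must pick out its own image. A minimal PyTorch sketch of that objective follows; the function name, the 512-dimensional features, and the fixed temperature are illustrative (the paper learns the temperature as a parameter):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    # Normalize embeddings so the dot product is a cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity of every image with every caption in the batch.
    logits = image_features @ text_features.t() / temperature

    # The matching pairs lie on the diagonal of the similarity matrix.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: classify the right caption for each image,
    # and the right image for each caption.
    loss_i = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.t(), targets)
    return (loss_i + loss_t) / 2

# Toy usage with random features standing in for encoder outputs.
if __name__ == "__main__":
    imgs = torch.randn(8, 512)  # batch of image embeddings
    txts = torch.randn(8, 512)  # batch of caption embeddings
    print(clip_contrastive_loss(imgs, txts))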

Code Repositories

sberbank-ai/ru-clip (pytorch)
sincerass/mvlpt (pytorch)
nopperl/clip_arxiv_pmc
mainaksingha01/applenet (pytorch)
FreddeFrallan/Multilingual-CLIP (pytorch)
AndresPMD/Clip_CMR (pytorch)
facebookresearch/brainmagick (pytorch)
mlfoundations/open_clip (pytorch)
jhaprince/multibully (pytorch)
baskargroup/biotrove (pytorch)
iejMac/ScriptWriter (pytorch)
salesforce/pb-ovd (pytorch)
klemens-floege/oneprot (pytorch)
sajjjadayobi/CLIPfa (pytorch)
michi-3000/eyeclip (pytorch)
ZackPashkin/text2cartoon-pytorch-CLIP (pytorch)
prabhupad26/100daysofML (pytorch)
YvanG/VQGAN-CLIP (pytorch)
SforAiDl/CountCLIP (pytorch)
dhansmair/flamingo-mini (pytorch)
eify/open_clip (pytorch)
facebookresearch/clip-rocket (pytorch)
zhangxu0963/npc (pytorch)
ylqi/count-anything (pytorch)
bespontaneous/proteus-pytorch (pytorch)
buyeah1109/finc (pytorch)
minhanh151/respro (pytorch)
minhanh151/pre (pytorch)
ericyinyzy/vlattack (tf)
ramanakshay/clip (pytorch)
facebookresearch/vissl (pytorch)
kynkaat/role-of-imagenet-classes-in-fid (pytorch)
Kaushalya/medclip (jax)
baskargroup/Arboretum (pytorch)
mertyg/post-hoc-cbm (pytorch)
IMvision12/keras-vision-models (pytorch)
mlbio-epfl/turtle (pytorch)
buyeah1109/KEN (pytorch)
NYU-DICE-Lab/open_clip (pytorch)
ml-jku/cloob (pytorch)
armaank/archlectures (pytorch)
eps696/aphantasia (pytorch)
fastscience-ai/medflamingo (pytorch)
Gahyeonkim09/AAPL (pytorch)
brown-palm/ObjectPrompt (pytorch)
clip-italian/clip-italian (jax)
mainaksingha01/odg-clip (pytorch)
nahidalam/open_clip (pytorch)
sithu31296/simple-object-tracking (pytorch)
bruthyu/bpt-vlm (pytorch)
rinnakk/japanese-clip (pytorch)
leolee99/CLIP_ITM (pytorch)
filipbasara0/simple-clip (pytorch)
moein-shariatnia/OpenAI-CLIP (pytorch)
lunaproject22/rpa (pytorch)
shunk031/simple-aesthetics-predictor (pytorch)
s-a-malik/multi-few (pytorch)
pseulki/rococo (pytorch)
shivammehta25/clip (pytorch)
openai/CLIP (official, pytorch; see the usage sketch after this list)
redcaps-dataset/redcaps-downloader (pytorch)
azshue/TPT (pytorch)
giantseaweed/decree (pytorch)
ajayjain/vectorascent (pytorch)
yuuun/clip_pytorch (pytorch)
ai-forever/ru-clip (pytorch)
borisdayma/clip-jax (jax)
taited/clip-score (pytorch)
apple/ml-mobileclip (pytorch)
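Among these, openai/CLIP is the official release. For orientation, a minimal zero-shot similarity check with its published Python API might look like the sketch below; the image path and the candidate captions are placeholders:

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load a pre-trained model and its matching image preprocessing pipeline.
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder image path and candidate captions.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a photo of a dog", "a photo of a cat"]).to(device)

with torch.no_grad():
    # Similarity logits between the image and each caption.
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probabilities:", probs)
```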

Benchmarks

Benchmark | Methodology | Metrics
action-recognition-on-rareact | CLIP | mWAP: 40.7
few-shot-image-classification-on-imagenet-0 | CLIP (ViT-B/32) | Accuracy: 63.2%
few-shot-image-classification-on-imagenet-0 | CLIP (ResNet50) | Accuracy: 59.6%
hateful-meme-classification-on-harm-p | CLIP | Accuracy: 80.6, F1: 80.3
hateful-meme-classification-on-pridemm | CLIP (fine-tuned) | Accuracy: 72.4, F1: 72.3
image-classification-on-objectnet | CLIP | Top-1 Accuracy: 72.3
image-classification-on-omnibenchmark | CLIP-RN50 | Average Top-1 Accuracy: 42.1
image-to-text-retrieval-on-coco | CLIP (zero-shot) | Recall@1: 58.4, Recall@5: 81.5, Recall@10: 88.1
long-tail-learning-on-coco-mlt | CLIP (ViT-B/16) | Average mAP: 60.17
long-tail-learning-on-coco-mlt | CLIP (ResNet-50) | Average mAP: 56.19
long-tail-learning-on-voc-mlt | CLIP (ViT-B/16) | Average mAP: 85.77
long-tail-learning-on-voc-mlt | CLIP (ResNet-50) | Average mAP: 84.30
meme-classification-on-hateful-memes | CLIP (zero-shot) | ROC-AUC: 0.661
meme-classification-on-multioff | CLIP | Accuracy: 62.4, F1: 48.1
object-categorization-on-grit | CLIP | Categorization (ablation): 48.1
object-recognition-on-shape-bias | CLIP (ViT-B) | Shape bias: 79.9
open-vocabulary-attribute-detection-on-ovad-1 | CLIP (ViT-B/16) | Mean average precision: 16.6
prompt-engineering-on-caltech-101 | CLIP | Harmonic mean: 95.40
prompt-engineering-on-dtd | CLIP | Harmonic mean: 56.37
prompt-engineering-on-eurosat | CLIP | Harmonic mean: 60.03
prompt-engineering-on-fgvc-aircraft | CLIP | Harmonic mean: 31.09
prompt-engineering-on-imagenet | CLIP | Harmonic mean: 70.22
prompt-engineering-on-imagenet-a | CLIP | Top-1 accuracy %: 47.77
prompt-engineering-on-imagenet-r | CLIP | Top-1 accuracy %: 73.96
prompt-engineering-on-imagenet-s | CLIP | Top-1 accuracy %: 46.15
prompt-engineering-on-imagenet-v2 | CLIP | Top-1 accuracy %: 60.83
prompt-engineering-on-oxford-102-flower | CLIP | Harmonic mean: 74.83
prompt-engineering-on-oxford-iiit-pet-dataset | CLIP | Harmonic mean: 94.12
prompt-engineering-on-stanford-cars-1 | CLIP | Harmonic mean: 68.65
prompt-engineering-on-sun397 | CLIP | Harmonic mean: 72.23
prompt-engineering-on-ucf101 | CLIP | Harmonic mean: 73.85
semi-supervised-image-classification-on-16 | CLIP (ResNet-50) | ImageNet Top-1 Accuracy: 40%
text-based-person-retrieval-with-noisy | CLIP-C | Rank-1: 66.41, Rank-5: 85.15, Rank-10: 90.89, mAP: 59.36, mINP: 43.02
text-based-person-retrieval-with-noisy-1 | CLIP-C | Rank-1: 55.25, Rank-5: 74.76, Rank-10: 81.32, mAP: 31.09, mINP: 4.94
text-based-person-retrieval-with-noisy-2 | CLIP-C | Rank-1: 54.45, Rank-5: 77.80, Rank-10: 86.70, mAP: 42.58, mINP: 21.38
zero-shot-cross-modal-retrieval-on-coco-2014 | CLIP | Image-to-text R@1: 58.4, R@5: 81.5, R@10: 88.1; Text-to-image R@1: 37.8, R@5: 62.4, R@10: 72.2
zero-shot-cross-modal-retrieval-on-flickr30k | CLIP | Image-to-text R@1: 88.0, R@5: 98.7, R@10: 99.4; Text-to-image R@1: 68.7, R@5: 90.6, R@10: 95.2
zero-shot-learning-on-coco-mlt | CLIP (ViT-B/16) | Average mAP: 60.17
zero-shot-learning-on-coco-mlt | CLIP (ResNet-50) | Average mAP: 56.19
zero-shot-learning-on-voc-mlt | CLIP (ViT-B/16) | Average mAP: 85.77
zero-shot-learning-on-voc-mlt | CLIP (ResNet-50) | Average mAP: 84.30
zero-shot-transfer-image-classification-on | CLIP | Accuracy: 98.4
zero-shot-transfer-image-classification-on-1 | CLIP (ViT-L/14-336px) | Accuracy (Private): 76.2
zero-shot-transfer-image-classification-on-1 | CLIP | Accuracy (Public): 31.3
zero-shot-transfer-image-classification-on-1 | CLIP (ResNet50) | Accuracy (Private): 59.6
zero-shot-transfer-image-classification-on-2 | CLIP | Accuracy: 58.5
zero-shot-transfer-image-classification-on-3 | CLIP | Accuracy (Private): 70.1
zero-shot-transfer-image-classification-on-4 | CLIP | Accuracy: 88.9
zero-shot-transfer-image-classification-on-5 | CLIP | Accuracy (Private): 77.2
zero-shot-transfer-image-classification-on-6 | CLIP | Accuracy (Private): 72.3
