Learning Transferable Visual Models From Natural Language Supervision

Abstract

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones), enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset-specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.
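
As a concrete illustration of the zero-shot transfer described above, the sketch below classifies a single image by comparing its embedding against natural-language candidate labels. It assumes the clip Python package from the linked repository is installed; the filename "example.jpg" and the label strings are hypothetical placeholders, not part of the paper's evaluation protocol.

# Minimal sketch of zero-shot classification with a released CLIP model.
# Assumes PyTorch and the clip package from https://github.com/OpenAI/CLIP
# are installed; "example.jpg" is a hypothetical local image file.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Candidate classes are written in natural language; no dataset-specific
# training or fine-tuning is performed.
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(labels).to(device)

with torch.no_grad():
    # Scaled cosine-similarity logits between the image and each caption.
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print(dict(zip(labels, probs[0])))

The highest-probability label is taken as the prediction; new visual concepts can be referenced simply by editing the label strings, with no additional labeled data.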

