Learning Transferable Visual Models From Natural Language Supervision

Abstract

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones), enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset-specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.
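
As a concrete illustration of the zero-shot transfer described above, the sketch below classifies a single image by comparing its embedding against natural-language candidate labels. It assumes the clip Python package from the linked repository is installed; the filename "example.jpg" and the label strings are hypothetical placeholders, not part of the paper's evaluation protocol.

# Minimal sketch of zero-shot classification with a released CLIP model.
# Assumes PyTorch and the clip package from https://github.com/OpenAI/CLIP
# are installed; "example.jpg" is a hypothetical local image file.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Candidate classes are written in natural language; no dataset-specific
# training or fine-tuning is performed.
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(labels).to(device)

with torch.no_grad():
    # Scaled cosine-similarity logits between the image and each caption.
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print(dict(zip(labels, probs[0])))

The highest-probability label is taken as the prediction; new visual concepts can be referenced simply by editing the label strings, with no additional labeled data.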

