

Learning to Prompt for Vision-Language Models

Kaiyang Zhou, Jingkang Yang, Chen Change Loy, Ziwei Liu


Abstract

Large pre-trained vision-language models like CLIP have shown great potential in learning representations that are transferable across a wide range of downstream tasks. Different from the traditional representation learning that is based mostly on discretized labels, vision-language pre-training aligns images and texts in a common feature space, which allows zero-shot transfer to a downstream task via prompting, i.e., classification weights are synthesized from natural language describing classes of interest. In this work, we show that a major challenge for deploying such models in practice is prompt engineering, which requires domain expertise and is extremely time-consuming: one needs to spend a significant amount of time on word tuning, since a slight change in wording could have a huge impact on performance. Inspired by recent advances in prompt learning research in natural language processing (NLP), we propose Context Optimization (CoOp), a simple approach specifically for adapting CLIP-like vision-language models for downstream image recognition. Concretely, CoOp models a prompt's context words with learnable vectors while the entire pre-trained parameters are kept fixed. To handle different image recognition tasks, we provide two implementations of CoOp: unified context and class-specific context. Through extensive experiments on 11 datasets, we demonstrate that CoOp requires as few as one or two shots to beat hand-crafted prompts by a decent margin and is able to gain significant improvements over prompt engineering with more shots, e.g., with 16 shots the average gain is around 15% (with the highest reaching over 45%). Despite being a learning-based approach, CoOp achieves superb domain generalization performance compared with the zero-shot model using hand-crafted prompts.
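The core idea of the abstract — learnable context vectors prepended to a fixed class-name embedding, with the pre-trained encoders frozen — can be sketched in a few lines. The snippet below is a minimal illustration with NumPy stand-ins for CLIP's encoders; the function and variable names, dimensions, and the toy `encode_text` pooling are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

# Illustrative sketch of CoOp-style unified-context prompts.
# All names, dimensions, and the toy encoder below are assumptions,
# not the paper's actual code (which builds on CLIP in PyTorch).

rng = np.random.default_rng(0)
dim, n_ctx, n_classes = 8, 4, 3   # embedding dim, context length, #classes

# Learnable context vectors [V]_1 ... [V]_M, shared across all classes
# (the "unified context" variant); these would be optimized by gradient descent.
context = rng.normal(size=(n_ctx, dim))

# Fixed (pre-trained) word embeddings for each class name.
class_embeds = rng.normal(size=(n_classes, dim))

def encode_text(tokens):
    """Toy stand-in for CLIP's frozen text encoder: mean-pool + L2-normalize."""
    feat = tokens.mean(axis=0)
    return feat / np.linalg.norm(feat)

# Build one prompt per class: [V]_1 ... [V]_M [CLASS]
text_features = np.stack([
    encode_text(np.vstack([context, class_embeds[c][None]]))
    for c in range(n_classes)
])

# Classification weights are the prompt features; prediction is the
# cosine similarity between them and the image feature.
image_feature = rng.normal(size=dim)
image_feature /= np.linalg.norm(image_feature)
logits = text_features @ image_feature
print(logits.shape)  # prints (3,): one score per class; argmax is the prediction
```

In the class-specific variant described in the abstract, each class would get its own `context` array instead of the shared one; only these context vectors receive gradients, while the encoder parameters stay fixed.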

Code Repositories

kaiyangzhou/coop (Official, PyTorch)
vill-lab/2024-aaai-hpt (PyTorch)
kenomo/industrial-clip (PyTorch)
hhenryd/tap (PyTorch)
ThomasWangY/2024-AAAI-HPT (PyTorch)
ArsenalCheng/Meta-Adapter (PyTorch)
saic-fi/bayesian-prompt-learning (PyTorch)
farinamatteo/zero (PyTorch)
Gahyeonkim09/AAPL (PyTorch)
kaiyangzhou/on-device-dg (PyTorch)
srvcodes/clap4clip (PyTorch)
muzairkhattak/protext (PyTorch)
healthx-lab/biomedcoop (PyTorch)
YangYongJin/APEX (PyTorch)
mlvlab/dapt (PyTorch)
Vill-Lab/2024-TIP-MetaPrompt (PyTorch)
azshue/TPT (PyTorch)

Benchmarks

Benchmark: few-shot-age-estimation-on-morph-album2
Methodology: CoOp
Metrics:
  MAE: 5.09
  MAE (2 shot): 4.50
  MAE (4 shot): 3.81
  MAE (8 shot): 3.57
  MAE (16 shot): 3.23

