Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition
Shuhuai Ren, Aston Zhang, Yi Zhu, Shuai Zhang, Shuai Zheng, Mu Li, Alex Smola, Xu Sun

Abstract
This work proposes POMP, a prompt pre-training method for vision-language models. Being memory- and computation-efficient, POMP enables the learned prompt to condense semantic information for a rich set of visual concepts spanning over twenty thousand classes. Once pre-trained, the prompt, with strong transferability, can be directly plugged into a variety of visual recognition tasks, including image classification, semantic segmentation, and object detection, to boost recognition performance in a zero-shot manner. Empirical evaluation shows that POMP achieves state-of-the-art performance across 21 datasets, e.g., 67.0% average accuracy on 10 classification datasets (+3.1% over CoOp) and 84.4 hIoU on open-vocabulary Pascal VOC segmentation (+6.9 hIoU over ZSSeg). Our code is available at https://github.com/amazon-science/prompt-pretraining.
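To make the "plug in the pre-trained prompt and classify zero-shot" claim concrete, below is a minimal sketch of how a shared learned prompt can be combined with a frozen CLIP-style backbone for zero-shot classification. The encoders, tensor shapes, and function names here are hypothetical stand-ins for illustration, not the authors' implementation; see the linked repository for the actual code.

```python
import torch
import torch.nn.functional as F

def zero_shot_logits(image_encoder, text_encoder, class_token_embs,
                     prompt_ctx, images, logit_scale=100.0):
    """Score images against class names using shared learned prompt vectors.

    image_encoder:    frozen image encoder, images -> [B, d] features (assumed)
    text_encoder:     frozen text encoder, token embeddings -> [d] feature (assumed)
    class_token_embs: list of [n_tok_i, d] token embeddings, one per class name
    prompt_ctx:       [n_ctx, d] learned prompt vectors, frozen after
                      pre-training and shared across all classes and tasks
    """
    # Prepend the shared prompt context to each class's name tokens,
    # mirroring the "a photo of a {class}" template with learned vectors.
    text_inputs = [torch.cat([prompt_ctx, emb], dim=0) for emb in class_token_embs]
    text_feats = torch.stack([text_encoder(t) for t in text_inputs])  # [C, d]
    img_feats = image_encoder(images)                                 # [B, d]

    # Cosine similarity between L2-normalized features, scaled as in CLIP.
    text_feats = F.normalize(text_feats, dim=-1)
    img_feats = F.normalize(img_feats, dim=-1)
    return logit_scale * img_feats @ text_feats.t()                   # [B, C]
```

Because the prompt context is class-agnostic, swapping in a new label set only requires re-encoding the class-name tokens; no retraining is needed, which is what enables the open-vocabulary transfer to segmentation and detection described above.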
Code Repositories
https://github.com/amazon-science/prompt-pretraining
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| open-vocabulary-object-detection-on-lvis-v1-0 | POMP | AP novel (LVIS base training): 25.2 |
| open-vocabulary-semantic-segmentation-on-5 | POMP | hIoU: 84.4; mIoU: 89.4 |
| open-vocabulary-semantic-segmentation-on-coco | POMP | hIoU: 39.1 |
| prompt-engineering-on-imagenet-21k | POMP | Accuracy: 25.3 |
| prompt-engineering-on-imagenet-a | POMP | Top-1 accuracy (%): 51.6 |
| prompt-engineering-on-imagenet-r | POMP | Top-1 accuracy (%): 77.9 |
| prompt-engineering-on-imagenet-s | POMP | Top-1 accuracy (%): 49.8 |