CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction
Size Wu; Wenwei Zhang; Lumin Xu; Sheng Jin; Xiangtai Li; Wentao Liu; Chen Change Loy

Abstract
Open-vocabulary dense prediction tasks, including object detection and image segmentation, have been advanced by the success of Contrastive Language-Image Pre-training (CLIP). CLIP models, particularly those incorporating vision transformers (ViTs), have exhibited remarkable generalization ability in zero-shot image classification. However, when transferring the vision-language alignment of CLIP from global image representations to local region representations for open-vocabulary dense prediction tasks, CLIP ViTs suffer from the domain shift from full images to local image regions. In this paper, we embark on an in-depth analysis of the region-language alignment in CLIP models, which is essential for downstream open-vocabulary dense prediction tasks. Subsequently, we propose an approach named CLIPSelf, which adapts the image-level recognition ability of a CLIP ViT to local image regions without needing any region-text pairs. CLIPSelf enables the ViT to distill itself by aligning a region representation extracted from its dense feature map with the image-level representation of the corresponding image crop. With the enhanced CLIP ViTs, we achieve new state-of-the-art performance on open-vocabulary object detection, semantic segmentation, and panoptic segmentation across various benchmarks. Models and code are released at https://github.com/wusize/CLIPSelf.
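The self-distillation objective described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names are hypothetical, average pooling stands in for the paper's region feature extraction, and the "teacher" vectors stand in for CLIP image-level embeddings of the cropped regions. The key idea it shows is aligning a region feature pooled from the dense map (student) with the crop's global embedding (teacher) via cosine similarity.

```python
import numpy as np

def pool_region(feature_map, box):
    """Average-pool dense features inside a box (x0, y0, x1, y1 in grid coords).
    A stand-in for extracting a region representation from the ViT's dense map."""
    x0, y0, x1, y1 = box
    region = feature_map[y0:y1, x0:x1, :]   # (h, w, C)
    return region.mean(axis=(0, 1))         # (C,)

def clipself_loss(feature_map, boxes, crop_embeddings):
    """Self-distillation objective (sketch): for each box, maximize the cosine
    similarity between the pooled region feature (student) and the image-level
    embedding of the corresponding crop (teacher), i.e. minimize 1 - cos."""
    losses = []
    for box, teacher in zip(boxes, crop_embeddings):
        student = pool_region(feature_map, box)
        student = student / np.linalg.norm(student)
        teacher = teacher / np.linalg.norm(teacher)
        losses.append(1.0 - float(student @ teacher))
    return sum(losses) / len(losses)

# Toy example: a 7x7 grid of 8-dim dense features and one region box.
rng = np.random.default_rng(0)
fmap = rng.normal(size=(7, 7, 8))
box = (1, 1, 4, 4)
# If the teacher embedding already matches the pooled region feature,
# the loss is (numerically) zero.
teacher = pool_region(fmap, box)
print(clipself_loss(fmap, [box], [teacher]))
```

In the actual method, both student and teacher come from the same frozen-then-finetuned CLIP ViT (hence "distills itself"), with crops encoded through the full model to produce the teacher embeddings; no region-text pairs are involved.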
Benchmarks
| Benchmark | Methodology | Metric |
|---|---|---|
| open-vocabulary-object-detection-on-lvis-v1-0 | CLIPSelf | AP novel (LVIS base training): 34.9 |
| open-vocabulary-object-detection-on-mscoco | CLIPSelf | AP50: 44.3 |
| open-vocabulary-panoptic-segmentation-on | CLIPSelf | PQ: 23.7 |
| open-vocabulary-semantic-segmentation-on-1 | CLIPSelf | mIoU: 62.3 |
| open-vocabulary-semantic-segmentation-on-2 | CLIPSelf | mIoU: 34.5 |
| open-vocabulary-semantic-segmentation-on-3 | CLIPSelf | mIoU: 12.4 |