CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction
Size Wu; Wenwei Zhang; Lumin Xu; Sheng Jin; Xiangtai Li; Wentao Liu; Chen Change Loy

Abstract
Open-vocabulary dense prediction tasks, including object detection and image segmentation, have been advanced by the success of Contrastive Language-Image Pre-training (CLIP). CLIP models, particularly those incorporating vision transformers (ViTs), have exhibited remarkable generalization ability in zero-shot image classification. However, when transferring the vision-language alignment of CLIP from global image representations to local region representations for open-vocabulary dense prediction tasks, CLIP ViTs suffer from the domain shift from full images to local image regions. In this paper, we embark on an in-depth analysis of the region-language alignment in CLIP models, which is essential for downstream open-vocabulary dense prediction tasks. Subsequently, we propose an approach named CLIPSelf, which adapts the image-level recognition ability of a CLIP ViT to local image regions without needing any region-text pairs. CLIPSelf enables the ViT to distill itself by aligning a region representation extracted from its dense feature map with the image-level representation of the corresponding image crop. With the enhanced CLIP ViTs, we achieve new state-of-the-art performance on open-vocabulary object detection, semantic segmentation, and panoptic segmentation across various benchmarks. Models and code are released at https://github.com/wusize/CLIPSelf.
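The self-distillation objective described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names are hypothetical, average pooling stands in for the paper's region feature extraction, and the "teacher" vectors stand in for CLIP image-level embeddings of the cropped regions. The key idea it shows is aligning a region feature pooled from the dense map (student) with the crop's global embedding (teacher) via cosine similarity.

```python
import numpy as np

def pool_region(feature_map, box):
    """Average-pool dense features inside a box (x0, y0, x1, y1 in grid coords).
    A stand-in for extracting a region representation from the ViT's dense map."""
    x0, y0, x1, y1 = box
    region = feature_map[y0:y1, x0:x1, :]   # (h, w, C)
    return region.mean(axis=(0, 1))         # (C,)

def clipself_loss(feature_map, boxes, crop_embeddings):
    """Self-distillation objective (sketch): for each box, maximize the cosine
    similarity between the pooled region feature (student) and the image-level
    embedding of the corresponding crop (teacher), i.e. minimize 1 - cos."""
    losses = []
    for box, teacher in zip(boxes, crop_embeddings):
        student = pool_region(feature_map, box)
        student = student / np.linalg.norm(student)
        teacher = teacher / np.linalg.norm(teacher)
        losses.append(1.0 - float(student @ teacher))
    return sum(losses) / len(losses)

# Toy example: a 7x7 grid of 8-dim dense features and one region box.
rng = np.random.default_rng(0)
fmap = rng.normal(size=(7, 7, 8))
box = (1, 1, 4, 4)
# If the teacher embedding already matches the pooled region feature,
# the loss is (numerically) zero.
teacher = pool_region(fmap, box)
print(clipself_loss(fmap, [box], [teacher]))
```

In the actual method, both student and teacher come from the same frozen-then-finetuned CLIP ViT (hence "distills itself"), with crops encoded through the full model to produce the teacher embeddings; no region-text pairs are involved.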
Benchmarks
| Benchmark | Methodology | Metric |
|---|---|---|
| open-vocabulary-object-detection-on-lvis-v1-0 | CLIPSelf | AP novel (LVIS base training): 34.9 |
| open-vocabulary-object-detection-on-mscoco | CLIPSelf | AP50: 44.3 |
| open-vocabulary-panoptic-segmentation-on | CLIPSelf | PQ: 23.7 |
| open-vocabulary-semantic-segmentation-on-1 | CLIPSelf | mIoU: 62.3 |
| open-vocabulary-semantic-segmentation-on-2 | CLIPSelf | mIoU: 34.5 |
| open-vocabulary-semantic-segmentation-on-3 | CLIPSelf | mIoU: 12.4 |