3 months ago

Exploring the Limits of Deep Image Clustering using Pretrained Models

Nikolas Adaloglou Felix Michels Hamza Kalisch Markus Kollmann

Abstract

We present a general methodology that learns to classify images without labels by leveraging pretrained feature extractors. Our approach involves self-distillation training of clustering heads based on the fact that nearest neighbours in the pretrained feature space are likely to share the same label. We propose a novel objective that learns associations between image features by introducing a variant of pointwise mutual information together with instance weighting. We demonstrate that the proposed objective is able to attenuate the effect of false positive pairs while efficiently exploiting the structure in the pretrained feature space. As a result, we improve the clustering accuracy over k-means on 17 different pretrained models by 6.1% and 12.2% on ImageNet and CIFAR100, respectively. Finally, using self-supervised vision transformers, we achieve a clustering accuracy of 61.6% on ImageNet. The code is available at https://github.com/HHU-MMBS/TEMI-official-BMVC2023.
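The abstract rests on the observation that nearest neighbours in a pretrained feature space tend to share a label. The following is a minimal sketch of how such neighbour pairs could be mined, assuming cosine similarity and synthetic 2-D features in place of real backbone embeddings. The function name `mine_nearest_neighbours` and the toy data are hypothetical and not taken from the official repository.

```python
import numpy as np


def mine_nearest_neighbours(features: np.ndarray, k: int) -> np.ndarray:
    """Return, for each row, the indices of its k most cosine-similar rows (self excluded)."""
    # L2-normalise so that the dot product equals cosine similarity.
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = normed @ normed.T
    # Exclude each sample from its own neighbour list.
    np.fill_diagonal(sim, -np.inf)
    # Sort by descending similarity and keep the first k indices.
    return np.argsort(-sim, axis=1)[:, :k]


# Toy stand-in for pretrained features: two well-separated groups (an assumption).
rng = np.random.default_rng(0)
centers = np.array([[5.0, 0.0], [0.0, 5.0]])
labels = np.repeat([0, 1], 50)
features = centers[labels] + 0.1 * rng.standard_normal((100, 2))

neighbours = mine_nearest_neighbours(features, k=5)
# Fraction of mined pairs whose two members share a label.
agreement = (labels[neighbours] == labels[:, None]).mean()
print(f"neighbour label agreement: {agreement:.3f}")
```

On real embeddings the agreement is below 1, and the remaining mismatched pairs are the false positive pairs that the proposed objective aims to attenuate.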

Code Repositories

HHU-MMBS/TEMI-official-BMVC2023 (official, PyTorch)

Benchmarks

Benchmark	Methodology	Metrics
image-clustering-on-cifar-10	TEMI DINO ViT-B	Accuracy: 0.945 ARI: 0.885 NMI: 0.886 Backbone: ViT-B Train set: Train
image-clustering-on-cifar-10	TEMI CLIP ViT-L (openai)	Accuracy: 0.969 ARI: 0.932 NMI: 0.926 Backbone: ViT-L Train set: Train
image-clustering-on-cifar-100	TEMI DINO ViT-B	Accuracy: 0.671 ARI: 0.533 NMI: 0.769 Train set: Train
image-clustering-on-cifar-100	TEMI CLIP ViT-L (openai)	Accuracy: 0.737 ARI: 0.612 NMI: 0.799 Train set: Train
image-clustering-on-imagenet	TEMI DINO (ViT-B)	Accuracy: 58.0% ARI: 45.9% NMI: 81.4%
image-clustering-on-imagenet	TEMI MSN (ViT-L)	Accuracy: 61.6% ARI: 48.4% NMI: 82.5%
image-clustering-on-imagenet-100	TEMI CLIP ViT-L (openai)	Accuracy: 0.8343 ARI: 0.7581 NMI: 0.9006
image-clustering-on-imagenet-100	TEMI MSN ViT-L	Accuracy: 0.8286 ARI: 0.7408 NMI: 0.8853
image-clustering-on-imagenet-100	TEMI DINO ViT-B	Accuracy: 0.7505 ARI: 0.6545 NMI: 0.8565
image-clustering-on-imagenet-200	TEMI CLIP ViT-L (openai)	-
image-clustering-on-imagenet-200	TEMI DINO ViT-B	-
image-clustering-on-imagenet-200	TEMI MSN ViT-L	-
image-clustering-on-imagenet-50-1	TEMI DINO ViT-B	Accuracy: 0.801 ARI: 0.7093 NMI: 0.8610
image-clustering-on-imagenet-50-1	TEMI CLIP ViT-L (openai)	Accuracy: 0.8827 ARI: 0.8272 NMI: 0.9232
image-clustering-on-imagenet-50-1	TEMI MSN ViT-L	Accuracy: 0.8487 ARI: 0.7646 NMI: 0.8814
image-clustering-on-stl-10	TEMI DINO ViT-B	Accuracy: 0.985 ARI: 0.968 NMI: 0.965 Backbone: ViT-B Train set: Train
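
The Accuracy column reports clustering accuracy. This is conventionally computed by finding the one-to-one mapping between predicted cluster IDs and ground-truth classes that maximises the number of matches. The following is a minimal sketch of that standard computation using Hungarian matching from SciPy. The function name `clustering_accuracy` and the toy labels are hypothetical, and the official evaluation code may differ in detail.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def clustering_accuracy(y_true, y_pred) -> float:
    """Accuracy under the best one-to-one mapping of cluster IDs to class labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n = int(max(y_true.max(), y_pred.max())) + 1
    # counts[c, t] = number of samples assigned to cluster c whose true class is t.
    counts = np.zeros((n, n), dtype=np.int64)
    np.add.at(counts, (y_pred, y_true), 1)
    # Hungarian matching: pick the cluster-to-class assignment with the most matches.
    rows, cols = linear_sum_assignment(counts, maximize=True)
    return counts[rows, cols].sum() / y_true.size


# Toy example: cluster IDs are permuted relative to the classes, with one mistake.
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [1, 1, 1, 2, 2, 0, 0, 0, 0]
acc = clustering_accuracy(y_true, y_pred)
print(f"clustering accuracy: {acc:.3f}")  # 8 of 9 samples matched -> 0.889
```

ARI and NMI do not depend on such a label mapping. In scikit-learn they are available as `sklearn.metrics.adjusted_rand_score` and `sklearn.metrics.normalized_mutual_info_score`.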
