8 months ago

Computer Vision

Image Recognition

Method/Architecture

Kazumoto Nakamura Yuji Nozawa Yu-Chieh Lin Kengo Nakata Youyang Ng

Abstract

The goal of this paper is to improve the performance of pretrained Vision Transformer (ViT) models, particularly DINOv2, on image clustering tasks without requiring re-training or fine-tuning. As model size increases, a high-norm artifact anomaly appears in the patches of multi-head attention. We observe that this anomaly leads to reduced accuracy in zero-shot image clustering. These artifacts are characterized by disproportionately large values in the attention map compared to other patch tokens. To address these artifacts, we propose an approach called Inference-Time Attention Engineering (ITAE), which manipulates the attention function during inference. Specifically, we identify the artifacts by investigating one of the Query-Key-Value (QKV) patches in the multi-head attention and attenuate their corresponding attention values inside the pretrained models. ITAE shows improved clustering accuracy on multiple datasets by exhibiting more expressive features in latent space. Our findings highlight the potential of ITAE as a practical solution for reducing artifacts in pretrained ViT models and improving model performance in clustering tasks without the need for re-training or fine-tuning.
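The two steps named in the abstract (identify artifact tokens from one of the Q, K, V patches, then attenuate their attention values at inference time) can be illustrated with a minimal NumPy sketch for a single attention head. This is not the authors' implementation. The abstract does not say which of Q, K, or V is inspected, how artifacts are detected, or how strongly they are attenuated, so every such choice here is an assumption: the use of value tokens, the z-score rule in the hypothetical `find_artifacts`, and the fixed logit `penalty` in the hypothetical `attenuated_attention`.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def find_artifacts(tokens, z_thresh=3.0):
    # Assumption: a token is an artifact when its L2 norm is an
    # outlier (z-score above z_thresh) among the tokens of one head.
    norms = np.linalg.norm(tokens, axis=-1)
    z = (norms - norms.mean()) / (norms.std() + 1e-8)
    return z > z_thresh

def attenuated_attention(q, k, v, artifact_mask, penalty=10.0):
    # Scaled dot-product attention for one head, shapes (n, d).
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)
    # Assumption: attenuate by subtracting a fixed penalty from the
    # logits of artifact columns, so no query attends to them strongly.
    logits = logits - penalty * artifact_mask[None, :]
    weights = softmax(logits, axis=-1)
    return weights @ v, weights

# Toy data: 32 tokens, 8 dimensions, one planted high-norm value token.
rng = np.random.default_rng(0)
n, d = 32, 8
q = rng.normal(size=(n, d))
k = rng.normal(size=(n, d))
v = rng.normal(size=(n, d))
v[5] *= 20.0

mask = find_artifacts(v)
out, w = attenuated_attention(q, k, v, mask)
# Baseline without attenuation, for comparison.
out0, w0 = attenuated_attention(q, k, v, np.zeros(n, dtype=bool))
```

Because the weights do not change and only the attention computation is altered, a modification of this kind needs no re-training. In a real ViT the same idea would have to be applied per head inside the chosen attention layers, which this sketch does not attempt.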
