Harnessing Diffusion Models for Visual Perception with Meta Prompts

Qiang Wan, Zilong Huang, Bingyi Kang, Jiashi Feng, Li Zhang

Abstract

Generative pre-training for vision models has long been an open problem. Text-to-image (T2I) diffusion models now demonstrate remarkable proficiency in generating high-definition images that match textual inputs, a feat made possible by pre-training on large-scale image-text pairs. This raises a natural question: can diffusion models be used to tackle visual perception tasks? In this paper, we propose a simple yet effective scheme that harnesses a diffusion model for visual perception. Our key insight is to introduce learnable embeddings (meta prompts) into the pre-trained diffusion model to extract features suited to perception. The effect of meta prompts is two-fold. First, as a direct replacement for the text embeddings in the T2I model, they activate task-relevant features during feature extraction. Second, they are used to re-arrange the extracted features, ensuring that the model focuses on the features most pertinent to the task at hand. Additionally, we design a recurrent refinement training strategy that fully leverages the properties of diffusion models, yielding stronger visual features. Extensive experiments across various benchmarks validate the effectiveness of our approach. It sets new performance records for depth estimation on NYU Depth V2 and KITTI and for semantic segmentation on Cityscapes, while attaining results comparable to the current state of the art for semantic segmentation on ADE20K and pose estimation on COCO, further demonstrating its robustness and versatility.
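
For concreteness, below is a minimal PyTorch sketch of how the two mechanisms described in the abstract could fit together: learnable meta prompts that stand in for the text embeddings conditioning the diffusion UNet, and a recurrent refinement loop that runs the UNet several times before the prompts re-arrange the extracted features. All names here (`MetaPromptAdapter`, `num_prompts`, `refine_steps`, `StubUNet`) are illustrative assumptions, not the authors' implementation; in the paper, features are presumably extracted from internal UNet blocks rather than the final output. See the official fudan-zvg/meta-prompts repository for the real code.

```python
import torch
import torch.nn as nn


class MetaPromptAdapter(nn.Module):
    """Sketch of meta prompts wrapped around a frozen diffusion UNet.

    Learnable meta prompts replace the text embeddings that normally
    condition the UNet's cross-attention; the same prompts are then used
    to re-arrange the extracted features for the downstream task.
    """

    def __init__(self, unet: nn.Module, num_prompts: int = 64,
                 embed_dim: int = 768, feat_dim: int = 4,
                 refine_steps: int = 3):
        super().__init__()
        self.unet = unet  # pre-trained denoising UNet, kept frozen
        for p in self.unet.parameters():
            p.requires_grad_(False)
        # Learnable embeddings standing in for CLIP text embeddings.
        self.meta_prompts = nn.Parameter(torch.randn(num_prompts, embed_dim) * 0.02)
        # Lets the prompts also act as queries over the extracted features.
        self.prompt_to_query = nn.Linear(embed_dim, feat_dim)
        self.refine_steps = refine_steps

    def rearrange(self, feats: torch.Tensor) -> torch.Tensor:
        """Re-arrange spatial features with the prompts as attention queries.

        feats: (B, C, H, W) feature map extracted from the UNet.
        Returns (B, N, C): one feature vector per prompt, so the task head
        sees the most task-relevant content first.
        """
        b, c, h, w = feats.shape
        kv = feats.flatten(2).transpose(1, 2)          # (B, HW, C)
        q = self.prompt_to_query(self.meta_prompts)    # (N, C)
        attn = torch.einsum('nc,bpc->bnp', q, kv) / c ** 0.5
        attn = attn.softmax(dim=-1)                    # where each prompt looks
        return torch.bmm(attn, kv)                     # (B, N, C)

    def forward(self, latents: torch.Tensor) -> torch.Tensor:
        """Recurrent refinement: run the UNet several times, feeding its
        output back in, conditioned on meta prompts instead of text."""
        b = latents.size(0)
        prompts = self.meta_prompts.unsqueeze(0).expand(b, -1, -1)
        t = torch.zeros(b, dtype=torch.long, device=latents.device)
        x = latents
        for _ in range(self.refine_steps):
            # With a diffusers UNet2DConditionModel this call would be
            # self.unet(x, t, encoder_hidden_states=prompts).sample
            x = self.unet(x, t, prompts)
        return self.rearrange(x)


if __name__ == "__main__":
    # Tiny stand-in for the diffusion UNet so the sketch runs end to end;
    # any module mapping (x, t, prompts) -> a tensor shaped like x works.
    class StubUNet(nn.Module):
        def __init__(self, ch: int = 4):
            super().__init__()
            self.conv = nn.Conv2d(ch, ch, 3, padding=1)

        def forward(self, x, t, prompts):
            return self.conv(x)

    adapter = MetaPromptAdapter(StubUNet(), num_prompts=8,
                                embed_dim=32, feat_dim=4)
    out = adapter(torch.randn(2, 4, 32, 32))
    print(out.shape)  # torch.Size([2, 8, 4])
```

The stub UNet exists purely so the sketch is runnable; swapping in a real Stable Diffusion UNet changes only the single call noted in the comment inside `forward`.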

Code Repositories

fudan-zvg/meta-prompts (Official, PyTorch)

Benchmarks

Benchmark | Methodology | Metrics
monocular-depth-estimation-on-kitti-eigen | MetaPrompt-SD | δ < 1.25: 0.981; δ < 1.25²: 0.998; δ < 1.25³: 1.000; RMSE: 1.928; RMSE log: 0.071; Sq Rel: 0.125; absolute relative error: 0.047
monocular-depth-estimation-on-nyu-depth-v2 | MetaPrompt-SD | δ < 1.25: 0.976; δ < 1.25²: 0.997; δ < 1.25³: 0.999; RMSE: 0.223; absolute relative error: 0.061; log10: 0.027
pose-estimation-on-coco | MetaPrompt-SD | AP: 79.0
semantic-segmentation-on-ade20k | MetaPrompt-SD | Validation mIoU: 56.8
semantic-segmentation-on-cityscapes | MetaPrompt-SD | Mean IoU (class): 86.2
semantic-segmentation-on-cityscapes-val | MetaPrompt-SD | mIoU: 87.1
