Controlling Vision-Language Models for Multi-Task Image Restoration
Ziwei Luo, Fredrik K. Gustafsson, Zheng Zhao, Jens Sjölund, Thomas B. Schön

Abstract
Vision-language models such as CLIP have shown great impact on diverse downstream tasks for zero-shot or label-free predictions. However, when it comes to low-level vision such as image restoration, their performance deteriorates dramatically due to corrupted inputs. In this paper, we present a degradation-aware vision-language model (DA-CLIP) to better transfer pretrained vision-language models to low-level vision tasks as a multi-task framework for image restoration. More specifically, DA-CLIP trains an additional controller that adapts the fixed CLIP image encoder to predict high-quality feature embeddings. By integrating the embedding into an image restoration network via cross-attention, we are able to pilot the model to learn high-fidelity image reconstruction. The controller itself also outputs a degradation feature that matches the real corruptions of the input, yielding a natural classifier for different degradation types. In addition, we construct a mixed-degradation dataset with synthetic captions for DA-CLIP training. Our approach advances state-of-the-art performance on both *degradation-specific* and *unified* image restoration tasks, showing a promising direction of prompting image restoration with large-scale pretrained vision-language models. Our code is available at https://github.com/Algolzw/daclip-uir.
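The abstract describes two coupled pieces: a controller that adapts the frozen CLIP image encoder, producing a content embedding (injected into the restoration network via cross-attention) and a degradation embedding (usable as a classifier of corruption type). The following is a minimal PyTorch sketch of that wiring only; the module names, layer sizes, and toy backbone are hypothetical illustrations and not the authors' implementation, which lives in the linked repository.

```python
# Hedged sketch of the DA-CLIP idea as stated in the abstract:
# a controller predicts (content, degradation) embeddings from a
# low-quality input, and the content embedding conditions a
# restoration network through cross-attention.
import torch
import torch.nn as nn

class Controller(nn.Module):
    """Lightweight adapter over a corrupted input image.
    The backbone here is a toy stand-in, not the real architecture."""
    def __init__(self, embed_dim: int = 512):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.GELU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Trained to match the frozen CLIP encoder's high-quality feature.
        self.to_content = nn.Linear(128, embed_dim)
        # Trained to match a text embedding of the degradation type.
        self.to_degradation = nn.Linear(128, embed_dim)

    def forward(self, lq_image: torch.Tensor):
        h = self.backbone(lq_image)
        return self.to_content(h), self.to_degradation(h)

class CrossAttentionBlock(nn.Module):
    """Injects the predicted content embedding into restoration features."""
    def __init__(self, dim: int, embed_dim: int = 512, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            dim, heads, kdim=embed_dim, vdim=embed_dim, batch_first=True
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats: torch.Tensor, content_emb: torch.Tensor):
        # feats: (B, N, dim) flattened spatial features of the restorer;
        # content_emb: (B, embed_dim) from the controller.
        ctx = content_emb.unsqueeze(1)              # (B, 1, embed_dim)
        out, _ = self.attn(self.norm(feats), ctx, ctx)
        return feats + out                          # residual injection

# Toy usage: the degradation embedding could additionally be compared
# against CLIP text features of degradation names ("rainy", "hazy", ...)
# to classify the corruption, as the abstract suggests.
controller = Controller()
block = CrossAttentionBlock(dim=128)
lq = torch.randn(2, 3, 64, 64)
content, degradation = controller(lq)
feats = torch.randn(2, 16 * 16, 128)
feats = block(feats, content)
```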
Code Repositories
https://github.com/Algolzw/daclip-uir
Benchmarks
| Benchmark | Methodology | PSNR (dB) | SSIM | LPIPS |
|---|---|---|---|---|
| Image Dehazing on RESIDE-6K | DA-CLIP | 30.16 | 0.936 | n/a |
| Low-Light Image Enhancement on LOL | DA-CLIP | 23.77 (average) | 0.830 | 0.083 |
| Single Image Deraining on Rain100H | DA-CLIP | 33.91 | 0.926 | n/a |