Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think
Gonzalo Martin Garcia, Karim Abou Zeid, Christian Schmidt, Daan de Geus, Alexander Hermans, Bastian Leibe

Abstract
Recent work showed that large diffusion models can be reused as highly precise monocular depth estimators by casting depth estimation as an image-conditional image generation task. While the proposed model achieved state-of-the-art results, high computational demands due to multi-step inference limited its use in many scenarios. In this paper, we show that the perceived inefficiency was caused by a flaw in the inference pipeline that has so far gone unnoticed. The fixed model performs comparably to the best previously reported configuration while being more than 200× faster. To optimize for downstream task performance, we perform end-to-end fine-tuning on top of the single-step model with task-specific losses and get a deterministic model that outperforms all other diffusion-based depth and normal estimation models on common zero-shot benchmarks. Surprisingly, we find that this fine-tuning protocol also works directly on Stable Diffusion and achieves performance comparable to current state-of-the-art diffusion-based depth and normal estimation models, calling into question some of the conclusions drawn from prior works.
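To make the fine-tuning idea concrete, here is a minimal sketch of one end-to-end update with a task-specific loss. A scale-and-shift-invariant L1 loss is a common choice for affine-invariant depth; the paper's exact loss and pipeline may differ, and `one_step_model` is a hypothetical stand-in for the fixed single-step diffusion pipeline run with gradients enabled.

```python
import torch

def affine_invariant_l1(pred, gt, mask):
    """Scale-and-shift-invariant L1 depth loss (one common task-specific
    choice; the paper's exact formulation may differ).

    pred, gt: (B, H, W) depth maps; mask: (B, H, W) bool validity mask.
    """
    losses = []
    for p, g, m in zip(pred, gt, mask):
        x, y = p[m], g[m]
        # Closed-form least-squares scale s and shift t aligning x to y.
        xc = x - x.mean()
        s = (xc * (y - y.mean())).sum() / xc.pow(2).sum().clamp_min(1e-6)
        t = y.mean() - s * x.mean()
        losses.append((s * x + t - y).abs().mean())
    return torch.stack(losses).mean()

def train_step(one_step_model, batch, optimizer):
    # `one_step_model` stands in for the single-step diffusion pipeline
    # (VAE encode -> one denoising step -> VAE decode), run with
    # gradients enabled so the loss reaches the network weights.
    pred = one_step_model(batch["image"])           # (B, H, W) depth
    loss = affine_invariant_l1(pred, batch["depth"], batch["mask"])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()
```

Because the single-step prediction is differentiable end to end, the task loss backpropagates straight through the decoder and denoiser, which is what turns the generative model into a deterministic estimator.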
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| Monocular Depth Estimation on NYU Depth V2 | Marigold + E2E FT (zero-shot) | δ < 1.25: 0.966; AbsRel: 0.052 |
| Surface Normals Estimation on iBims-1 | Marigold + E2E FT (zero-shot) | % < 11.25°: 69.9; Mean angle error: 15.8 |
| Surface Normals Estimation on NYU Depth V2 | Marigold + E2E FT (zero-shot) | % < 11.25°: 61.4; Mean angle error: 16.2 |
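For reference, these are the standard zero-shot evaluation quantities: δ < 1.25 is the fraction of pixels whose depth ratio max(pred/gt, gt/pred) falls below 1.25, AbsRel is the mean absolute relative error, and the normal metrics are the mean angular error and the fraction of pixels with angular error below 11.25°. A minimal sketch, assuming predictions are already least-squares aligned to the ground truth (as is usual for affine-invariant methods):

```python
import torch

def depth_metrics(pred, gt):
    # pred, gt: 1-D tensors of positive depths over valid pixels,
    # already affinely aligned to the ground truth.
    ratio = torch.maximum(pred / gt, gt / pred)
    delta1 = (ratio < 1.25).float().mean().item()      # "δ < 1.25"
    abs_rel = ((pred - gt).abs() / gt).mean().item()   # AbsRel
    return delta1, abs_rel

def normal_metrics(pred_n, gt_n):
    # pred_n, gt_n: (N, 3) unit surface normals over valid pixels.
    cos = (pred_n * gt_n).sum(dim=-1).clamp(-1.0, 1.0)
    ang = torch.rad2deg(torch.acos(cos))
    pct = (ang < 11.25).float().mean().item()          # "% < 11.25°"
    mean_err = ang.mean().item()                       # mean angle error
    return pct, mean_err
```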