René Ranftl, Alexey Bochkovskiy, Vladlen Koltun

Abstract
We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks. We assemble tokens from various stages of the vision transformer into image-like representations at various resolutions and progressively combine them into full-resolution predictions using a convolutional decoder. The transformer backbone processes representations at a constant and relatively high resolution and has a global receptive field at every stage. These properties allow the dense vision transformer to provide finer-grained and more globally coherent predictions when compared to fully-convolutional networks. Our experiments show that this architecture yields substantial improvements on dense prediction tasks, especially when a large amount of training data is available. For monocular depth estimation, we observe an improvement of up to 28% in relative performance when compared to a state-of-the-art fully-convolutional network. When applied to semantic segmentation, dense vision transformers set a new state of the art on ADE20K with 49.02% mIoU. We further show that the architecture can be fine-tuned on smaller datasets such as NYUv2, KITTI, and Pascal Context where it also sets the new state of the art. Our models are available at https://github.com/intel-isl/DPT.
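The step of assembling tokens into image-like representations can be illustrated with a minimal sketch. It assumes that the transformer emits one special leading token followed by patch tokens in row-major order, and that the special token is simply dropped; the abstract states none of these details, and the function name `reassemble` and its parameters are hypothetical. The sketch covers only the rearrangement into a grid, not the resampling to different resolutions or the convolutional decoder.

```python
def reassemble(tokens, grid_h, grid_w, has_leading_token=True):
    """Arrange a flat token sequence into a grid_h x grid_w grid of feature vectors.

    Assumptions (not taken from the abstract): patch tokens are stored in
    row-major order, and an optional leading special token is discarded.
    """
    patch_tokens = tokens[1:] if has_leading_token else tokens
    if len(patch_tokens) != grid_h * grid_w:
        raise ValueError(
            f"expected {grid_h * grid_w} patch tokens, got {len(patch_tokens)}"
        )
    # Slice the flat sequence into rows of grid_w tokens each.
    return [
        patch_tokens[row * grid_w:(row + 1) * grid_w]
        for row in range(grid_h)
    ]


# Toy input: one special token plus 6 patch tokens with 1-dimensional features.
tokens = [[0.0]] + [[float(i)] for i in range(1, 7)]
feature_map = reassemble(tokens, grid_h=2, grid_w=3)
```

In this toy case `feature_map[r][c]` holds the feature vector of the patch at row `r` and column `c`, which is the image-like layout a convolutional decoder expects.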
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| monocular-depth-estimation-on-eth3d | DPT | Delta < 1.25: 0.0946; absolute relative error: 0.078 |
| monocular-depth-estimation-on-kitti-eigen | DPT-Hybrid | Delta < 1.25: 0.959; Delta < 1.25^2: 0.995; Delta < 1.25^3: 0.999; RMSE: 2.573; RMSE log: 0.092; absolute relative error: 0.062 |
| monocular-depth-estimation-on-nyu-depth-v2 | DPT-Hybrid | Delta < 1.25: 0.904; Delta < 1.25^2: 0.988; Delta < 1.25^3: 0.994; RMSE: 0.357; absolute relative error: 0.110; log 10: 0.045 |
| semantic-segmentation-on-ade20k | DPT-Hybrid | Validation mIoU: 49.02 |
| semantic-segmentation-on-ade20k-val | DPT-Hybrid | Pixel Accuracy: 83.11; mIoU: 49.02 |
| semantic-segmentation-on-pascal-context | DPT-Hybrid | mIoU: 60.46 |
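For readers unfamiliar with the depth metrics in the table, the sketch below computes them using their commonly used definitions. It is an illustration only: the evaluation protocol behind the reported numbers (depth caps, cropping, scale alignment, units) is not given here, and the function name `depth_metrics` and the sample values are hypothetical.

```python
import math


def depth_metrics(pred, gt):
    """Common monocular-depth metrics over paired, strictly positive depths."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and of equal length")
    n = len(pred)
    pairs = list(zip(pred, gt))
    # Threshold accuracy uses the symmetric ratio max(pred/gt, gt/pred).
    ratios = [max(p / g, g / p) for p, g in pairs]
    return {
        "delta1": sum(r < 1.25 for r in ratios) / n,
        "delta2": sum(r < 1.25 ** 2 for r in ratios) / n,
        "delta3": sum(r < 1.25 ** 3 for r in ratios) / n,
        # Mean of |pred - gt| / gt.
        "abs_rel": sum(abs(p - g) / g for p, g in pairs) / n,
        # Root mean squared error in depth units.
        "rmse": math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n),
        # Root mean squared error of natural-log depths.
        "rmse_log": math.sqrt(
            sum((math.log(p) - math.log(g)) ** 2 for p, g in pairs) / n
        ),
        # Mean absolute error of base-10 log depths.
        "log10": sum(abs(math.log10(p) - math.log10(g)) for p, g in pairs) / n,
    }


# Made-up sample values, purely for illustration.
pred = [1.0, 2.0, 4.0, 10.0]
gt = [1.0, 2.5, 4.0, 5.0]
metrics = depth_metrics(pred, gt)
```

Higher is better for the three threshold accuracies, while lower is better for the error metrics (absolute relative error, RMSE, RMSE log, log 10).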