Vision Transformers for Dense Prediction

René Ranftl, Alexey Bochkovskiy, Vladlen Koltun

Abstract

We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks. We assemble tokens from various stages of the vision transformer into image-like representations at various resolutions and progressively combine them into full-resolution predictions using a convolutional decoder. The transformer backbone processes representations at a constant and relatively high resolution and has a global receptive field at every stage. These properties allow the dense vision transformer to provide finer-grained and more globally coherent predictions when compared to fully-convolutional networks. Our experiments show that this architecture yields substantial improvements on dense prediction tasks, especially when a large amount of training data is available. For monocular depth estimation, we observe an improvement of up to 28% in relative performance when compared to a state-of-the-art fully-convolutional network. When applied to semantic segmentation, dense vision transformers set a new state of the art on ADE20K with 49.02% mIoU. We further show that the architecture can be fine-tuned on smaller datasets such as NYUv2, KITTI, and Pascal Context where it also sets the new state of the art. Our models are available at https://github.com/intel-isl/DPT.
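The step of assembling tokens into image-like representations can be illustrated with a minimal sketch. It assumes details the abstract does not state: a ViT-style sequence with one leading readout (class) token, patch tokens in row-major order, and a square patch size. The function name `tokens_to_grid` and its parameters are hypothetical, and plain Python lists stand in for tensors; the released models may handle the readout token and resampling differently.

```python
def tokens_to_grid(tokens, image_height, image_width, patch_size,
                   has_readout_token=True):
    """Arrange a flat token sequence into a 2D grid of feature vectors.

    tokens: list of feature vectors, one per token (hypothetical layout:
            optional readout token first, then patch tokens in row-major order).
    Returns a list of rows, each row a list of feature vectors, so that
    grid[r][c] is the token for the patch at row r, column c.
    """
    rows = image_height // patch_size
    cols = image_width // patch_size

    # Assumption: the readout token carries no spatial position, so this
    # sketch simply drops it before reshaping.
    patch_tokens = tokens[1:] if has_readout_token else tokens

    if len(patch_tokens) != rows * cols:
        raise ValueError(
            f"expected {rows * cols} patch tokens, got {len(patch_tokens)}"
        )

    # Slice the flat sequence into consecutive rows of `cols` tokens each.
    return [patch_tokens[r * cols:(r + 1) * cols] for r in range(rows)]


# Usage: a 32x48 image with 16x16 patches gives a 2x3 grid of tokens.
example_tokens = [[-1.0]] + [[float(i)] for i in range(6)]
grid = tokens_to_grid(example_tokens, image_height=32, image_width=48,
                      patch_size=16)
```

Because every stage of the transformer keeps the same number of tokens, the same reshaping applies to tokens taken from any stage; producing representations at different resolutions would then need an additional resampling step, which this sketch omits.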

Code Repositories

Repository	Framework	Note
mszpc/3d_dense	mindspore	
antocad/FocusOnDepth	pytorch	Mentioned in GitHub
isl-org/MiDaS	pytorch	Mentioned in GitHub
kritiksoman/GIMP-ML	pytorch	
alexeyab/midas	pytorch	Mentioned in GitHub
vishal-kataria/MiDaS-master	pytorch	Mentioned in GitHub
EPFL-VILAB/3DCommonCorruptions	pytorch	Mentioned in GitHub
Expedit-LargeScale-Vision-Transformer/Expedit-DPT	pytorch	Mentioned in GitHub
huggingface/transformers	pytorch	Mentioned in GitHub
BR-IDL/PaddleViT/tree/main/semantic_segmentation	paddle	
chriswxho/dynamic-inference	pytorch	Mentioned in GitHub
SforAiDl/vformer	pytorch	Mentioned in GitHub
intel-isl/MiDaS	pytorch	Mentioned in GitHub
ahmedmostafa0x61/Depth_Estimation	pytorch	Mentioned in GitHub
danielzgsilva/MonoDepthAttacks	pytorch	Mentioned in GitHub

Benchmarks

Benchmark	Methodology	Metrics
monocular-depth-estimation-on-eth3d	DPT	Delta < 1.25: 0.0946; absolute relative error: 0.078
monocular-depth-estimation-on-kitti-eigen	DPT-Hybrid	Delta < 1.25: 0.959; Delta < 1.25^2: 0.995; Delta < 1.25^3: 0.999; RMSE: 2.573; RMSE log: 0.092; absolute relative error: 0.062
monocular-depth-estimation-on-nyu-depth-v2	DPT-Hybrid	Delta < 1.25: 0.904; Delta < 1.25^2: 0.988; Delta < 1.25^3: 0.994; RMSE: 0.357; absolute relative error: 0.110; log 10: 0.045
semantic-segmentation-on-ade20k	DPT-Hybrid	Validation mIoU: 49.02
semantic-segmentation-on-ade20k-val	DPT-Hybrid	Pixel Accuracy: 83.11; mIoU: 49.02
semantic-segmentation-on-pascal-context	DPT-Hybrid	mIoU: 60.46
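For readers unfamiliar with the metric names in the table, the following sketch shows how they are commonly defined. It is not the evaluation code behind the reported numbers: the function names are hypothetical, inputs are flat Python lists, and benchmark-specific details (valid-pixel masks, depth caps, scale alignment, accumulating IoU over a whole dataset) are assumptions left out here.

```python
import math


def depth_metrics(pred, gt):
    """Common monocular depth metrics over paired per-pixel depths.

    Pixels with non-positive prediction or ground truth are skipped
    (an assumption; benchmarks define their own validity masks).
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if p > 0 and g > 0]
    if not pairs:
        raise ValueError("no valid pixels to evaluate")
    n = len(pairs)

    # Threshold accuracy: fraction of pixels whose ratio to the ground
    # truth, taken in whichever direction is larger, is below 1.25^k.
    ratios = [max(p / g, g / p) for p, g in pairs]

    return {
        "delta_1": sum(r < 1.25 for r in ratios) / n,
        "delta_2": sum(r < 1.25 ** 2 for r in ratios) / n,
        "delta_3": sum(r < 1.25 ** 3 for r in ratios) / n,
        # Absolute relative error: mean of |pred - gt| / gt.
        "abs_rel": sum(abs(p - g) / g for p, g in pairs) / n,
        # Root mean squared error, in the units of the depth values.
        "rmse": math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n),
    }


def mean_iou(pred_labels, gt_labels, num_classes):
    """Mean intersection-over-union across classes, as a fraction in [0, 1].

    Classes absent from both prediction and ground truth are skipped
    (an assumption about how empty classes are treated).
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, g in zip(pred_labels, gt_labels)
                    if p == c and g == c)
        union = sum(1 for p, g in zip(pred_labels, gt_labels)
                    if p == c or g == c)
        if union > 0:
            ious.append(inter / union)
    if not ious:
        raise ValueError("no class present in prediction or ground truth")
    return sum(ious) / len(ious)
```

The table reports mIoU and pixel accuracy as percentages, so a value returned by `mean_iou` would be multiplied by 100 for comparison. Higher is better for the Delta and mIoU columns; lower is better for the error columns.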
