Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers

Abstract
Most recent semantic segmentation methods adopt a fully-convolutional network (FCN) with an encoder-decoder architecture. The encoder progressively reduces the spatial resolution and learns more abstract/semantic visual concepts with larger receptive fields. Since context modeling is critical for segmentation, the latest efforts have been focused on increasing the receptive field, through either dilated/atrous convolutions or inserting attention modules. However, the encoder-decoder based FCN architecture remains unchanged. In this paper, we aim to provide an alternative perspective by treating semantic segmentation as a sequence-to-sequence prediction task. Specifically, we deploy a pure transformer (i.e., without convolution and resolution reduction) to encode an image as a sequence of patches. With the global context modeled in every layer of the transformer, this encoder can be combined with a simple decoder to provide a powerful segmentation model, termed SEgmentation TRansformer (SETR). Extensive experiments show that SETR achieves new state of the art on ADE20K (50.28% mIoU), Pascal Context (55.83% mIoU) and competitive results on Cityscapes. Particularly, we achieve the first position in the highly competitive ADE20K test server leaderboard on the day of submission.
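To make the sequence-to-sequence formulation concrete, below is a minimal PyTorch sketch in the spirit of the SETR-Naive variant: the image is split into non-overlapping patches, linearly projected into a token sequence, processed by a pure transformer encoder (global self-attention at every layer, with no resolution reduction of the token grid), and decoded by a simple per-patch classifier followed by bilinear upsampling. All hyperparameters here (embedding dim 256, 4 layers, 19 classes, etc.) are illustrative assumptions, not the paper's ViT-Large configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SETRNaiveSketch(nn.Module):
    """Illustrative SETR-style model: ViT encoder + naive upsampling decoder.

    Hyperparameters are placeholders, not the paper's exact settings.
    """
    def __init__(self, img_size=256, patch_size=16, in_chans=3,
                 embed_dim=256, depth=4, num_heads=8, num_classes=19):
        super().__init__()
        assert img_size % patch_size == 0
        self.grid = img_size // patch_size            # patches per side
        num_patches = self.grid ** 2

        # Patch embedding: a strided conv is equivalent to flattening
        # non-overlapping patches and applying a shared linear projection.
        self.patch_embed = nn.Conv2d(in_chans, embed_dim,
                                     kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))

        # Pure transformer encoder: global self-attention in every layer,
        # token-grid resolution stays constant (no downsampling).
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           dim_feedforward=embed_dim * 4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

        # Simple decoder: 1x1 projection to class logits, then upsample.
        self.head = nn.Conv2d(embed_dim, num_classes, kernel_size=1)

    def forward(self, x):
        B, _, H, W = x.shape
        tokens = self.patch_embed(x)                  # (B, C, H/16, W/16)
        tokens = tokens.flatten(2).transpose(1, 2)    # (B, N, C) sequence
        tokens = self.encoder(tokens + self.pos_embed)
        feat = tokens.transpose(1, 2).reshape(B, -1, self.grid, self.grid)
        logits = self.head(feat)                      # per-patch class scores
        return F.interpolate(logits, size=(H, W),
                             mode="bilinear", align_corners=False)

model = SETRNaiveSketch()
out = model(torch.randn(2, 3, 256, 256))
print(out.shape)  # torch.Size([2, 19, 256, 256])
```

The key contrast with an FCN encoder is visible in the forward pass: the token sequence keeps a fixed H/16 x W/16 spatial resolution through every encoder layer, so context is modeled globally by attention rather than by progressively enlarging convolutional receptive fields. The paper's PUP and MLA variants replace the single-step upsampling head with progressive upsampling and multi-level feature aggregation, respectively.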
Benchmarks
| Benchmark | Method | Metric |
|---|---|---|
| medical-image-segmentation-on-synapse-multi | SETR | Avg DSC: 79.60 |
| semantic-segmentation-on-ade20k | SETR-MLA (160k, MS) | Validation mIoU: 50.28% |
| semantic-segmentation-on-cityscapes | SETR-PUP++ | Mean IoU (class): 81.64% |
| semantic-segmentation-on-cityscapes-val | SETR-PUP (80k, MS) | mIoU: 82.15% |
| semantic-segmentation-on-dada-seg | SETR (PUP, Transformer-L) | mIoU: 31.8% |
| semantic-segmentation-on-dada-seg | SETR (MLA, Transformer-L) | mIoU: 30.4% |
| semantic-segmentation-on-densepass | SETR (MLA, Transformer-L) | mIoU: 35.6% |
| semantic-segmentation-on-densepass | SETR (PUP, Transformer-L) | mIoU: 35.7% |
| semantic-segmentation-on-foodseg103 | SETR-MLA (ViT-B/16) | mIoU: 45.1% |
| semantic-segmentation-on-foodseg103 | SETR-Naive (ViT-B/16) | mIoU: 41.3% |
| semantic-segmentation-on-pascal-context | SETR-MLA (16, 80k, MS) | mIoU: 55.83% |
| semantic-segmentation-on-urbanlf | SETR (ViT-Large) | mIoU (Real): 77.74%; mIoU (Syn): 77.69% |