Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers

Abstract
Most recent semantic segmentation methods adopt a fully-convolutional network (FCN) with an encoder-decoder architecture. The encoder progressively reduces the spatial resolution and learns more abstract/semantic visual concepts with larger receptive fields. Since context modeling is critical for segmentation, the latest efforts have been focused on increasing the receptive field, through either dilated/atrous convolutions or inserting attention modules. However, the encoder-decoder based FCN architecture remains unchanged. In this paper, we aim to provide an alternative perspective by treating semantic segmentation as a sequence-to-sequence prediction task. Specifically, we deploy a pure transformer (i.e., without convolution and resolution reduction) to encode an image as a sequence of patches. With the global context modeled in every layer of the transformer, this encoder can be combined with a simple decoder to provide a powerful segmentation model, termed SEgmentation TRansformer (SETR). Extensive experiments show that SETR achieves new state of the art on ADE20K (50.28% mIoU), Pascal Context (55.83% mIoU) and competitive results on Cityscapes. Particularly, we achieve the first position in the highly competitive ADE20K test server leaderboard on the day of submission.
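To make the sequence-to-sequence formulation concrete, below is a minimal PyTorch sketch in the spirit of the SETR-Naive variant: the image is split into non-overlapping patches, linearly projected into a token sequence, processed by a pure transformer encoder (global self-attention at every layer, with no resolution reduction of the token grid), and decoded by a simple per-patch classifier followed by bilinear upsampling. All hyperparameters here (embedding dim 256, 4 layers, 19 classes, etc.) are illustrative assumptions, not the paper's ViT-Large configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SETRNaiveSketch(nn.Module):
    """Illustrative SETR-style model: ViT encoder + naive upsampling decoder.

    Hyperparameters are placeholders, not the paper's exact settings.
    """
    def __init__(self, img_size=256, patch_size=16, in_chans=3,
                 embed_dim=256, depth=4, num_heads=8, num_classes=19):
        super().__init__()
        assert img_size % patch_size == 0
        self.grid = img_size // patch_size            # patches per side
        num_patches = self.grid ** 2

        # Patch embedding: a strided conv is equivalent to flattening
        # non-overlapping patches and applying a shared linear projection.
        self.patch_embed = nn.Conv2d(in_chans, embed_dim,
                                     kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))

        # Pure transformer encoder: global self-attention in every layer,
        # token-grid resolution stays constant (no downsampling).
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           dim_feedforward=embed_dim * 4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

        # Simple decoder: 1x1 projection to class logits, then upsample.
        self.head = nn.Conv2d(embed_dim, num_classes, kernel_size=1)

    def forward(self, x):
        B, _, H, W = x.shape
        tokens = self.patch_embed(x)                  # (B, C, H/16, W/16)
        tokens = tokens.flatten(2).transpose(1, 2)    # (B, N, C) sequence
        tokens = self.encoder(tokens + self.pos_embed)
        feat = tokens.transpose(1, 2).reshape(B, -1, self.grid, self.grid)
        logits = self.head(feat)                      # per-patch class scores
        return F.interpolate(logits, size=(H, W),
                             mode="bilinear", align_corners=False)

model = SETRNaiveSketch()
out = model(torch.randn(2, 3, 256, 256))
print(out.shape)  # torch.Size([2, 19, 256, 256])
```

The key contrast with an FCN encoder is visible in the forward pass: the token sequence keeps a fixed H/16 x W/16 spatial resolution through every encoder layer, so context is modeled globally by attention rather than by progressively enlarging convolutional receptive fields. The paper's PUP and MLA variants replace the single-step upsampling head with progressive upsampling and multi-level feature aggregation, respectively.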
Benchmarks
| Benchmark | Method | Metric |
|---|---|---|
| medical-image-segmentation-on-synapse-multi | SETR | Avg DSC: 79.60 |
| semantic-segmentation-on-ade20k | SETR-MLA (160k, MS) | Validation mIoU: 50.28% |
| semantic-segmentation-on-cityscapes | SETR-PUP++ | Mean IoU (class): 81.64% |
| semantic-segmentation-on-cityscapes-val | SETR-PUP (80k, MS) | mIoU: 82.15% |
| semantic-segmentation-on-dada-seg | SETR (PUP, Transformer-L) | mIoU: 31.8% |
| semantic-segmentation-on-dada-seg | SETR (MLA, Transformer-L) | mIoU: 30.4% |
| semantic-segmentation-on-densepass | SETR (MLA, Transformer-L) | mIoU: 35.6% |
| semantic-segmentation-on-densepass | SETR (PUP, Transformer-L) | mIoU: 35.7% |
| semantic-segmentation-on-foodseg103 | SETR-MLA (ViT-B/16) | mIoU: 45.1% |
| semantic-segmentation-on-foodseg103 | SETR-Naive (ViT-B/16) | mIoU: 41.3% |
| semantic-segmentation-on-pascal-context | SETR-MLA (16, 80k, MS) | mIoU: 55.83% |
| semantic-segmentation-on-urbanlf | SETR (ViT-Large) | mIoU (Real): 77.74%; mIoU (Syn): 77.69% |