| OneFormer (InternImage-H, emb_dim=256, multi-scale, 896x896) | 60.8 | OneFormer: One Transformer to Rule Universal Image Segmentation | |
| ViT-Adapter-L (Mask2Former, BEiT pretrain) | 60.5 | Vision Transformer Adapter for Dense Predictions | |
| OneFormer (DiNAT-L, multi-scale, 896x896) | 58.6 | OneFormer: One Transformer to Rule Universal Image Segmentation | |
| ViT-Adapter-L (UperNet, BEiT pretrain) | 58.4 | Vision Transformer Adapter for Dense Predictions | |
| OneFormer (DiNAT-L, multi-scale, 640x640) | 58.4 | OneFormer: One Transformer to Rule Universal Image Segmentation | |
| OneFormer (Swin-L, multi-scale, 896x896) | 58.3 | OneFormer: One Transformer to Rule Universal Image Segmentation | |
| SeMask (SeMask Swin-L FaPN-Mask2Former) | 58.2 | SeMask: Semantically Masked Transformers for Semantic Segmentation | |
| SeMask (SeMask Swin-L MSFaPN-Mask2Former) | 58.2 | SeMask: Semantically Masked Transformers for Semantic Segmentation | |
| DiNAT-L (Mask2Former) | 58.1 | Dilated Neighborhood Attention Transformer | |
| Mask2Former (Swin-L-FaPN, multi-scale) | 57.7 | Masked-attention Mask Transformer for Universal Image Segmentation | |
| OneFormer (Swin-L, multi-scale, 640x640) | 57.7 | OneFormer: One Transformer to Rule Universal Image Segmentation | |
| SeMask (SeMask Swin-L Mask2Former) | 57.5 | SeMask: Semantically Masked Transformers for Semantic Segmentation | |