| OneFormer (InternImage-H, emb_dim=256, single-scale, 896x896) | 40.2 | 54.5 | 60.4 | OneFormer: One Transformer to Rule Universal Image Segmentation | |
| OpenSeed (Swin-L, single-scale, 1280x1280) | - | 53.7 | - | A Simple Framework for Open-Vocabulary Segmentation and Detection | |
| OneFormer (DiNAT-L, single-scale, 1280x1280, COCO-Pretrain) | - | 53.4 | 58.9 | OneFormer: One Transformer to Rule Universal Image Segmentation | |
| X-Decoder (DaViT-d5, Deform, single-scale, 1280x1280) | 38.7 | 52.4 | 59.1 | Generalized Decoding for Pixel, Image, and Language | |
| OneFormer (DiNAT-L, single-scale, 1280x1280) | 37.1 | 51.5 | 58.3 | OneFormer: One Transformer to Rule Universal Image Segmentation | |
| OneFormer (Swin-L, single-scale, 1280x1280) | 37.8 | 51.4 | 57.0 | OneFormer: One Transformer to Rule Universal Image Segmentation | |
| kMaX-DeepLab (ConvNeXt-L, single-scale, 1281x1281) | - | 50.9 | 55.2 | kMaX-DeepLab: k-means Mask Transformer | |
| OneFormer (DiNAT-L, single-scale, 640x640) | 36.0 | 50.5 | 58.3 | OneFormer: One Transformer to Rule Universal Image Segmentation | |
| OneFormer (ConvNeXt-XL, single-scale, 640x640) | 36.3 | 50.1 | 57.4 | OneFormer: One Transformer to Rule Universal Image Segmentation | |
| OneFormer (ConvNeXt-L, single-scale, 640x640) | 36.2 | 50.0 | 56.6 | OneFormer: One Transformer to Rule Universal Image Segmentation | |
| OneFormer (Swin-L, single-scale, 640x640) | 35.9 | 49.8 | 57.0 | OneFormer: One Transformer to Rule Universal Image Segmentation | |
| X-Decoder (L) | 35.8 | 49.6 | 58.1 | Generalized Decoding for Pixel, Image, and Language | |
| DiNAT-L (Mask2Former, 640x640) | 35.0 | 49.4 | 56.3 | Dilated Neighborhood Attention Transformer | |
| kMaX-DeepLab (ConvNeXt-L, single-scale, 641x641) | - | 48.7 | 54.8 | kMaX-DeepLab: k-means Mask Transformer | |
| Mask2Former (Swin-L) | 34.2 | 48.1 | 54.5 | Masked-attention Mask Transformer for Universal Image Segmentation | |
| Mask2Former (Swin-L + FAPN, 640x640) | 33.2 | 46.2 | 55.4 | Masked-attention Mask Transformer for Universal Image Segmentation | |
| kMaX-DeepLab (ResNet50, single-scale, 1281x1281) | - | 42.3 | 45.3 | kMaX-DeepLab: k-means Mask Transformer | |
| kMaX-DeepLab (ResNet50, single-scale, 641x641) | - | 41.5 | 45.0 | kMaX-DeepLab: k-means Mask Transformer | |
| Mask2Former (ResNet-50, 640x640) | - | 39.7 | - | Masked-attention Mask Transformer for Universal Image Segmentation | |
| Panoptic-DeepLab (SwideRNet) | - | 37.9 | 50.0 | Masked-attention Mask Transformer for Universal Image Segmentation | |