| HyperSeg (Swin-B) | - | 61.2 | HyperSeg: Towards Universal Visual Segmentation with Large Language Model | |
| OneFormer (InternImage-H, single-scale) | 52.0 | 60.0 | OneFormer: One Transformer to Rule Universal Image Segmentation | |
| OpenSeeD (SwinL, single-scale) | 53.2 | 59.5 | A Simple Framework for Open-Vocabulary Segmentation and Detection | |
| Mask DINO (SwinL, single-scale) | 50.9 | 59.4 | Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation | |
| DiNAT-L (single-scale, Mask2Former) | 49.2 | 58.5 | Dilated Neighborhood Attention Transformer | |
| ViT-Adapter-L (single-scale, BEiTv2 pretrain, Mask2Former) | 48.9 | 58.4 | Vision Transformer Adapter for Dense Predictions | |
| Visual Attention Network (VAN-B6 + Mask2Former) | - | 58.2 | Visual Attention Network | |
| kMaX-DeepLab (single-scale, pseudo-labels) | - | 58.1 | kMaX-DeepLab: k-means Mask Transformer | |
| HIPIE (ViT-H, single-scale) | - | 58.1 | Hierarchical Open-vocabulary Universal Image Segmentation | |
| kMaX-DeepLab (single-scale, drop query with 256 queries) | - | 58.0 | kMaX-DeepLab: k-means Mask Transformer | |
| OneFormer (DiNAT-L, single-scale) | 49.2 | 58.0 | OneFormer: One Transformer to Rule Universal Image Segmentation | |
| FocalNet-L (Mask2Former (200 queries)) | 48.4 | 57.9 | Focal Modulation Networks | |
| kMaX-DeepLab (single-scale) | - | 57.9 | kMaX-DeepLab: k-means Mask Transformer | |
| OneFormer (Swin-L, single-scale) | 49.0 | 57.9 | OneFormer: One Transformer to Rule Universal Image Segmentation | |
| Mask2Former (single-scale) | 48.6 | 57.8 | Masked-attention Mask Transformer for Universal Image Segmentation | |
| Panoptic SegFormer (single-scale) | - | 55.8 | Panoptic SegFormer: Delving Deeper into Panoptic Segmentation with Transformers | |
| CMT-DeepLab (single-scale) | - | 55.3 | CMT-DeepLab: Clustering Mask Transformers for Panoptic Segmentation | |
| MaskFormer (single-scale) | - | 52.7 | Per-Pixel Classification is Not All You Need for Semantic Segmentation | |
| MaX-DeepLab-L (single-scale) | - | 51.1 | MaX-DeepLab: End-to-End Panoptic Segmentation with Mask Transformers | |
| Panoptic SegFormer (ResNet-101) | - | 50.6 | Panoptic SegFormer: Delving Deeper into Panoptic Segmentation with Transformers | |