| OneFormer (ConvNeXt-L, single-scale, 512x1024, Mapillary Vistas-pretrained) | 48.7 | 70.1 | 74.1 | 64.6 | 84.6 | OneFormer: One Transformer to Rule Universal Image Segmentation | |
| Panoptic-DeepLab (SWideRNet [1, 1, 4.5], Mapillary Vistas, multi-scale) | 46.8 | 69.6 | - | - | 85.3 | Scaling Wide Residual Networks for Panoptic Segmentation | - |
| OneFormer (ConvNeXt-L, single-scale) | 46.5 | 68.51 | - | - | 83.0 | OneFormer: One Transformer to Rule Universal Image Segmentation | |
| Axial-DeepLab-XL (Mapillary Vistas, multi-scale) | 44.2 | 68.5 | - | - | 84.6 | Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation | |
| Panoptic-DeepLab (SWideRNet [1, 1, 4.5], Mapillary Vistas, single-scale) | 42.8 | 68.5 | - | - | 84.6 | Scaling Wide Residual Networks for Panoptic Segmentation | - |
| OneFormer (ConvNeXt-XL, single-scale) | 46.7 | 68.4 | - | - | 83.6 | OneFormer: One Transformer to Rule Universal Image Segmentation | |
| kMaX-DeepLab (single-scale) | 44.0 | 68.4 | - | - | 83.5 | kMaX-DeepLab: k-means Mask Transformer | |
| AFF-Base (single-scale, point-based Mask2Former) | 46.2 | 67.7 | 71.5 | 62.5 | 83.0 | AutoFocusFormer: Image Segmentation off the Grid | |
| OneFormer (DiNAT-L, single-scale) | 45.6 | 67.6 | - | - | 83.1 | OneFormer: One Transformer to Rule Universal Image Segmentation | |
| EfficientPS | 43.5 | 67.5 | 70.3 | 63.2 | 82.1 | EfficientPS: Efficient Panoptic Segmentation | |
| DiNAT-L (Mask2Former) | 44.5 | 67.2 | - | - | 83.4 | Dilated Neighborhood Attention Transformer | |
| OneFormer (Swin-L, single-scale) | 45.6 | 67.2 | - | - | 83.0 | OneFormer: One Transformer to Rule Universal Image Segmentation | |
| AFF-Small (single-scale, point-based Mask2Former) | 44.2 | 66.9 | 70.8 | 61.5 | 82.2 | AutoFocusFormer: Image Segmentation off the Grid | |
| Mask2Former (Swin-L) | 43.6 | 66.6 | - | - | 82.9 | Masked-attention Mask Transformer for Universal Image Segmentation | |
| EfficientPS (Cityscapes-fine) | 39.1 | 64.9 | 67.7 | 61.0 | 90.3 | EfficientPS: Efficient Panoptic Segmentation | |
| CMT-DeepLab (MaX-S, single-scale, IN-1K) | - | 64.6 | - | - | 81.4 | CMT-DeepLab: Clustering Mask Transformers for Panoptic Segmentation | |
| Panoptic-DeepLab (X71) | 38.5 | 64.1 | - | - | 81.5 | Panoptic-DeepLab: A Simple, Strong, and Fast Baseline for Bottom-Up Panoptic Segmentation | |
| Mask2Former + Intra-Batch Supervision (ResNet-50) | - | 62.4 | 67.3 | 54.7 | - | Intra-Batch Supervision for Panoptic Segmentation on High-Resolution Images | |
| COPS (ResNet-50) | 34.1 | 62.1 | 67.2 | 55.1 | 79.3 | Combinatorial Optimization for Panoptic Segmentation: A Fully Differentiable Approach | |
| AdaptIS (ResNeXt-101) | 36.3 | 62.0 | 64.4 | 58.7 | 79.2 | AdaptIS: Adaptive Instance Selection Network | - |