| OneFormer (ConvNeXt-XL, Mapillary, multi-scale) | 85.8 | OneFormer: One Transformer to Rule Universal Image Segmentation | |
| SeMask (SeMask Swin-L Mask2Former) | 84.98 | SeMask: Semantically Masked Transformers for Semantic Segmentation | |
| Sequential Ensemble (MiT-B5 + HRNet) | 84.8 | Sequential Ensembling for Semantic Segmentation | - |
| OneFormer (ConvNeXt-XL, multi-scale) | 84.6 | OneFormer: One Transformer to Rule Universal Image Segmentation | |
| DiNAT-L (Mask2Former) | 84.5 | Dilated Neighborhood Attention Transformer | |
| OneFormer (Swin-L, multi-scale) | 84.4 | OneFormer: One Transformer to Rule Universal Image Segmentation | |
| VOLO-D4 (MS, ImageNet1k pretrain) | 84.3 | VOLO: Vision Outlooker for Visual Recognition | |
| DDP (ConvNeXt-L, step-3) | 83.9 | DDP: Diffusion Model for Dense Visual Prediction | |