| ViT-Adapter-L (Mask2Former, BEiT pretrain) | 68.2 | Vision Transformer Adapter for Dense Predictions | |
| ViT-Adapter-L (UperNet, BEiT pretrain) | 67.5 | Vision Transformer Adapter for Dense Predictions | |
| CAA + CAR (ConvNeXt-Large + JPU) | 64.1 | CAR: Class-aware Regularizations for Semantic Segmentation | |
| Sequential Ensemble (SegFormer + HRNet) | 62.1 | Sequential Ensembling for Semantic Segmentation | |
| HRNetV2 + OCR + RMI (PaddleClas pretrained) | 59.6 | Segmentation Transformer: Object-Contextual Representations for Semantic Segmentation | |