| Model | mIoU | Paper | Code |
| --- | --- | --- | --- |
| ViT-Adapter-L (Mask2Former, BEiT pretrain) | 85.2% | Vision Transformer Adapter for Dense Predictions | |
| Euclidean Frank-Wolfe CRFs (backbone: DeepLabv3+, coarse) | 83.6% | Regularized Frank-Wolfe for Dense CRFs: Generalizing Mean Field and Beyond | |
| ResNeSt200 (Mapillary) | 83.3% | ResNeSt: Split-Attention Networks | |
| HANet (Height-driven Attention Networks by LGE A&B, coarse) | 83.2% | Cars Can't Fly up in the Sky: Improving Urban-Scene Segmentation via Height-driven Attention Networks | |
| kMaX-DeepLab (ConvNeXt-L, fine only) | 83.2% | kMaX-DeepLab: k-means Mask Transformer | |