| Model | | Params (M) | mIoU | Paper | |
| --- | --- | --- | --- | --- | --- |
| M3I Pre-training (InternImage-H) | - | 1310 | 62.9 | Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information | |
| ViT-Adapter-L (Mask2Former, BEiTv2 pretrain) | - | 571 | 61.5 | Vision Transformer Adapter for Dense Predictions | |
| RevCol-H (Mask2Former) | - | 2439 | 61.0 | Reversible Column Networks | |
| ViT-Adapter-L (Mask2Former, BEiT pretrain) | - | 571 | 60.5 | Vision Transformer Adapter for Dense Predictions | |
| DINOv2 (ViT-g/14 frozen model, w/ ViT-Adapter + Mask2former) | - | 1080 | 60.2 | DINOv2: Learning Robust Visual Features without Supervision | |
| FocalNet-L (Mask2Former) | - | - | 58.5 | Focal Modulation Networks | |
| ViT-Adapter-L (UperNet, BEiT pretrain) | - | 451 | 58.4 | Vision Transformer Adapter for Dense Predictions | |
| SeMask (SeMask Swin-L MSFaPN-Mask2Former) | - | - | 58.2 | SeMask: Semantically Masked Transformers for Semantic Segmentation | |
| SeMask (SeMask Swin-L FaPN-Mask2Former) | - | - | 58.2 | SeMask: Semantically Masked Transformers for Semantic Segmentation | |
| DiNAT-L (Mask2Former) | - | - | 58.1 | Dilated Neighborhood Attention Transformer | |
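The score column above is mean Intersection-over-Union (mIoU), the standard semantic-segmentation metric: per-class IoU averaged over all classes. As a minimal sketch (not any of the above papers' evaluation code), it can be computed from flat label arrays like so:

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean Intersection-over-Union over classes present in pred or target.

    pred, target: flat integer label arrays of equal length.
    Classes absent from both arrays are skipped so they don't distort the mean.
    """
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

# Toy example with 3 classes over 6 pixels:
pred   = np.array([0, 0, 1, 1, 2, 2])
target = np.array([0, 1, 1, 1, 2, 0])
print(mean_iou(pred, target, 3))  # → 0.5
```

Benchmark implementations differ in details (e.g. ignoring a background/void label, accumulating a confusion matrix over the whole dataset before dividing), so treat this as illustrative only.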