| ViT-Adapter-L (HTC++, BEiTv2, O365, multi-scale) | - | - | - | 54.2 | Vision Transformer Adapter for Dense Predictions | |
| ViTDet, ViT-H Cascade (multi-scale) | - | - | - | 53.1 | Exploring Plain Vision Transformer Backbones for Object Detection | |
| ViT-Adapter-L (HTC++, BEiTv2 pretrain, multi-scale) | - | - | - | 52.5 | Vision Transformer Adapter for Dense Predictions | |
| Soft Teacher + Swin-L (HTC++, multi-scale) | - | - | - | 52.5 | End-to-End Semi-Supervised Object Detection with Soft Teacher | |
| ViT-Adapter-L (HTC++, BEiT pretrain, multi-scale) | - | - | - | 52.2 | Vision Transformer Adapter for Dense Predictions | |
| Soft Teacher + Swin-L (HTC++, single-scale) | - | - | - | 51.9 | End-to-End Semi-Supervised Object Detection with Soft Teacher | |
| CBNetV2 (Dual-Swin-L HTC, multi-scale) | - | - | - | 51.8 | CBNet: A Composite Backbone Network Architecture for Object Detection | |
| Frozen Backbone, SwinV2-G-ext22K (HTC) | - | - | - | 51.6 | Could Giant Pretrained Image Models Extract Universal Representations? | - |
| CBNetV2 (Dual-Swin-L HTC, multi-scale) | - | - | - | 51 | CBNet: A Composite Backbone Network Architecture for Object Detection | |