| Co-DETR | 80.2 | 63.4 | 72.0 | 60.1 | 41.6 | 57.1 | DETRs with Collaborative Hybrid Assignments Training | |
| CBNetV2 (EVA02, single-scale) | 80.3 | 62.1 | 70.9 | 59.3 | 39.7 | 56.1 | CBNet: A Composite Backbone Network Architecture for Object Detection | |
| Mask Frozen-DETR | 79.3 | 61.4 | 70.4 | 58.4 | 37.8 | 55.3 | Mask Frozen-DETR: High Quality Instance Segmentation with One GPU | - |
| ViT-Adapter-L (HTC++, BEiTv2, O365, multi-scale) | - | - | - | - | - | 54.5 | Vision Transformer Adapter for Dense Predictions | |
| ViT-Adapter-L (HTC++, BEiTv2 pretrain, multi-scale) | - | - | - | - | - | 53.0 | Vision Transformer Adapter for Dense Predictions | |
| Soft Teacher + Swin-L (HTC++, multi-scale) | - | - | - | - | - | 53.0 | End-to-End Semi-Supervised Object Detection with Soft Teacher | |
| ViT-Adapter-L (HTC++, BEiT pretrain, multi-scale) | - | - | - | - | - | 52.5 | Vision Transformer Adapter for Dense Predictions | |
| CBNetV2 (Dual-Swin-L HTC, multi-scale) | - | - | - | - | - | 52.3 | CBNet: A Composite Backbone Network Architecture for Object Detection | |
| CBNetV2 (Dual-Swin-L HTC, single-scale) | - | - | - | - | - | 51.6 | CBNet: A Composite Backbone Network Architecture for Object Detection | |
| Focal-L (HTC++, multi-scale) | 75.4 | 56.5 | 64.2 | - | 35.6 | 51.3 | Focal Self-attention for Local-Global Interactions in Vision Transformers | |
| Swin-L (HTC++, multi-scale) | - | - | - | - | - | 51.1 | Swin Transformer: Hierarchical Vision Transformer using Shifted Windows | |