| M3I Pre-training (InternImage-H) | - | - | - | - | - | 65.0 | Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information | |
| Focal-Stable-DINO (Focal-Huge, no TTA) | 81.5 | 71.4 | 78.5 | 68.5 | 50.4 | 64.6 | A Strong and Reproducible Object Detector with Only Public Datasets | |
| FocalNet-H (DINO) | - | - | - | - | - | 64.2 | Focal Modulation Networks | |
| CP-DETR-L Swin-L (fine-tuned separately on COCO) | - | - | - | - | - | 64.1 | CP-DETR: Concept Prompt Guide DETR Toward Stronger Universal Object Detection | - |
| RevCol-H (DINO) | - | - | - | - | - | 63.8 | Reversible Column Networks | |
| ViTDet, ViT-H Cascade (multiscale) | - | - | - | - | - | 61.3 | Exploring Plain Vision Transformer Backbones for Object Detection | |
| GLIP (Swin-L, multi-scale) | - | - | - | - | - | 60.8 | Grounded Language-Image Pre-training | |
| Soft Teacher + Swin-L (HTC++, multi-scale) | - | - | - | - | - | 60.7 | End-to-End Semi-Supervised Object Detection with Soft Teacher | |