| GLIP-L (Swin-L) | 48.0 | 24.89 | Grounded Language-Image Pre-training | |
| ConvNeXt-XL (Cascade Mask R-CNN) | 37.5 | 12.68 | A ConvNet for the 2020s | |
| ViT-Adapter (BEiTv2-L) | 34.25 | 7.79 | Vision Transformer Adapter for Dense Predictions | |
| Det-AdvProp (EfficientNet-B5) | 30.8 | 7.34 | Robust and Accurate Object Detection via Adversarial Learning | |
| CenterNet2 (R2-101-DCN) | 29.5 | 4.29 | Probabilistic two-stage detection | |