Instance Segmentation on COCO

Evaluation metrics

AP50
AP75
APL
APM
APS
mask AP
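The metrics above are the standard COCO mask-evaluation suite: mask AP averages precision over IoU thresholds 0.50 to 0.95 (step 0.05), AP50 and AP75 fix the IoU threshold at 0.50 and 0.75, and APS/APM/APL restrict evaluation to small, medium, and large objects by pixel area. As a minimal illustration (not the official `pycocotools` evaluator), the core IoU computation that decides whether a predicted mask counts as a true positive at a given threshold can be sketched as:

```python
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection-over-union of two binary masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union else 0.0

# Toy 6x6 masks: the prediction overlaps the ground truth only partially.
gt = np.zeros((6, 6), dtype=bool)
gt[1:5, 1:5] = True      # ground-truth mask, 16 px
pred = np.zeros((6, 6), dtype=bool)
pred[2:6, 2:6] = True    # predicted mask, 16 px

iou = mask_iou(pred, gt)  # intersection 9 px, union 23 px -> ~0.391
# At the AP50 threshold this prediction is NOT a true positive (0.391 < 0.50),
# and it also fails the stricter AP75 threshold.
# APS / APM / APL bucket objects by ground-truth area:
# small < 32^2 px, medium 32^2-96^2 px, large > 96^2 px.
```

The official evaluation uses `pycocotools.cocoeval.COCOeval` with `iouType="segm"` on run-length-encoded masks; the sketch above only shows the matching criterion behind the per-threshold numbers.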

Evaluation results

Performance of each model on this benchmark

| Model | AP50 | AP75 | APL | APM | APS | mask AP | Paper Title |
|---|---|---|---|---|---|---|---|
| Co-DETR | 80.2 | 63.4 | 72.0 | 60.1 | 41.6 | 57.1 | DETRs with Collaborative Hybrid Assignments Training |
| CBNetV2 (EVA02, single-scale) | 80.3 | 62.1 | 70.9 | 59.3 | 39.7 | 56.1 | CBNet: A Composite Backbone Network Architecture for Object Detection |
| EVA | 80.0 | - | 72.4 | 58.0 | 36.3 | 55.5 | EVA: Exploring the Limits of Masked Visual Representation Learning at Scale |
| FD-SwinV2-G | - | - | - | - | - | 55.4 | Contrastive Learning Rivals Masked Image Modeling in Fine-tuning via Feature Distillation |
| Mask Frozen-DETR | 79.3 | 61.4 | 70.4 | 58.4 | 37.8 | 55.3 | Mask Frozen-DETR: High Quality Instance Segmentation with One GPU |
| BEiT-3 | - | - | - | - | - | 54.8 | Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks |
| Mask DINO (SwinL, multi-scale) | - | - | - | - | - | 54.7 | Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation |
| GLEE-Pro | - | - | - | - | - | 54.5 | General Object Foundation Model for Images and Videos at Scale |
| ViT-Adapter-L (HTC++, BEiTv2, O365, multi-scale) | - | - | - | - | - | 54.5 | Vision Transformer Adapter for Dense Predictions |
| SwinV2-G (HTC++) | - | - | - | - | - | 54.4 | Swin Transformer V2: Scaling Up Capacity and Resolution |
| GLEE-Plus | - | - | - | - | - | 53.3 | General Object Foundation Model for Images and Videos at Scale |
| ViT-Adapter-L (HTC++, BEiTv2 pretrain, multi-scale) | - | - | - | - | - | 53.0 | Vision Transformer Adapter for Dense Predictions |
| Soft Teacher + Swin-L (HTC++, multi-scale) | - | - | - | - | - | 53.0 | End-to-End Semi-Supervised Object Detection with Soft Teacher |
| Mask DINO (SwinL, single-scale) | - | - | - | - | - | 52.8 | Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation |
| ViT-Adapter-L (HTC++, BEiT pretrain, multi-scale) | - | - | - | - | - | 52.5 | Vision Transformer Adapter for Dense Predictions |
| CBNetV2 (Dual-Swin-L HTC, multi-scale) | - | - | - | - | - | 52.3 | CBNet: A Composite Backbone Network Architecture for Object Detection |
| UNINEXT-H | 76.2 | 56.7 | 67.5 | 55.9 | 33.3 | 51.8 | Universal Instance Perception as Object Discovery and Retrieval |
| CBNetV2 (Dual-Swin-L HTC, single-scale) | - | - | - | - | - | 51.6 | CBNet: A Composite Backbone Network Architecture for Object Detection |
| Focal-L (HTC++, multi-scale) | 75.4 | 56.5 | 64.2 | - | 35.6 | 51.3 | Focal Self-attention for Local-Global Interactions in Vision Transformers |
| Swin-L (HTC++, multi-scale) | - | - | - | - | - | 51.1 | Swin Transformer: Hierarchical Vision Transformer using Shifted Windows |