| DVIS++ (ViT-L, Online) | 88.8 | 75.3 | 57.9 | 73.7 | 67.7 | DVIS++: Improved Decoupled Framework for Universal Video Segmentation | |
| Mask2Former (Swin-L) | 84.4 | 67.0 | - | - | 60.4 | Mask2Former for Video Instance Segmentation | |
| SeqFormer (Swin-L) | 82.1 | 66.4 | 51.7 | 64.4 | 59.3 | SeqFormer: Sequential Transformer for Video Instance Segmentation | |
| InstanceFormer (Swin-L) | 78.0 | 64.2 | 50.9 | 61.6 | 56.3 | InstanceFormer: An Online Video Instance Segmentation Framework | |
| Video K-Net (Swin-Base) | 79.0 | 59.6 | 49.7 | 59.9 | 54.1 | Video K-Net: A Simple, Strong, and Unified Baseline for Video Segmentation | |
| IDOL (ResNet-50) | 74.0 | 52.9 | 47.7 | 58.7 | 49.5 | In Defense of Online Models for Video Instance Segmentation | |
| Mask2Former (ResNet-101) | 72.8 | 54.2 | - | - | 49.2 | Mask2Former for Video Instance Segmentation | |
| SeqFormer (ResNet-101) | 71.1 | 55.7 | 46.8 | 56.9 | 49.0 | SeqFormer: Sequential Transformer for Video Instance Segmentation | |
| SeqFormer (ResNet-50) | 69.8 | 51.8 | 45.5 | 54.8 | 47.4 | SeqFormer: Sequential Transformer for Video Instance Segmentation | |
| Mask2Former (ResNet-50) | 68.0 | 50.0 | - | - | 46.4 | Mask2Former for Video Instance Segmentation | |
| InstanceFormer (ResNet-50) | 68.6 | 49.6 | 42.1 | 53.5 | 45.6 | InstanceFormer: An Online Video Instance Segmentation Framework | |