| Cutie+ (base, MEGA) | 91.0 | 90.1 | 86.6 | 82.2 | 87.5 | - | - | Putting the Object Back into Video Object Segmentation | |
| SwinB-AOTv2-L (all frames, MS) | 90.7 | 88.9 | 85.6 | 80.7 | 86.5 | 65.6 | - | Scalable Video Object Segmentation with Identification Mechanism | |
| R50-AOTv2-L (all frames) | 90.2 | 87.3 | 85.1 | 78.9 | 85.4 | 15.1 | - | Scalable Video Object Segmentation with Identification Mechanism | |
| SwinB-AOTv2-L (all frames) | 90.1 | 88.2 | - | 79.6 | 85.8 | - | - | Scalable Video Object Segmentation with Identification Mechanism | |
| SwinB-AOT-L (all frames) | 90.1 | 86.9 | 85.1 | 78.4 | 85.1 | 65.4 | 5.2 | Associating Objects with Transformers for Video Object Segmentation | |
| R50-AOT-L (all frames) | 89.5 | 88.2 | 84.5 | 79.6 | 85.5 | 14.9 | 6.4 | Associating Objects with Transformers for Video Object Segmentation | |
| SwinB-AOT-L | 89.3 | 86.4 | 84.3 | 77.9 | 84.5 | 65.4 | 9.3 | Associating Objects with Transformers for Video Object Segmentation | |
| R50-AOST (L'=3) | 88.8 | 87.9 | 83.8 | 79.3 | 85.0 | 15.4 | - | Scalable Video Object Segmentation with Identification Mechanism | |
| AOT-L (all frames) | 88.8 | 87.1 | 83.7 | 78.4 | 84.5 | 8.3 | 6.5 | Associating Objects with Transformers for Video Object Segmentation | |
| R50-AOST (L'=2) | 88.5 | 87.2 | 83.5 | 78.8 | 84.5 | 13.9 | - | Scalable Video Object Segmentation with Identification Mechanism | |
| AOT-B (all frames) | 88.5 | 86.5 | 83.6 | 78.0 | 84.1 | 8.3 | 20.5 | Associating Objects with Transformers for Video Object Segmentation | |