| Cutie+ (base, MEGA) | 90.6 | 90.5 | - | 86.3 | 82.7 | 87.5 | Putting the Object Back into Video Object Segmentation | |
| SwinB-AOTv2-L (all frames, MS) | 90.3 | 89.1 | - | 85.5 | 81.0 | 86.5 | Scalable Video Object Segmentation with Identification Mechanism | |
| DEVA | 89.9 | 89.1 | 25.3 | 85.4 | 89.9 | 86.2 | Tracking Anything in High Quality | |
| SwinB-AOTv2-L (all frames) | 88.9 | 88.0 | - | 84.2 | 79.8 | 85.2 | Scalable Video Object Segmentation with Identification Mechanism | |
| STCN + TrickVOS (PT) | 86.4 | 85.5 | - | 82.1 | 77.2 | - | TrickVOS: A Bag of Tricks for Video Object Segmentation | |
| R50-AOST (L'=1) | 85.6 | 83.8 | - | 81.0 | 754.8 | 81.5 | Scalable Video Object Segmentation with Identification Mechanism | |