| SOC (Video-Swin-B) | 0.573 | 0.725 | 0.807 | 0.851 | 0.827 | 0.765 | 0.607 | 0.252 | SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation | |
| SgMg (Video-Swin-B) | 0.585 | 0.720 | 0.799 | 0.843 | 0.822 | 0.767 | 0.617 | 0.259 | Spectrum-guided Multi-granularity Referring Video Object Segmentation | |
| ReferFormer (Video-Swin-B) | 0.550 | 0.703 | 0.786 | 0.831 | 0.804 | 0.741 | 0.579 | 0.212 | Language as Queries for Referring Video Object Segmentation | |
| SOC (Video-Swin-T) | 0.504 | 0.669 | 0.747 | 0.790 | 0.756 | 0.687 | 0.535 | 0.195 | SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation | |
| ClawCraneNet | - | 0.655 | 0.644 | 0.704 | 0.677 | 0.617 | 0.489 | 0.171 | ClawCraneNet: Leveraging Object-level Relation for Text-based Video Segmentation | - |
| MTTR (w=10) | 0.461 | 0.640 | 0.720 | 0.754 | 0.712 | 0.638 | 0.485 | 0.169 | End-to-End Referring Video Object Segmentation with Multimodal Transformers | |
| MANET | 0.471 | 0.632 | 0.726 | 0.734 | 0.682 | 0.579 | 0.389 | 0.132 | Multi-Attention Network for Compressed Video Referring Object Segmentation | |
| MTTR (w=8) | 0.447 | 0.618 | 0.702 | 0.721 | 0.684 | 0.607 | 0.456 | 0.164 | End-to-End Referring Video Object Segmentation with Multimodal Transformers | |
| VLIDE | 0.469 | 0.598 | 0.714 | 0.702 | 0.663 | 0.585 | 0.428 | 0.151 | Deeply Interleaved Two-Stream Encoder for Referring Video Segmentation | - |
| Locater | 0.465 | 0.597 | 0.690 | 0.709 | 0.640 | 0.525 | 0.351 | 0.101 | Local-Global Context Aware Transformer for Language-Guided Video Segmentation | |
| CMPC-V (I3D) | 0.404 | 0.573 | 0.653 | 0.655 | 0.592 | 0.506 | 0.342 | 0.098 | Cross-Modal Progressive Comprehension for Referring Segmentation | |
| Hui et al. | 0.399 | 0.561 | 0.662 | 0.654 | 0.589 | 0.497 | 0.333 | 0.091 | Collaborative Spatial-Temporal Modeling for Language-Queried Video Actor Segmentation | - |
| mmmmtbvs | 0.419 | 0.558 | 0.673 | 0.645 | 0.597 | 0.523 | 0.375 | 0.130 | Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation | |
| AAMN | 0.396 | 0.552 | 0.617 | 0.681 | 0.629 | 0.523 | 0.296 | 0.029 | Actor and Action Modular Network for Text-based Video Segmentation | - |
| PRPE | 0.388 | 0.529 | 0.661 | 0.634 | 0.579 | 0.483 | 0.322 | 0.083 | Polar Relative Positional Encoding for Video-Language Segmentation | - |
| CMPC-V (R2D) | 0.351 | 0.515 | 0.649 | 0.590 | 0.527 | 0.434 | 0.284 | 0.068 | Cross-Modal Progressive Comprehension for Referring Segmentation | |