Temporally Consistent Referring Video Object Segmentation with Hybrid Memory
Bo Miao; Mohammed Bennamoun; Yongsheng Gao; Mubarak Shah; Ajmal Mian

Abstract
Referring Video Object Segmentation (R-VOS) methods face challenges in maintaining consistent object segmentation due to temporal context variability and the presence of other visually similar objects. We propose an end-to-end R-VOS paradigm that explicitly models temporal instance consistency alongside the referring segmentation. Specifically, we introduce a novel hybrid memory that facilitates inter-frame collaboration for robust spatio-temporal matching and propagation. Features of frames with automatically generated high-quality reference masks are propagated to segment the remaining frames based on multi-granularity association to achieve temporally consistent R-VOS. Furthermore, we propose a new Mask Consistency Score (MCS) metric to evaluate the temporal consistency of video segmentation. Extensive experiments demonstrate that our approach enhances temporal consistency by a significant margin, leading to top-ranked performance on popular R-VOS benchmarks, i.e., Ref-YouTube-VOS (67.1%) and Ref-DAVIS17 (65.6%). The code is available at https://github.com/bo-miao/HTR.
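The abstract evaluates how stable a predicted object mask stays across frames. As an illustrative proxy (this is NOT the paper's Mask Consistency Score, whose exact formulation is defined in the paper), temporal consistency can be sketched as the mean IoU between masks of consecutive frames:

```python
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU between two binary masks; defined as 1.0 if both are empty."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union > 0 else 1.0

def temporal_consistency(masks: list) -> float:
    """Mean IoU between consecutive predicted masks of one object.

    Hypothetical proxy metric for illustration only -- not the MCS
    metric proposed in the paper.
    """
    if len(masks) < 2:
        return 1.0
    return float(np.mean([mask_iou(masks[i], masks[i + 1])
                          for i in range(len(masks) - 1)]))
```

A perfectly static prediction scores 1.0; a mask that appears and disappears between frames scores near 0, which is the kind of flicker a consistency metric is meant to penalize.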
Code Repositories
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| referring-expression-segmentation-on-davis | HTR | J&F (1st frame): 65.6 |
| referring-expression-segmentation-on-refer-1 | HTR (Pre-training) | J: 65.3, F: 68.9, J&F: 67.1 |
| referring-video-object-segmentation-on-mevis | HTR | J: 39.9, F: 45.5, J&F: 42.7 |
| referring-video-object-segmentation-on-refer | HTR | J: 65.3, F: 68.9, J&F: 67.1 |
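In the table, J is region similarity (Jaccard index, i.e. mask IoU) and F is contour accuracy; the headline J&F number is their arithmetic mean, as in the standard DAVIS evaluation protocol. A quick check against the Ref-YouTube-VOS row:

```python
# J and F values taken from the benchmark table above.
j, f = 65.3, 68.9
jf = round((j + f) / 2, 1)  # arithmetic mean, the reported J&F
print(jf)
```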