URVOS: Unified Referring Video Object Segmentation Network with a Large-Scale Benchmark
Joon-Young Lee, Seonguk Seo, Bohyung Han

Abstract
We propose a unified referring video object segmentation network (URVOS). URVOS takes a video and a referring expression as inputs and estimates the object masks referred to by the given language expression across all video frames. Our algorithm addresses this challenging problem by performing language-based object segmentation and mask propagation jointly using a single deep neural network with a proper combination of two attention models. In addition, we construct the first large-scale referring video object segmentation dataset, called Refer-Youtube-VOS. We evaluate our model on two benchmark datasets, including ours, and demonstrate the effectiveness of the proposed approach. The dataset is released at https://github.com/skynbe/Refer-Youtube-VOS.
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| referring-expression-segmentation-on-davis | URVOS + Refer-Youtube-VOS + ft. DAVIS | J&F 1st frame: 51.63 |
| referring-expression-segmentation-on-davis | URVOS + Refer-Youtube-VOS | J&F 1st frame: 46.85 |
| referring-expression-segmentation-on-davis | URVOS | J&F 1st frame: 44.1 |
| referring-expression-segmentation-on-refer-1 | URVOS | F: 50.8, J: 47.0, J&F: 48.9 |
| referring-video-object-segmentation-on-mevis | URVOS | F: 29.9, J: 25.7, J&F: 27.8 |
| referring-video-object-segmentation-on-ref | URVOS | F: 56.0, J: 47.3, J&F: 51.6 |
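
The metrics in the table follow the usual video object segmentation convention: J is region similarity (the Jaccard index, or intersection over union, between predicted and ground-truth masks), F is contour accuracy, and J&F is the mean of the two. The sketch below illustrates J for a single frame and the J&F average. The function names are hypothetical, the masks are toy data, and returning 1.0 for two empty masks is an assumed convention; the contour-accuracy computation for F is not shown, and this is not the benchmark's official evaluation code.

```python
def region_similarity(pred, gt):
    """J: intersection over union of two binary masks (lists of rows of 0/1)."""
    pairs = [(bool(p), bool(g))
             for pred_row, gt_row in zip(pred, gt)
             for p, g in zip(pred_row, gt_row)]
    intersection = sum(1 for p, g in pairs if p and g)
    union = sum(1 for p, g in pairs if p or g)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union


def j_and_f(j, f):
    """J&F: arithmetic mean of region similarity and contour accuracy."""
    return (j + f) / 2


# Toy 2x3 masks for one frame (illustrative only).
pred_mask = [[1, 1, 0],
             [0, 1, 0]]
gt_mask = [[1, 0, 0],
           [0, 1, 1]]

j_frame = region_similarity(pred_mask, gt_mask)  # 2 shared pixels / 4 in union

# Averaging the J and F values from the refer-1 row of the table.
mean_score = round(j_and_f(47.0, 50.8), 1)
```

With the J and F values from the refer-1 row, the average reproduces the reported J&F of 48.9.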