
Language as Queries for Referring Video Object Segmentation

Jiannan Wu Yi Jiang Peize Sun Zehuan Yuan Ping Luo


Abstract

Referring video object segmentation (R-VOS) is an emerging cross-modal task that aims to segment the object referred to by a language expression in all frames of a video. In this work, we propose a simple and unified Transformer-based framework, termed ReferFormer. It views the language as queries and directly attends to the most relevant regions in the video frames. Concretely, we introduce a small set of object queries conditioned on the language expression as the input to the Transformer, so that all queries are obligated to find only the referred object. The queries are eventually transformed into dynamic kernels that capture crucial object-level information and act as convolution filters to generate segmentation masks from the feature maps. Object tracking is achieved naturally by linking the corresponding queries across frames. This mechanism greatly simplifies the pipeline, and the resulting end-to-end framework differs significantly from previous methods. Extensive experiments on Ref-Youtube-VOS, Ref-DAVIS17, A2D-Sentences, and JHMDB-Sentences demonstrate the effectiveness of ReferFormer. On Ref-Youtube-VOS, ReferFormer achieves 55.6 J&F with a ResNet-50 backbone without bells and whistles, exceeding the previous state-of-the-art performance by 8.4 points. Furthermore, with the strong Swin-Large backbone, ReferFormer achieves the best J&F of 64.2 among all existing methods. It also achieves impressive results of 55.0 mAP and 43.7 mAP on A2D-Sentences and JHMDB-Sentences respectively, significantly outperforming previous methods by a large margin. Code is publicly available at https://github.com/wjn922/ReferFormer.
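The dynamic-kernel idea in the abstract can be sketched in a few lines: each language-conditioned object query is projected to the weights of a 1x1 convolution, which is then applied to the frame's mask feature map to produce that query's segmentation logits. The sketch below is a minimal illustration under assumed shapes and a hypothetical linear projection, not the authors' exact implementation.

```python
import numpy as np

def dynamic_mask_head(queries, proj_w, proj_b, features):
    """Turn object queries into per-query 1x1 conv kernels and apply them.

    queries:  (N, d)    object query embeddings for one frame
    proj_w:   (d, C+1)  projection producing C kernel weights + 1 bias
    proj_b:   (C+1,)
    features: (C, H, W) mask feature map for the frame
    returns   (N, H, W) per-query mask logits
    """
    params = queries @ proj_w + proj_b          # (N, C+1)
    weight = params[:, :-1]                     # per-query 1x1 conv kernels
    bias = params[:, -1]                        # per-query biases
    # A 1x1 convolution is a channel-wise dot product at every pixel.
    return np.einsum("nc,chw->nhw", weight, features) + bias[:, None, None]

rng = np.random.default_rng(0)
queries = rng.standard_normal((5, 256))         # 5 object queries, 256-dim
proj_w = rng.standard_normal((256, 9))          # C = 8 feature channels
proj_b = rng.standard_normal(9)
features = rng.standard_normal((8, 32, 32))
masks = dynamic_mask_head(queries, proj_w, proj_b, features)
print(masks.shape)  # (5, 32, 32)
```

Tracking then falls out for free: because the same query slot attends to the same object in every frame, linking masks across time only requires reading each slot's output per frame.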

Code Repositories

wjn922/referformer (official, PyTorch)

Benchmarks

Benchmark | Methodology | Metrics

referring-expression-segmentation-on-a2d | ReferFormer (Video-Swin-B)
  AP: 0.550
  IoU mean: 0.703
  IoU overall: 0.786
  Precision@0.5: 0.831
  Precision@0.6: 0.804
  Precision@0.7: 0.741
  Precision@0.8: 0.579
  Precision@0.9: 0.212

referring-expression-segmentation-on-davis | ReferFormer
  J&F 1st frame: 61.1

referring-expression-segmentation-on-refer-1 | ReferFormer (ResNet-50)
  F: 56.6
  J: 54.8
  J&F: 55.6

referring-expression-segmentation-on-refer-1 | ReferFormer (ResNet-101)
  F: 58.4
  J: 56.1
  J&F: 57.3

referring-video-object-segmentation-on-mevis | ReferFormer
  F: 32.2
  J: 29.8
  J&F: 31.0

referring-video-object-segmentation-on-ref | ReferFormer
  F: 64.1
  J: 58.1
  J&F: 61.1

referring-video-object-segmentation-on-refer | ReferFormer (Large)
  F: 64.6
  J: 61.3
  J&F: 62.9

referring-video-object-segmentation-on-revos | ReferFormer (Video-Swin-B)
  F: 29.9
  J: 26.2
  J&F: 28.1
  R: 8.8
