3 months ago

Spectrum-guided Multi-granularity Referring Video Object Segmentation

Bo Miao Mohammed Bennamoun Yongsheng Gao Ajmal Mian

Abstract

Current referring video object segmentation (R-VOS) techniques extract conditional kernels from encoded (low-resolution) vision-language features to segment the decoded high-resolution features. We discovered that this causes significant feature drift, which the segmentation kernels struggle to perceive during the forward computation. This weakens the kernels' ability to segment the decoded features. To address the drift problem, we propose a Spectrum-guided Multi-granularity (SgMg) approach, which performs direct segmentation on the encoded features and employs visual details to further optimize the masks. In addition, we propose Spectrum-guided Cross-modal Fusion (SCF) to perform intra-frame global interactions in the spectral domain for effective multimodal representation. Finally, we extend SgMg to perform multi-object R-VOS, a new paradigm that enables simultaneous segmentation of multiple referred objects in a video. This makes R-VOS not only faster but also more practical. Extensive experiments show that SgMg achieves state-of-the-art performance on four video benchmark datasets, outperforming the nearest competitor by 2.8 percentage points on Ref-YouTube-VOS. Our extended SgMg enables multi-object R-VOS and runs about 3 times faster while maintaining satisfactory performance. Code is available at https://github.com/bo-miao/SgMg.
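The abstract says SCF performs global intra-frame interactions in the spectral domain but gives no implementation details. The sketch below is not the SCF module. It is only a generic NumPy illustration of why spectral-domain operations are global: multiplying a 2D spectrum pointwise by a filter is equivalent to a circular convolution over the whole frame, so a single input location influences every output location. The function name `spectral_filter`, the Gaussian gain, and the `sigma` value are assumptions chosen for illustration.

```python
import numpy as np

def spectral_filter(feat, sigma=0.25):
    """Filter an (H, W) map by scaling its 2D spectrum (hypothetical helper)."""
    h, w = feat.shape
    fy = np.fft.fftfreq(h)[:, None]  # vertical frequencies, cycles per pixel
    fx = np.fft.fftfreq(w)[None, :]  # horizontal frequencies, cycles per pixel
    # Gaussian gain: exactly 1 at zero frequency, smaller at higher frequencies.
    gain = np.exp(-(fx ** 2 + fy ** 2) / (2 * sigma ** 2))
    spectrum = np.fft.fft2(feat)
    # Pointwise product in frequency == circular convolution in space.
    return np.fft.ifft2(spectrum * gain).real

# A single active location in an otherwise empty 8x8 "feature map".
impulse = np.zeros((8, 8))
impulse[3, 4] = 1.0

out = spectral_filter(impulse)
print("locations influenced:", np.count_nonzero(np.abs(out) > 1e-6))
print("peak location:", np.unravel_index(np.argmax(out), out.shape))
```

A spatial 3x3 convolution would spread the impulse to at most nine positions per layer, whereas one spectral product reaches the entire frame. How SCF combines language features with the visual spectrum is described in the paper and the official repository, not here.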

Code Repositories

bo-miao/sgmg (official, PyTorch)

Benchmarks

| Benchmark | Methodology | Metrics |
| --- | --- | --- |
| referring-expression-segmentation-on-a2d | SgMg (Video-Swin-B) | AP: 0.585; IoU mean: 0.720; IoU overall: 0.799; Precision@0.5: 0.843; Precision@0.6: 0.822; Precision@0.7: 0.767; Precision@0.8: 0.617; Precision@0.9: 0.259 |
| referring-expression-segmentation-on-davis | SgMg | J&F 1st frame: 63.3 |
| referring-expression-segmentation-on-j-hmdb | SgMg (Video-Swin-B) | AP: 0.450; IoU mean: 0.725; IoU overall: 0.737; Precision@0.5: 0.972; Precision@0.6: 0.917; Precision@0.7: 0.714; Precision@0.8: 0.225; Precision@0.9: 0.003 |
| referring-expression-segmentation-on-refer-1 | SgMg (Pre-training) | F: 67.4; J: 63.9; J&F: 65.7 |
| referring-video-object-segmentation-on-ref | SgMg | F: 66.0; J: 60.6; J&F: 63.3 |
| referring-video-object-segmentation-on-refer | SgMg | F: 67.4; J: 63.9; J&F: 65.7 |
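
In video object segmentation benchmarks, the combined J&F score is conventionally the mean of region similarity (J) and contour accuracy (F). The short check below, a sketch assuming that convention, confirms that the J, F, and J&F values listed above are mutually consistent. The benchmark labels are copied as they appear on this page, and the tolerance allows for the inputs having been rounded to one decimal place.

```python
def j_and_f(j, f):
    """Mean of region similarity J and contour accuracy F."""
    return (j + f) / 2

# (benchmark label as listed on this page, J, F, reported J&F)
reported = [
    ("referring-expression-segmentation-on-refer-1", 63.9, 67.4, 65.7),
    ("referring-video-object-segmentation-on-ref", 60.6, 66.0, 63.3),
    ("referring-video-object-segmentation-on-refer", 63.9, 67.4, 65.7),
]

TOLERANCE = 0.06  # slack for one-decimal rounding of J, F, and J&F

consistent = [
    abs(j_and_f(j, f) - jf) <= TOLERANCE for _, j, f, jf in reported
]
for (name, j, f, jf), ok in zip(reported, consistent):
    print(f"{name}: mean={j_and_f(j, f):.2f}, reported={jf}, consistent={ok}")
```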
