3 months ago

HTML: Hybrid Temporal-scale Multimodal Learning Framework for Referring Video Object Segmentation

Yu Qiao, Xiaojun Chang, Lina Yao, Zhihui Li, Yali Wang, Mingfei Han

Abstract

Referring Video Object Segmentation (RVOS) aims to segment an object instance from a given video according to a textual description of that object. However, in the open world, object descriptions are often diverse in content and flexible in length. This leads to the key difficulty in RVOS: descriptions of different objects correspond to different temporal scales in the video, which most existing approaches ignore by sampling frames with a single stride. To tackle this problem, we propose a concise Hybrid Temporal-scale Multimodal Learning (HTML) framework, which effectively aligns linguistic and visual features to discover core object semantics in the video by learning multimodal interaction hierarchically across different temporal scales. More specifically, we introduce a novel inter-scale multimodal perception module, in which the language queries dynamically interact with visual features across temporal scales. It effectively reduces complex object confusion by passing video context among different scales. Finally, we conduct extensive experiments on widely used benchmarks, including Ref-Youtube-VOS, Ref-DAVIS17, A2D-Sentences and JHMDB-Sentences, where HTML achieves state-of-the-art performance on all of these datasets.
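The abstract contrasts single-stride frame sampling with learning from several temporal scales. As a rough illustration of what sampling at multiple strides could look like, the sketch below draws one clip per stride around a common centre frame. This is not the authors' implementation: the function name `sample_multi_stride`, its parameters, and the clamping behaviour at the video boundaries are all assumptions made for this example.

```python
from typing import Dict, List, Sequence


def sample_multi_stride(
    num_frames: int, center: int, window: int, strides: Sequence[int]
) -> Dict[int, List[int]]:
    """Hypothetical helper: return one clip of frame indices per stride.

    Each clip has `window` frames centred on `center`; a larger stride
    covers a longer temporal span with the same number of frames.
    """
    half = window // 2
    clips: Dict[int, List[int]] = {}
    for stride in strides:
        raw = [center + (k - half) * stride for k in range(window)]
        # Clamp indices so that clips near the video boundary stay valid.
        clips[stride] = [min(max(i, 0), num_frames - 1) for i in raw]
    return clips


# Example: three temporal scales around frame 50 of a 100-frame video.
clips = sample_multi_stride(num_frames=100, center=50, window=5, strides=(1, 2, 4))
for stride, indices in clips.items():
    print(stride, indices)
```

With a single stride only one of these clips would be available; a multi-scale model would, under this reading, consume all of them and exchange context between scales.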

Benchmarks

| Benchmark | Methodology | F | J | J&F |
| --- | --- | --- | --- | --- |
| referring-video-object-segmentation-on-ref | HTML | 65.1 | 59.2 | 62.1 |
| referring-video-object-segmentation-on-refer | HTML-Video-SwinT | 63.0 | 59.5 | 61.2 |
| referring-video-object-segmentation-on-refer | HTML-SwinL | 65.3 | 61.5 | 63.4 |
| referring-video-object-segmentation-on-refer | HTML-Video-SwinB | 65.2 | 61.5 | 63.4 |
| referring-video-object-segmentation-on-refer | HTML-ResNet101 | 59.8 | 57.3 | 58.5 |
| referring-video-object-segmentation-on-refer | HTML-ResNet50 | 59.0 | 56.5 | 57.8 |
| referring-video-object-segmentation-on-refer | HTML-Video-SwinS | 62.9 | 59.9 | 61.4 |
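
The page does not define the metrics. Assuming the usual DAVIS-style convention, in which J is region similarity, F is contour accuracy, and J&F is their arithmetic mean, the reported figures can be sanity-checked with a few lines of Python. The variable names are illustrative, and a small tolerance is used because the published values are rounded to one decimal place.

```python
# (method, F, J, reported J&F) copied from the benchmark listing above.
rows = [
    ("HTML", 65.1, 59.2, 62.1),
    ("HTML-Video-SwinT", 63.0, 59.5, 61.2),
    ("HTML-SwinL", 65.3, 61.5, 63.4),
    ("HTML-Video-SwinB", 65.2, 61.5, 63.4),
    ("HTML-ResNet101", 59.8, 57.3, 58.5),
    ("HTML-ResNet50", 59.0, 56.5, 57.8),
    ("HTML-Video-SwinS", 62.9, 59.9, 61.4),
]


def j_and_f(j: float, f: float) -> float:
    """Mean of region similarity J and contour accuracy F (assumed definition)."""
    return (j + f) / 2


# Absolute gap between the recomputed mean and the reported J&F, per method.
gaps = {name: abs(j_and_f(j, f) - reported) for name, f, j, reported in rows}
for name, gap in gaps.items():
    print(f"{name}: gap = {gap:.2f}")
```

Every gap stays within rounding distance of the reported value, which is consistent with the assumed definition but does not by itself confirm how the numbers were produced.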
