Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation
Wangbo Zhao, Kai Wang, Xiangxiang Chu, Fuzhao Xue, Xinchao Wang, Yang You

Abstract
Text-based video segmentation aims to segment a target object in a video given a sentence describing it. Incorporating motion information from optical flow maps with appearance and linguistic modalities is crucial yet has been largely ignored by previous work. In this paper, we design a method to fuse and align appearance, motion, and linguistic features for accurate segmentation. Specifically, we propose a multi-modal video transformer that fuses and aggregates multi-modal and temporal features across frames. Furthermore, we design a language-guided feature fusion module that progressively fuses appearance and motion features at each feature level under the guidance of linguistic features. Finally, a multi-modal alignment loss is proposed to alleviate the semantic gap between features from different modalities. Extensive experiments on A2D Sentences and J-HMDB Sentences verify the performance and generalization ability of our method against state-of-the-art methods.
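To make the fusion idea concrete, below is a minimal, hypothetical sketch of a language-guided fusion step, in which a pooled sentence embedding gates the appearance and motion streams before they are merged. The module name, gating scheme, and tensor shapes are our assumptions for illustration; they are not the paper's released implementation.

```python
# Hypothetical sketch (not the authors' code): linguistic features modulate
# how appearance and motion features are combined at one feature level.
import torch
import torch.nn as nn

class LanguageGuidedFusion(nn.Module):
    """Fuses appearance and motion features under linguistic guidance."""

    def __init__(self, vis_dim: int, lang_dim: int):
        super().__init__()
        # Project the pooled sentence feature to per-channel gates
        # for the appearance and motion streams.
        self.gate_app = nn.Sequential(nn.Linear(lang_dim, vis_dim), nn.Sigmoid())
        self.gate_mot = nn.Sequential(nn.Linear(lang_dim, vis_dim), nn.Sigmoid())
        # Merge the two gated streams back to the visual feature dimension.
        self.out = nn.Conv2d(2 * vis_dim, vis_dim, kernel_size=1)

    def forward(self, app, mot, lang):
        # app, mot: (B, C, H, W) appearance / motion feature maps
        # lang:     (B, L)      pooled sentence embedding
        g_app = self.gate_app(lang)[:, :, None, None]  # (B, C, 1, 1)
        g_mot = self.gate_mot(lang)[:, :, None, None]
        fused = torch.cat([app * g_app, mot * g_mot], dim=1)  # (B, 2C, H, W)
        return self.out(fused)  # (B, C, H, W)

if __name__ == "__main__":
    m = LanguageGuidedFusion(vis_dim=256, lang_dim=300)
    app = torch.randn(2, 256, 32, 32)
    mot = torch.randn(2, 256, 32, 32)
    lang = torch.randn(2, 300)
    print(m(app, mot, lang).shape)  # torch.Size([2, 256, 32, 32])
```

In a multi-level design like the one the abstract describes, one such module would be applied at each feature level of the visual backbones, with the fused output passed on to the next stage.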
Benchmarks
| Benchmark | Method | AP | IoU (mean) | IoU (overall) | P@0.5 | P@0.6 | P@0.7 | P@0.8 | P@0.9 |
|---|---|---|---|---|---|---|---|---|---|
| referring-expression-segmentation-on-a2d | mmmmtbvs | 0.419 | 0.558 | 0.673 | 0.645 | 0.597 | 0.523 | 0.375 | 0.13 |
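For reference, these metrics follow the standard referring-segmentation conventions. The sketch below shows how mean IoU, overall IoU, and Precision@K are conventionally computed from binary predicted and ground-truth masks; the function names are ours, not the benchmark's code, and AP additionally averages precision over a sweep of IoU thresholds (typically 0.50 to 0.95), which is omitted here.

```python
# Illustrative only: conventional referring-segmentation metrics
# computed from per-sample binary masks.
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two binary masks of the same shape."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 1.0

def evaluate(preds, gts, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    ious = np.array([iou(p, g) for p, g in zip(preds, gts)])
    # "IoU mean": average of per-sample IoUs.
    mean_iou = float(ious.mean())
    # "IoU overall": total intersection over total union across all samples.
    total_inter = sum(np.logical_and(p, g).sum() for p, g in zip(preds, gts))
    total_union = sum(np.logical_or(p, g).sum() for p, g in zip(preds, gts))
    overall_iou = total_inter / total_union
    # Precision@K: fraction of samples whose IoU exceeds threshold K.
    precision = {k: float((ious > k).mean()) for k in thresholds}
    return mean_iou, overall_iou, precision
```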