Hierarchical interaction network for video object segmentation from referring expressions
Philip Torr, Hengshuang Zhao, Luca Bertinetto, Yansong Tang, Zhao Yang

Abstract
In this paper, we investigate the problem of video object segmentation from referring expressions (VOSRE). Conventional methods typically perform multi-modal fusion using linguistic features and visual features extracted only from the top layer of the visual encoder, which limits their ability to represent multi-modal inputs at different levels of semantic and spatial granularity. To address this issue, we present an end-to-end hierarchical interaction network (HINet) for the VOSRE problem. Our model leverages the feature pyramid produced by the visual encoder to generate multiple levels of multi-modal features. This allows a more flexible representation of various linguistic concepts (e.g., object attributes and categories) at different levels of the multi-modal features. We further extract signals of moving objects from optical flow input and use them, through a motion gating mechanism, as complementary cues for highlighting the referent and suppressing the background. In contrast to previous methods, this strategy allows our model to make online predictions without requiring the whole video as input. Despite its simplicity, our proposed HINet improves over the previous state of the art on the DAVIS-16, DAVIS-17, and J-HMDB datasets for the VOSRE task, demonstrating its effectiveness and generality.
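
The abstract does not give the exact layers, so the following is only a minimal sketch of the two ideas it names: fusing language with every level of a visual feature pyramid, and gating the result with motion cues. All names (`fuse_level`, `motion_gate`, `hierarchical_fusion`), the element-wise product fusion, and the sigmoid gate are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse_level(visual, lang, w_lang):
    """Fuse one pyramid level with a sentence vector (assumed: element-wise product).

    visual: (C, H, W) visual features, lang: (D,) sentence embedding,
    w_lang: (C, D) hypothetical projection of the language vector to C channels.
    """
    lang_proj = w_lang @ lang                      # (C,)
    return visual * lang_proj[:, None, None]       # broadcast over H and W

def motion_gate(fused, motion, w_gate):
    """Scale fused features by a per-pixel gate in (0, 1) computed from motion features.

    fused: (C, H, W), motion: (M, H, W) flow-derived features,
    w_gate: (M,) hypothetical weights mapping motion channels to one gate value.
    """
    gate = sigmoid(np.tensordot(w_gate, motion, axes=1))  # (H, W)
    return fused * gate[None, :, :]

def hierarchical_fusion(pyramid, motion_pyramid, lang, lang_weights, gate_weights):
    """Apply language fusion and motion gating independently at every pyramid level."""
    return [
        motion_gate(fuse_level(v, lang, wl), m, wg)
        for v, m, wl, wg in zip(pyramid, motion_pyramid, lang_weights, gate_weights)
    ]

# Toy inputs: two pyramid levels for a single frame (no other frames are needed).
rng = np.random.default_rng(0)
pyramid = [rng.normal(size=(8, 16, 16)), rng.normal(size=(16, 8, 8))]
motion_pyramid = [rng.normal(size=(2, 16, 16)), rng.normal(size=(2, 8, 8))]
lang = rng.normal(size=4)
lang_weights = [rng.normal(size=(8, 4)), rng.normal(size=(16, 4))]
gate_weights = [rng.normal(size=2), rng.normal(size=2)]

outputs = hierarchical_fusion(pyramid, motion_pyramid, lang, lang_weights, gate_weights)
```

Because each frame is processed from its own features and flow, a model of this shape can run online, which matches the abstract's claim about not needing the whole video.
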
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| referring-expression-segmentation-on-a2d | RefVOS | IoU mean: 0.497 IoU overall: 0.672 Precision@0.5: 0.578 Precision@0.6: 0.534 Precision@0.7: 0.456 Precision@0.8: 0.311 Precision@0.9: 0.093 |
| referring-expression-segmentation-on-a2d | HINet | IoU mean: 0.529 IoU overall: 0.679 Precision@0.5: 0.611 Precision@0.6: 0.559 Precision@0.7: 0.486 Precision@0.8: 0.342 Precision@0.9: 0.12 |
| referring-expression-segmentation-on-davis | HINet | J&F 1st frame: 50.2 J&F Full video: 47.9 |
| referring-expression-segmentation-on-j-hmdb | RefVOS | IoU mean: 0.568 IoU overall: 0.606 Precision@0.5: 0.731 Precision@0.6: 0.62 Precision@0.7: 0.392 Precision@0.8: 0.088 Precision@0.9: 0.0 |
| referring-expression-segmentation-on-j-hmdb | HINet | IoU mean: 0.627 IoU overall: 0.652 Precision@0.5: 0.819 Precision@0.6: 0.736 Precision@0.7: 0.542 Precision@0.8: 0.168 Precision@0.9: 0.4 |
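
The IoU and Precision@K columns above can be read with the following sketch, which assumes the usual definitions for referring segmentation: mean IoU averages per-sample IoU, overall IoU divides total intersection by total union across all samples, and Precision@K is the fraction of samples whose IoU exceeds K. The function name `evaluate`, the strict `>` comparison, and the handling of empty masks are assumptions, not taken from the page.

```python
import numpy as np

def evaluate(preds, gts, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Compute mean IoU, overall IoU, and Precision@K for binary masks."""
    ious = []
    total_inter = 0
    total_union = 0
    for pred, gt in zip(preds, gts):
        pred = np.asarray(pred, dtype=bool)
        gt = np.asarray(gt, dtype=bool)
        inter = int(np.logical_and(pred, gt).sum())
        union = int(np.logical_or(pred, gt).sum())
        total_inter += inter
        total_union += union
        # Assumption: an empty prediction against an empty target counts as IoU = 1.
        ious.append(inter / union if union > 0 else 1.0)
    ious = np.array(ious)
    result = {
        "IoU mean": float(ious.mean()),
        "IoU overall": total_inter / total_union if total_union > 0 else 1.0,
    }
    for t in thresholds:
        result[f"Precision@{t}"] = float((ious > t).mean())
    return result

# Two toy 2x2 samples: a perfect match and a prediction covering 1 of 4 target pixels.
preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
gts = [[[1, 1], [0, 0]], [[1, 1], [1, 1]]]
scores = evaluate(preds, gts)
```

In this toy case the per-sample IoUs are 1.0 and 0.25, so mean IoU is 0.625 while overall IoU is 3/6 = 0.5. The two differ because overall IoU weights samples by object size.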