Decoupling Features in Hierarchical Propagation for Video Object Segmentation
Zongxin Yang; Yi Yang

Abstract
This paper focuses on developing a more effective method of hierarchical propagation for semi-supervised Video Object Segmentation (VOS). Based on vision transformers, the recently developed Associating Objects with Transformers (AOT) approach introduces hierarchical propagation into VOS and has shown promising results. Hierarchical propagation gradually propagates information from past frames to the current frame and transfers the current-frame feature from object-agnostic to object-specific. However, the increase in object-specific information inevitably leads to the loss of object-agnostic visual information in deep propagation layers. To solve this problem and further facilitate the learning of visual embeddings, this paper proposes a Decoupling Features in Hierarchical Propagation (DeAOT) approach. First, DeAOT decouples the hierarchical propagation of object-agnostic and object-specific embeddings by handling them in two independent branches. Second, to compensate for the additional computation from dual-branch propagation, we propose an efficient module for constructing hierarchical propagation, i.e., the Gated Propagation Module, which is carefully designed with single-head attention. Extensive experiments show that DeAOT significantly outperforms AOT in both accuracy and efficiency. On YouTube-VOS, DeAOT achieves 86.0% at 22.4 fps and 82.0% at 53.4 fps. Without test-time augmentations, we achieve new state-of-the-art performance on four benchmarks, i.e., YouTube-VOS (86.2%), DAVIS 2017 (86.2%), DAVIS 2016 (92.9%), and VOT 2020 (0.622 EAO). Project page: https://github.com/z-x-yang/AOT.
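The two ideas named in the abstract, dual-branch propagation and gated single-head attention, can be illustrated with a small pure-Python sketch. This is not the authors' implementation: the function names, the sigmoid gate, and the choice to reuse one attention map for both branches are assumptions made for illustration, and the learned projections of a real model are omitted.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def attention_weights(query, key):
    """Single-head scaled dot-product attention map (one row per query)."""
    scale = 1.0 / math.sqrt(len(query[0]))
    key_t = [list(col) for col in zip(*key)]
    scores = matmul(query, key_t)
    return [softmax([s * scale for s in row]) for row in scores]


def gate(features, gate_logits):
    """Element-wise sigmoid gate (the gating function is an assumption)."""
    return [[f / (1.0 + math.exp(-g)) for f, g in zip(f_row, g_row)]
            for f_row, g_row in zip(features, gate_logits)]


def dual_branch_propagate(cur_visual, mem_visual, mem_id, visual_gate, id_gate):
    """Propagate memory into the current frame along two separate branches.

    The visual (object-agnostic) and ID (object-specific) embeddings are kept
    in separate tensors; reusing one attention map for both is an assumption.
    """
    weights = attention_weights(cur_visual, mem_visual)
    new_visual = gate(matmul(weights, mem_visual), visual_gate)
    new_id = gate(matmul(weights, mem_id), id_gate)
    return new_visual, new_id


# Toy example: one current-frame location attending to two memory locations.
visual, ident = dual_branch_propagate(
    cur_visual=[[1.0, 0.0]],
    mem_visual=[[0.0, 0.0], [0.0, 0.0]],
    mem_id=[[2.0], [4.0]],
    visual_gate=[[0.0, 0.0]],
    id_gate=[[0.0]],
)
print(visual, ident)
```

Keeping the two embeddings in separate variables means the object-specific values never overwrite the object-agnostic ones, which is the decoupling the abstract describes; a single head keeps the attention map to one matrix per layer.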
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| semi-supervised-video-object-segmentation-on-1 | DeAOT-S | F-measure (Mean): 79.0 FPS: 49.2 J&F: 75.4 Jaccard (Mean): 71.9 |
| semi-supervised-video-object-segmentation-on-1 | DeAOT-B | F-measure (Mean): 79.9 FPS: 40.9 J&F: 76.2 Jaccard (Mean): 72.5 |
| semi-supervised-video-object-segmentation-on-1 | DeAOT-L | F-measure (Mean): 81.7 FPS: 28.5 J&F: 77.9 Jaccard (Mean): 74.1 |
| semi-supervised-video-object-segmentation-on-1 | DeAOT-T | F-measure (Mean): 77.3 FPS: 63.5 J&F: 73.7 Jaccard (Mean): 70.0 |
| semi-supervised-video-object-segmentation-on-1 | R50-DeAOT-L | F-measure (Mean): 84.5 FPS: 27.0 J&F: 80.7 Jaccard (Mean): 76.9 |
| semi-supervised-video-object-segmentation-on-1 | SwinB-DeAOT-L | F-measure (Mean): 86.7 FPS: 15.4 J&F: 82.8 Jaccard (Mean): 78.9 |
| semi-supervised-video-object-segmentation-on-15 | SwinB-DeAOT-L | EAO: 0.622 EAO (real-time): 0.559 |
| semi-supervised-video-object-segmentation-on-15 | R50-DeAOT-L | EAO: 0.613 EAO (real-time): 0.571 |
| semi-supervised-video-object-segmentation-on-15 | DeAOT-B | EAO: 0.571 EAO (real-time): 0.542 |
| semi-supervised-video-object-segmentation-on-15 | DeAOT-L | EAO: 0.591 EAO (real-time): 0.554 |
| semi-supervised-video-object-segmentation-on-15 | DeAOT-T | EAO: 0.472 EAO (real-time): 0.463 |
| semi-supervised-video-object-segmentation-on-15 | DeAOT-S | EAO: 0.593 EAO (real-time): 0.559 |
| semi-supervised-video-object-segmentation-on-18 | DeAOT-B | F-Measure (Seen): 88.3 F-Measure (Unseen): 87.5 FPS: 30.4 Jaccard (Seen): 83.5 Jaccard (Unseen): 79.1 Overall: 84.6 |
| semi-supervised-video-object-segmentation-on-18 | DeAOT-L | F-Measure (Seen): 88.8 F-Measure (Unseen): 87.2 FPS: 24.7 Jaccard (Seen): 83.8 Jaccard (Unseen): 79.0 Overall: 84.7 |
| semi-supervised-video-object-segmentation-on-18 | R50-DeAOT-L | F-Measure (Seen): 89.4 F-Measure (Unseen): 88.9 FPS: 22.4 Jaccard (Seen): 84.6 Jaccard (Unseen): 80.8 Overall: 85.9 |
| semi-supervised-video-object-segmentation-on-18 | SwinB-DeAOT-L | F-Measure (Seen): 90.2 F-Measure (Unseen): 88.6 FPS: 11.9 Jaccard (Seen): 85.3 Jaccard (Unseen): 80.4 Overall: 86.1 |
| semi-supervised-video-object-segmentation-on-18 | DeAOT-S | F-Measure (Seen): 87.5 F-Measure (Unseen): 86.8 FPS: 38.7 Jaccard (Seen): 82.8 Jaccard (Unseen): 78.1 Overall: 83.8 |
| semi-supervised-video-object-segmentation-on-18 | DeAOT-T | F-Measure (Seen): 85.6 F-Measure (Unseen): 84.7 FPS: 53.4 Jaccard (Seen): 81.2 Jaccard (Unseen): 76.4 Overall: 82.0 |
| semi-supervised-video-object-segmentation-on-21 | DeAOT | F: 63.8 J: 55.1 J&F: 59.4 |
| video-object-segmentation-on-youtube-vos | R50-DeAOT-L | F-Measure (Seen): 89.9 F-Measure (Unseen): 88.7 Jaccard (Seen): 84.9 Jaccard (Unseen): 80.4 Overall: 86.0 Params(M): 19.8 Speed (FPS): 22.4 |
| video-object-segmentation-on-youtube-vos | DeAOT-L | F-Measure (Seen): 89.4 Jaccard (Seen): 84.2 Jaccard (Unseen): 78.6 Overall: 84.8 Speed (FPS): 24.7 |
| video-object-segmentation-on-youtube-vos | SwinB-DeAOT-L | F-Measure (Seen): 90.6 F-Measure (Unseen): 88.4 Jaccard (Seen): 85.6 Jaccard (Unseen): 80.0 Overall: 86.2 Params(M): 70.3 Speed (FPS): 11.9 |
| video-object-segmentation-on-youtube-vos | DeAOT-S | F-Measure (Seen): 88.3 F-Measure (Unseen): 86.6 Jaccard (Seen): 83.3 Jaccard (Unseen): 77.9 Overall: 84.0 Params(M): 10.2 Speed (FPS): 38.7 |
| video-object-segmentation-on-youtube-vos | DeAOT-B | F-Measure (Seen): 88.9 F-Measure (Unseen): 87.0 Jaccard (Seen): 83.9 Jaccard (Unseen): 78.5 Overall: 84.6 Params(M): 13.2 Speed (FPS): 30.4 |
| video-object-segmentation-on-youtube-vos | DeAOT-T | F-Measure (Seen): 86.3 F-Measure (Unseen): 84.2 Jaccard (Seen): 81.6 Jaccard (Unseen): 75.8 Overall: 82.0 Params(M): 7.2 Speed (FPS): 53.4 |
| visual-object-tracking-on-davis-2016 | DeAOT-B | F-measure (Mean): 92.5 J&F: 91.0 Jaccard (Mean): 89.4 Speed (FPS): 40.9 |
| visual-object-tracking-on-davis-2016 | DeAOT-L | F-measure (Mean): 93.7 J&F: 92.0 Jaccard (Mean): 90.3 Speed (FPS): 28.5 |
| visual-object-tracking-on-davis-2016 | SwinB-DeAOT-L | F-measure (Mean): 94.7 J&F: 92.9 Jaccard (Mean): 91.1 Speed (FPS): 15.4 |
| visual-object-tracking-on-davis-2016 | DeAOT-T | F-measure (Mean): 89.9 J&F: 88.9 Jaccard (Mean): 87.8 Speed (FPS): 63.5 |
| visual-object-tracking-on-davis-2016 | R50-DeAOT-L | F-measure (Mean): 94.0 J&F: 92.3 Jaccard (Mean): 90.5 Speed (FPS): 27.0 |
| visual-object-tracking-on-davis-2016 | DeAOT-S | F-measure (Mean): 90.9 J&F: 89.3 Jaccard (Mean): 87.6 Speed (FPS): 49.2 |
| visual-object-tracking-on-davis-2017 | DeAOT-S | F-measure (Mean): 83.8 J&F: 80.8 Jaccard (Mean): 77.8 Params(M): 10.2 Speed (FPS): 49.2 |
| visual-object-tracking-on-davis-2017 | DeAOT-L | F-measure (Mean): 87.1 J&F: 84.1 Jaccard (Mean): 81.0 Params(M): 13.2 Speed (FPS): 28.5 |
| visual-object-tracking-on-davis-2017 | SwinB-DeAOT-L | F-measure (Mean): 89.2 J&F: 86.2 Jaccard (Mean): 83.1 Params(M): 70.3 Speed (FPS): 15.4 |
| visual-object-tracking-on-davis-2017 | DeAOT-T | F-measure (Mean): 83.3 J&F: 80.5 Jaccard (Mean): 77.7 Params(M): 7.2 Speed (FPS): 63.5 |
| visual-object-tracking-on-davis-2017 | R50-DeAOT-L | F-measure (Mean): 88.2 J&F: 85.2 Jaccard (Mean): 82.2 Params(M): 19.8 Speed (FPS): 27.0 |
| visual-object-tracking-on-davis-2017 | DeAOT-B | F-measure (Mean): 85.1 J&F: 82.2 Jaccard (Mean): 79.2 Params(M): 13.2 Speed (FPS): 40.9 |
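
The combined scores in the table follow from the per-metric columns. On DAVIS, the combined J and F score is the mean of the Jaccard and F-measure means, and the YouTube-VOS overall score is the mean of the four seen/unseen values. The helper names below are hypothetical and the inputs are copied from the rows above. Recomputed values can differ from the listed ones in the last digit, because the listed scores are presumably averaged before rounding.

```python
def mean_jf(jaccard, f_measure):
    """Combined DAVIS score: arithmetic mean of region similarity J and contour accuracy F."""
    return (jaccard + f_measure) / 2.0


def youtube_vos_overall(j_seen, j_unseen, f_seen, f_unseen):
    """YouTube-VOS overall score: mean of J and F over seen and unseen categories."""
    return (j_seen + j_unseen + f_seen + f_unseen) / 4.0


# Values copied from the DeAOT-B (DAVIS 2017) and R50-DeAOT-L (YouTube-VOS) rows.
print(round(mean_jf(79.2, 85.1), 2))
print(round(youtube_vos_overall(84.9, 80.4, 89.9, 88.7), 2))
```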