
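Abstract
This paper focuses on developing a more effective method of hierarchical propagation for semi-supervised Video Object Segmentation (VOS). Based on vision transformers, the recently developed Associating Objects with Transformers (AOT) approach introduced hierarchical propagation into VOS and has shown promising results. Hierarchical propagation gradually propagates information from past frames to the current frame and transforms the current-frame features from object-agnostic to object-specific. However, as object-specific information increases, object-agnostic visual information is inevitably lost in the deep propagation layers. To solve this problem and further facilitate the learning of visual embeddings, this paper proposes Decoupling Features in Hierarchical Propagation (DeAOT). First, DeAOT decouples the hierarchical propagation of object-agnostic and object-specific embeddings by handling them in two independent branches. Second, to compensate for the additional computation introduced by dual-branch propagation, we design an efficient module for constructing hierarchical propagation, the Gated Propagation Module, which is carefully designed around single-head attention. Extensive experiments show that DeAOT significantly outperforms AOT in both accuracy and efficiency. On YouTube-VOS, DeAOT achieves 86.0% at 22.4 fps and 82.0% at 53.4 fps. Without test-time augmentation, we achieve new state-of-the-art performance on four benchmarks: YouTube-VOS (86.2%), DAVIS 2017 (86.2%), DAVIS 2016 (92.9%), and VOT 2020 (0.622). Project page: https://github.com/z-x-yang/AOT.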
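The two ideas named in the abstract, decoupled branches and gated single-head attention, can be illustrated with a small dependency-free sketch. This is not the authors' implementation: the sigmoid gate, the function names (`single_head_attention`, `gated_propagation`), the choice to let both branches use the same queries and keys, and the toy vectors are all assumptions made for illustration. The project page holds the actual Gated Propagation Module.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def single_head_attention(queries, keys, values):
    """Scaled dot-product attention with one head.

    queries: N vectors from the current frame.
    keys, values: M vectors each from past (memory) frames.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        dim = len(values[0])
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_propagation(queries, keys, values, gate_logits):
    """Attention output scaled element-wise by a gate in (0, 1).

    Assumption of this sketch: the gate is a sigmoid of given logits.
    """
    attended = single_head_attention(queries, keys, values)
    return [
        [sigmoid(g) * a for g, a in zip(gate_row, att_row)]
        for gate_row, att_row in zip(gate_logits, attended)
    ]


# Toy data (hypothetical): one query location in the current frame,
# two memory locations from past frames.
queries = [[1.0, 0.0]]
keys = [[1.0, 0.0], [1.0, 0.0]]
visual_memory = [[2.0, 4.0], [4.0, 8.0]]  # object-agnostic values
id_memory = [[1.0, 0.0], [0.0, 1.0]]      # object-specific values
gate_logits = [[0.0, 0.0]]                # sigmoid(0) = 0.5

# Decoupling: each branch propagates its own values and keeps its own output,
# so object-specific information never overwrites the visual features.
visual_out = gated_propagation(queries, keys, visual_memory, gate_logits)
id_out = gated_propagation(queries, keys, id_memory, gate_logits)
print(visual_out)  # [[1.5, 3.0]]
print(id_out)      # [[0.25, 0.25]]
```

Because the two keys are identical, the attention weights are uniform, so each branch outputs half of the average of its own memory values. The visual output is unaffected by the contents of `id_memory`, which is the property the decoupling is meant to preserve.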
Code repositories
yoxu515/aot-benchmark (PyTorch; mentioned in GitHub)
z-x-yang/AOT (official; Paddle; mentioned in GitHub)
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| semi-supervised-video-object-segmentation-on-1 | DeAOT-S | F-measure (Mean): 79.0; FPS: 49.2; J&F: 75.4; Jaccard (Mean): 71.9 |
| semi-supervised-video-object-segmentation-on-1 | DeAOT-B | F-measure (Mean): 79.9; FPS: 40.9; J&F: 76.2; Jaccard (Mean): 72.5 |
| semi-supervised-video-object-segmentation-on-1 | DeAOT-L | F-measure (Mean): 81.7; FPS: 28.5; J&F: 77.9; Jaccard (Mean): 74.1 |
| semi-supervised-video-object-segmentation-on-1 | DeAOT-T | F-measure (Mean): 77.3; FPS: 63.5; J&F: 73.7; Jaccard (Mean): 70.0 |
| semi-supervised-video-object-segmentation-on-1 | R50-DeAOT-L | F-measure (Mean): 84.5; FPS: 27.0; J&F: 80.7; Jaccard (Mean): 76.9 |
| semi-supervised-video-object-segmentation-on-1 | SwinB-DeAOT-L | F-measure (Mean): 86.7; FPS: 15.4; J&F: 82.8; Jaccard (Mean): 78.9 |
| semi-supervised-video-object-segmentation-on-15 | SwinB-DeAOT-L | EAO: 0.622; EAO (real-time): 0.559 |
| semi-supervised-video-object-segmentation-on-15 | R50-DeAOT-L | EAO: 0.613; EAO (real-time): 0.571 |
| semi-supervised-video-object-segmentation-on-15 | DeAOT-B | EAO: 0.571; EAO (real-time): 0.542 |
| semi-supervised-video-object-segmentation-on-15 | DeAOT-L | EAO: 0.591; EAO (real-time): 0.554 |
| semi-supervised-video-object-segmentation-on-15 | DeAOT-T | EAO: 0.472; EAO (real-time): 0.463 |
| semi-supervised-video-object-segmentation-on-15 | DeAOT-S | EAO: 0.593; EAO (real-time): 0.559 |
| semi-supervised-video-object-segmentation-on-18 | DeAOT-B | F-Measure (Seen): 88.3; F-Measure (Unseen): 87.5; FPS: 30.4; Jaccard (Seen): 83.5; Jaccard (Unseen): 79.1; Overall: 84.6 |
| semi-supervised-video-object-segmentation-on-18 | DeAOT-L | F-Measure (Seen): 88.8; F-Measure (Unseen): 87.2; FPS: 24.7; Jaccard (Seen): 83.8; Jaccard (Unseen): 79.0; Overall: 84.7 |
| semi-supervised-video-object-segmentation-on-18 | R50-DeAOT-L | F-Measure (Seen): 89.4; F-Measure (Unseen): 88.9; FPS: 22.4; Jaccard (Seen): 84.6; Jaccard (Unseen): 80.8; Overall: 85.9 |
| semi-supervised-video-object-segmentation-on-18 | SwinB-DeAOT-L | F-Measure (Seen): 90.2; F-Measure (Unseen): 88.6; FPS: 11.9; Jaccard (Seen): 85.3; Jaccard (Unseen): 80.4; Overall: 86.1 |
| semi-supervised-video-object-segmentation-on-18 | DeAOT-S | F-Measure (Seen): 87.5; F-Measure (Unseen): 86.8; FPS: 38.7; Jaccard (Seen): 82.8; Jaccard (Unseen): 78.1; Overall: 83.8 |
| semi-supervised-video-object-segmentation-on-18 | DeAOT-T | F-Measure (Seen): 85.6; F-Measure (Unseen): 84.7; FPS: 53.4; Jaccard (Seen): 81.2; Jaccard (Unseen): 76.4; Overall: 82.0 |
| semi-supervised-video-object-segmentation-on-21 | DeAOT | F: 63.8; J: 55.1; J&F: 59.4 |
| video-object-segmentation-on-youtube-vos | R50-DeAOT-L | F-Measure (Seen): 89.9; F-Measure (Unseen): 88.7; Jaccard (Seen): 84.9; Jaccard (Unseen): 80.4; Overall: 86.0; Params (M): 19.8; Speed (FPS): 22.4 |
| video-object-segmentation-on-youtube-vos | DeAOT-L | F-Measure (Seen): 89.4; Jaccard (Seen): 84.2; Jaccard (Unseen): 78.6; Overall: 84.8; Speed (FPS): 24.7 |
| video-object-segmentation-on-youtube-vos | SwinB-DeAOT-L | F-Measure (Seen): 90.6; F-Measure (Unseen): 88.4; Jaccard (Seen): 85.6; Jaccard (Unseen): 80.0; Overall: 86.2; Params (M): 70.3; Speed (FPS): 11.9 |
| video-object-segmentation-on-youtube-vos | DeAOT-S | F-Measure (Seen): 88.3; F-Measure (Unseen): 86.6; Jaccard (Seen): 83.3; Jaccard (Unseen): 77.9; Overall: 84.0; Params (M): 10.2; Speed (FPS): 38.7 |
| video-object-segmentation-on-youtube-vos | DeAOT-B | F-Measure (Seen): 88.9; F-Measure (Unseen): 87.0; Jaccard (Seen): 83.9; Jaccard (Unseen): 78.5; Overall: 84.6; Params (M): 13.2; Speed (FPS): 30.4 |
| video-object-segmentation-on-youtube-vos | DeAOT-T | F-Measure (Seen): 86.3; F-Measure (Unseen): 84.2; Jaccard (Seen): 81.6; Jaccard (Unseen): 75.8; Overall: 82.0; Params (M): 7.2; Speed (FPS): 53.4 |
| visual-object-tracking-on-davis-2016 | DeAOT-B | F-measure (Mean): 92.5; J&F: 91.0; Jaccard (Mean): 89.4; Speed (FPS): 40.9 |
| visual-object-tracking-on-davis-2016 | DeAOT-L | F-measure (Mean): 93.7; J&F: 92.0; Jaccard (Mean): 90.3; Speed (FPS): 28.5 |
| visual-object-tracking-on-davis-2016 | SwinB-DeAOT-L | F-measure (Mean): 94.7; J&F: 92.9; Jaccard (Mean): 91.1; Speed (FPS): 15.4 |
| visual-object-tracking-on-davis-2016 | DeAOT-T | F-measure (Mean): 89.9; J&F: 88.9; Jaccard (Mean): 87.8; Speed (FPS): 63.5 |
| visual-object-tracking-on-davis-2016 | R50-DeAOT-L | F-measure (Mean): 94.0; J&F: 92.3; Jaccard (Mean): 90.5; Speed (FPS): 27.0 |
| visual-object-tracking-on-davis-2016 | DeAOT-S | F-measure (Mean): 90.9; J&F: 89.3; Jaccard (Mean): 87.6; Speed (FPS): 49.2 |
| visual-object-tracking-on-davis-2017 | DeAOT-S | F-measure (Mean): 83.8; J&F: 80.8; Jaccard (Mean): 77.8; Params (M): 10.2; Speed (FPS): 49.2 |
| visual-object-tracking-on-davis-2017 | DeAOT-L | F-measure (Mean): 87.1; J&F: 84.1; Jaccard (Mean): 81.0; Params (M): 13.2; Speed (FPS): 28.5 |
| visual-object-tracking-on-davis-2017 | SwinB-DeAOT-L | F-measure (Mean): 89.2; J&F: 86.2; Jaccard (Mean): 83.1; Params (M): 70.3; Speed (FPS): 15.4 |
| visual-object-tracking-on-davis-2017 | DeAOT-T | F-measure (Mean): 83.3; J&F: 80.5; Jaccard (Mean): 77.7; Params (M): 7.2; Speed (FPS): 63.5 |
| visual-object-tracking-on-davis-2017 | R50-DeAOT-L | F-measure (Mean): 88.2; J&F: 85.2; Jaccard (Mean): 82.2; Params (M): 19.8; Speed (FPS): 27.0 |
| visual-object-tracking-on-davis-2017 | DeAOT-B | F-measure (Mean): 85.1; J&F: 82.2; Jaccard (Mean): 79.2; Params (M): 13.2; Speed (FPS): 40.9 |
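
The aggregate columns in the table are averages of the per-metric columns: on DAVIS, the combined J&F score is the mean of the Jaccard (J) and F-measure (F) means, and the YouTube-VOS Overall score is the mean of the four seen and unseen J and F values. The sketch below recomputes two table entries. The function names are hypothetical, and a recomputed value can differ from a reported one by 0.1, since reported figures are presumably averaged before rounding.

```python
def j_and_f(jaccard_mean, f_mean):
    """DAVIS-style combined score: mean of region (J) and boundary (F) scores."""
    return (jaccard_mean + f_mean) / 2.0


def youtube_vos_overall(j_seen, j_unseen, f_seen, f_unseen):
    """YouTube-VOS overall score: mean of J and F over seen and unseen classes."""
    return (j_seen + j_unseen + f_seen + f_unseen) / 4.0


# DeAOT-S on DAVIS 2017 (values taken from the table above).
davis_2017_deaot_s = j_and_f(jaccard_mean=77.8, f_mean=83.8)
print(f"{davis_2017_deaot_s:.1f}")  # 80.8

# R50-DeAOT-L on YouTube-VOS (values taken from the table above).
ytvos_r50_deaot_l = youtube_vos_overall(
    j_seen=84.9, j_unseen=80.4, f_seen=89.9, f_unseen=88.7
)
print(f"{ytvos_r50_deaot_l:.1f}")  # 86.0
```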