
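Abstract
This paper examines the challenges of scalable and efficient multi-object modeling in semi-supervised Video Object Segmentation (VOS). Previous VOS methods typically decode features with a single positive object, so in multi-object scenarios they must match and segment each target separately, which limits the learning of multi-object representations. In addition, earlier methods were mostly designed for specific application goals and cannot flexibly adapt to different speed-accuracy trade-offs. To address these problems, this paper proposes two approaches: Associating Objects with Transformers (AOT) and Associating Objects with Scalable Transformers (AOST). For efficient multi-object modeling, AOT introduces an identification (ID) mechanism that assigns each object a unique identity, allowing the network to model the associations among all objects simultaneously and thus to track and segment them in a single forward pass. To address inflexible deployment, AOST further integrates scalable long short-term Transformers that combine scalable supervision with layer-wise ID-based attention; this enables online architecture scalability in VOS for the first time and overcomes the limited representation capacity of conventional ID embeddings. Because no existing VOS benchmark provides dense multi-object annotations, the paper also proposes a more challenging benchmark, Video Object Segmentation in the Wild (VOSW), to validate the proposed methods. We run extensive experiments on VOSW and five widely used VOS benchmarks (YouTube-VOS 2018 & 2019 Val, DAVIS-2017 Val & Test, and DAVIS-2016), evaluating multiple AOT and AOST variants. The results show that the proposed methods significantly outperform existing state-of-the-art methods on all six benchmarks while demonstrating strong efficiency and scalability. Project page: https://github.com/yoxu515/aot-benchmark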
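
The identification idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`make_id_bank`, `assign_identities`, `embed_label_map`), the random ID vectors, and the choice to give background pixels a zero vector are all assumptions made for illustration. It only shows how assigning each object a distinct identity vector lets one embedding carry every object at once, instead of one pass per object.

```python
import random


def make_id_bank(num_slots, dim, seed=0):
    """Build a toy ID bank: one vector per identity slot.

    In a trained model these vectors would be learned; here they are random.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_slots)]


def assign_identities(object_labels, num_slots, seed=0):
    """Give every object label its own distinct slot in the ID bank."""
    if len(object_labels) > num_slots:
        raise ValueError("more objects than identity slots")
    rng = random.Random(seed)
    slots = rng.sample(range(num_slots), len(object_labels))
    return dict(zip(object_labels, slots))


def embed_label_map(label_map, assignment, id_bank):
    """Turn an H x W label map into an H x W x dim identity embedding.

    Pixels whose label has no assigned identity (background, by assumption)
    receive a zero vector.
    """
    dim = len(id_bank[0])
    return [
        [
            list(id_bank[assignment[label]]) if label in assignment else [0.0] * dim
            for label in row
        ]
        for row in label_map
    ]


# Toy reference frame with background (0) and two objects (1 and 2).
label_map = [
    [0, 1, 1],
    [2, 2, 0],
]
id_bank = make_id_bank(num_slots=10, dim=4)
assignment = assign_identities([1, 2], num_slots=10)
embedding = embed_label_map(label_map, assignment, id_bank)
```

Because both objects live in the same embedding, a downstream network could in principle propagate all of them together; how AOT and AOST actually attend over this embedding is beyond what the abstract specifies.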
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| semi-supervised-video-object-segmentation-on-1 | SwinB-AOST (L'=3) | F-measure (Mean): 86.6 FPS: 12.0 J&F: 82.7 Jaccard (Mean): 78.8 |
| semi-supervised-video-object-segmentation-on-1 | SwinB-AOTv2-L | F-measure (Mean): 87.9 FPS: 1.3 J&F: 84.5 Jaccard (Mean): 81.0 |
| semi-supervised-video-object-segmentation-on-1 | R50-AOST (L'=2) | F-measure (Mean): 81.7 FPS: 24.3 J&F: 78.1 Jaccard (Mean): 74.5 |
| semi-supervised-video-object-segmentation-on-1 | R50-AOST (L'=3) | F-measure (Mean): 83.6 FPS: 17.5 J&F: 79.9 Jaccard (Mean): 76.2 |
| semi-supervised-video-object-segmentation-on-1 | SwinB-AOST (L'=3, MS) | F-measure (Mean): 88.5 FPS: 1.3 J&F: 84.7 Jaccard (Mean): 80.9 |
| semi-supervised-video-object-segmentation-on-18 | SwinB-AOTv2-L (all frames, MS) | F-Measure (Seen): 90.3 F-Measure (Unseen): 89.1 Jaccard (Seen): 85.5 Jaccard (Unseen): 81.0 Overall: 86.5 |
| semi-supervised-video-object-segmentation-on-18 | R50-AOST (L'=3) | F-Measure (Seen): 88.7 F-Measure (Unseen): 87.7 Jaccard (Seen): 83.8 Jaccard (Unseen): 79.3 Overall: 84.9 |
| semi-supervised-video-object-segmentation-on-18 | R50-AOST (L'=2) | F-Measure (Seen): 88.0 F-Measure (Unseen): 87.1 Jaccard (Seen): 83.3 Jaccard (Unseen): 78.9 Overall: 84.3 |
| semi-supervised-video-object-segmentation-on-18 | R50-AOST (L'=1) | F-Measure (Seen): 85.6 F-Measure (Unseen): 83.8 Jaccard (Seen): 81.0 Jaccard (Unseen): 754.8 Overall: 81.5 |
| semi-supervised-video-object-segmentation-on-18 | SwinB-AOTv2-L (all frames) | F-Measure (Seen): 88.9 F-Measure (Unseen): 88.0 Jaccard (Seen): 84.2 Jaccard (Unseen): 79.8 Overall: 85.2 |
| video-object-segmentation-on-youtube-vos | R50-AOTv2-L (all frames) | F-Measure (Seen): 90.2 F-Measure (Unseen): 87.3 Jaccard (Seen): 85.1 Jaccard (Unseen): 78.9 Overall: 85.4 Params(M): 15.1 Speed (FPS): 6.3 |
| video-object-segmentation-on-youtube-vos | R50-AOST (L'=2) | F-Measure (Seen): 88.5 F-Measure (Unseen): 87.2 Jaccard (Seen): 83.5 Jaccard (Unseen): 78.8 Overall: 84.5 Params(M): 13.9 Speed (FPS): 20.2 |
| video-object-segmentation-on-youtube-vos | R50-AOST (L'=3) | F-Measure (Seen): 88.8 F-Measure (Unseen): 87.9 Jaccard (Seen): 83.8 Jaccard (Unseen): 79.3 Overall: 85.0 Params(M): 15.4 Speed (FPS): 14.9 |
| video-object-segmentation-on-youtube-vos | R50-AOST (L'=1) | F-Measure (Seen): 86.1 F-Measure (Unseen): 83.5 Jaccard (Seen): 81.4 Jaccard (Unseen): 75.5 Overall: 81.6 Params(M): 12.5 Speed (FPS): 30.9 |
| video-object-segmentation-on-youtube-vos | SwinB-AOTv2-L (all frames) | F-Measure (Seen): 90.1 F-Measure (Unseen): 88.2 Jaccard (Unseen): 79.6 Overall: 85.8 Speed (FPS): 5.1 |
| video-object-segmentation-on-youtube-vos | SwinB-AOTv2-L (all frames, MS) | F-Measure (Seen): 90.7 F-Measure (Unseen): 88.9 Jaccard (Seen): 85.6 Jaccard (Unseen): 80.7 Overall: 86.5 Params(M): 65.6 Speed (FPS): 0.7 |
| visual-object-tracking-on-davis-2016 | R50-AOST (L'=3) | F-measure (Mean): 93.6 J&F: 92.1 Jaccard (Mean): 90.6 Speed (FPS): 17.5 |
| visual-object-tracking-on-davis-2016 | SwinB-AOTv2-L | F-measure (Mean): 94.1 J&F: 92.4 Jaccard (Mean): 90.6 Speed (FPS): 12.0 |
| visual-object-tracking-on-davis-2016 | SwinB-AOST (L'=3, MS) | F-measure (Mean): 94.5 J&F: 93.0 Jaccard (Mean): 91.5 Speed (FPS): 1.3 |
| visual-object-tracking-on-davis-2016 | SwinB-AOST (L'=3) | F-measure (Mean): 94.2 J&F: 92.4 Jaccard (Mean): 90.5 Speed (FPS): 12.0 |
| visual-object-tracking-on-davis-2016 | R50-AOST (L'=2) | F-measure (Mean): 93.4 J&F: 92.0 Jaccard (Mean): 90.5 Speed (FPS): 24.3 |
| visual-object-tracking-on-davis-2016 | SwinB-AOTv2-L (MS) | F-measure (Mean): 94.4 J&F: 93.0 Jaccard (Mean): 91.6 Speed (FPS): 1.3 |
| visual-object-tracking-on-davis-2016 | R50-AOST (L'=1) | F-measure (Mean): 90.9 J&F: 90.3 Jaccard (Mean): 89.6 Speed (FPS): 37.4 |
| visual-object-tracking-on-davis-2017 | SwinB-AOST (L'=3, MS) | F-measure (Mean): 89.5 J&F: 86.7 Jaccard (Mean): 83.8 Params(M): 65.6 Speed (FPS): 1.3 |
| visual-object-tracking-on-davis-2017 | R50-AOST (L'=1) | F-measure (Mean): 86.1 J&F: 83.7 Jaccard (Mean): 81.2 Params(M): 12.5 Speed (FPS): 37.4 |
| visual-object-tracking-on-davis-2017 | SwinB-AOTv2-L | F-measure (Mean): 89.4 J&F: 86.3 Jaccard (Mean): 83.1 Params(M): 65.6 Speed (FPS): 12.0 |
| visual-object-tracking-on-davis-2017 | SwinB-AOTv2-L (MS) | F-measure (Mean): 89.8 J&F: 87.0 Jaccard (Mean): 84.2 Params(M): 65.6 Speed (FPS): 1.3 |
| visual-object-tracking-on-davis-2017 | R50-AOST (L'=2) | F-measure (Mean): 88.0 J&F: 85.3 Jaccard (Mean): 82.5 Params(M): 13.9 Speed (FPS): 24.3 |
| visual-object-tracking-on-davis-2017 | R50-AOST (L'=3) | F-measure (Mean): 88.5 J&F: 85.6 Jaccard (Mean): 82.6 Params(M): 15.4 Speed (FPS): 17.5 |
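
The aggregate scores in the table follow the usual conventions for these benchmarks: on DAVIS, the combined J and F score is the mean of Jaccard (Mean) and F-measure (Mean), and on YouTube-VOS, Overall is the mean of the four seen/unseen Jaccard and F-measure scores. The sketch below checks a row against that relationship. The helper names and the 0.1 tolerance (chosen to absorb one-decimal rounding) are assumptions for illustration, not part of any official evaluation toolkit.

```python
def mean_score(components):
    """Arithmetic mean of a list of per-metric scores."""
    return sum(components) / len(components)


def is_consistent(reported, components, tol=0.1):
    """Check a reported aggregate against the mean of its components.

    Scores are percentages rounded to one decimal, so a small tolerance
    absorbs rounding; components outside [0, 100] are rejected outright.
    """
    if any(c < 0.0 or c > 100.0 for c in components):
        return False
    return abs(mean_score(components) - reported) <= tol + 1e-9


# DAVIS-style row: combined score from Jaccard (Mean) and F-measure (Mean).
davis_ok = is_consistent(82.7, [78.8, 86.6])

# YouTube-VOS-style row: Overall from seen/unseen Jaccard and F-measure.
ytvos_ok = is_consistent(85.0, [83.8, 79.3, 88.8, 87.9])

# A component above 100 cannot be a valid percentage, so the row is flagged.
flagged = not is_consistent(81.5, [81.0, 754.8, 85.6, 83.8])
```

A check like this is useful when copying leaderboard rows by hand, since a misplaced digit in any one column breaks the relationship with the aggregate.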