Scalable Video Object Segmentation with Identification Mechanism

Abstract

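This paper examines the challenges of achieving scalable and efficient multi-object modeling in semi-supervised Video Object Segmentation (VOS). Previous VOS methods typically decode features with a single positive object, so in multi-object scenarios they must match and segment each target separately, which limits the learning of multi-object representations. Earlier methods were also mostly designed for specific application goals and lack the flexibility to meet different speed-accuracy trade-offs. To address these problems, the paper proposes two methods: Associating Objects with Transformers (AOT) and Associating Objects with Scalable Transformers (AOST). For efficient multi-object modeling, AOT introduces an IDentification (ID) mechanism that assigns each object a unique identity, allowing the network to model the associations among all objects simultaneously and to track and segment them in a single forward pass. To address inflexible deployment, AOST further integrates scalable long short-term transformers that combine scalable supervision with layer-wise ID-based attention; this enables online architecture scalability in VOS for the first time and overcomes the representation limitations of conventional ID embeddings. Because no existing VOS benchmark provides dense multi-object annotations, the paper proposes a more challenging benchmark, Video Object Segmentation in the Wild (VOSW), to validate the proposed methods. Multiple AOT and AOST variants are evaluated in extensive experiments on VOSW and five widely used VOS benchmarks: YouTube-VOS 2018 & 2019 Val, DAVIS-2017 Val & Test, and DAVIS-2016. The results show that the proposed methods significantly outperform existing state-of-the-art methods on all six benchmarks, with strong performance, efficiency, and scalability. Project page: https://github.com/yoxu515/aot-benchmark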
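The ID mechanism described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `id_embedding`, the bank size, the embedding width, and the choice to treat the background as one more identity are all assumptions made for illustration. The sketch only shows the core idea, namely that every object in a multi-object label map is mapped to its own vector from a shared identity bank, so one embedding tensor carries all objects at once.

```python
import numpy as np


def id_embedding(label_map, id_bank, rng):
    """Embed a multi-object label map with per-object identity vectors.

    label_map: (H, W) integer array, 0 = background, 1..N = object ids.
    id_bank:   (M, C) array of M candidate identity vectors (hypothetical bank).
    Returns the (H, W, C) embedding and the bank rows assigned to each label.
    """
    num_labels = int(label_map.max()) + 1  # background + N objects
    bank_size = id_bank.shape[0]
    if num_labels > bank_size:
        raise ValueError("identity bank is too small for this label map")

    # One-hot encode every pixel's label: (H, W, num_labels).
    one_hot = np.eye(num_labels, dtype=id_bank.dtype)[label_map]

    # Assumption: each label draws a distinct bank row via a random permutation.
    assigned = rng.permutation(bank_size)[:num_labels]

    # (H, W, num_labels) @ (num_labels, C) -> (H, W, C): all objects in one pass.
    return one_hot @ id_bank[assigned], assigned


rng = np.random.default_rng(0)
bank = rng.standard_normal((10, 4))    # 10 identities, 4 channels (illustrative)
label_map = np.array([[0, 1], [2, 2]])  # two objects plus background
emb, assigned = id_embedding(label_map, bank, rng)
print(emb.shape)  # (2, 2, 4)
```

Because the embedding has a fixed channel width regardless of how many objects are present, a downstream network could consume it in a single forward pass instead of running once per object, which is the efficiency argument the abstract makes.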

Code Repositories

yoxu515/aot-benchmark (official, PyTorch; mentioned in GitHub)
z-x-yang/AOT (official, Paddle; mentioned in GitHub)

Benchmarks

Benchmark / Method / Metrics
Benchmark: semi-supervised-video-object-segmentation-on-1
Method: SwinB-AOST (L'=3)
F-measure (Mean): 86.6
FPS: 12.0
J&F: 82.7
Jaccard (Mean): 78.8
Benchmark: semi-supervised-video-object-segmentation-on-1
Method: SwinB-AOTv2-L
F-measure (Mean): 87.9
FPS: 1.3
J&F: 84.5
Jaccard (Mean): 81.0
Benchmark: semi-supervised-video-object-segmentation-on-1
Method: R50-AOST (L'=2)
F-measure (Mean): 81.7
FPS: 24.3
J&F: 78.1
Jaccard (Mean): 74.5
Benchmark: semi-supervised-video-object-segmentation-on-1
Method: R50-AOST (L'=3)
F-measure (Mean): 83.6
FPS: 17.5
J&F: 79.9
Jaccard (Mean): 76.2
Benchmark: semi-supervised-video-object-segmentation-on-1
Method: SwinB-AOST (L'=3, MS)
F-measure (Mean): 88.5
FPS: 1.3
J&F: 84.7
Jaccard (Mean): 80.9
Benchmark: semi-supervised-video-object-segmentation-on-18
Method: SwinB-AOTv2-L (all frames, MS)
F-Measure (Seen): 90.3
F-Measure (Unseen): 89.1
Jaccard (Seen): 85.5
Jaccard (Unseen): 81.0
Overall: 86.5
Benchmark: semi-supervised-video-object-segmentation-on-18
Method: R50-AOST (L'=3)
F-Measure (Seen): 88.7
F-Measure (Unseen): 87.7
Jaccard (Seen): 83.8
Jaccard (Unseen): 79.3
Overall: 84.9
Benchmark: semi-supervised-video-object-segmentation-on-18
Method: R50-AOST (L'=2)
F-Measure (Seen): 88.0
F-Measure (Unseen): 87.1
Jaccard (Seen): 83.3
Jaccard (Unseen): 78.9
Overall: 84.3
Benchmark: semi-supervised-video-object-segmentation-on-18
Method: R50-AOST (L'=1)
F-Measure (Seen): 85.6
F-Measure (Unseen): 83.8
Jaccard (Seen): 81.0
Jaccard (Unseen): 75.8
Overall: 81.5
Benchmark: semi-supervised-video-object-segmentation-on-18
Method: SwinB-AOTv2-L (all frames)
F-Measure (Seen): 88.9
F-Measure (Unseen): 88.0
Jaccard (Seen): 84.2
Jaccard (Unseen): 79.8
Overall: 85.2
Benchmark: video-object-segmentation-on-youtube-vos
Method: R50-AOTv2-L (all frames)
F-Measure (Seen): 90.2
F-Measure (Unseen): 87.3
Jaccard (Seen): 85.1
Jaccard (Unseen): 78.9
Overall: 85.4
Params(M): 15.1
Speed (FPS): 6.3
Benchmark: video-object-segmentation-on-youtube-vos
Method: R50-AOST (L'=2)
F-Measure (Seen): 88.5
F-Measure (Unseen): 87.2
Jaccard (Seen): 83.5
Jaccard (Unseen): 78.8
Overall: 84.5
Params(M): 13.9
Speed (FPS): 20.2
Benchmark: video-object-segmentation-on-youtube-vos
Method: R50-AOST (L'=3)
F-Measure (Seen): 88.8
F-Measure (Unseen): 87.9
Jaccard (Seen): 83.8
Jaccard (Unseen): 79.3
Overall: 85.0
Params(M): 15.4
Speed (FPS): 14.9
Benchmark: video-object-segmentation-on-youtube-vos
Method: R50-AOST (L'=1)
F-Measure (Seen): 86.1
F-Measure (Unseen): 83.5
Jaccard (Seen): 81.4
Jaccard (Unseen): 75.5
Overall: 81.6
Params(M): 12.5
Speed (FPS): 30.9
Benchmark: video-object-segmentation-on-youtube-vos
Method: SwinB-AOTv2-L (all frames)
F-Measure (Seen): 90.1
F-Measure (Unseen): 88.2
Jaccard (Unseen): 79.6
Overall: 85.8
Speed (FPS): 5.1
Benchmark: video-object-segmentation-on-youtube-vos
Method: SwinB-AOTv2-L (all frames, MS)
F-Measure (Seen): 90.7
F-Measure (Unseen): 88.9
Jaccard (Seen): 85.6
Jaccard (Unseen): 80.7
Overall: 86.5
Params(M): 65.6
Speed (FPS): 0.7
Benchmark: visual-object-tracking-on-davis-2016
Method: R50-AOST (L'=3)
F-measure (Mean): 93.6
J&F: 92.1
Jaccard (Mean): 90.6
Speed (FPS): 17.5
Benchmark: visual-object-tracking-on-davis-2016
Method: SwinB-AOTv2-L
F-measure (Mean): 94.1
J&F: 92.4
Jaccard (Mean): 90.6
Speed (FPS): 12.0
Benchmark: visual-object-tracking-on-davis-2016
Method: SwinB-AOST (L'=3, MS)
F-measure (Mean): 94.5
J&F: 93.0
Jaccard (Mean): 91.5
Speed (FPS): 1.3
Benchmark: visual-object-tracking-on-davis-2016
Method: SwinB-AOST (L'=3)
F-measure (Mean): 94.2
J&F: 92.4
Jaccard (Mean): 90.5
Speed (FPS): 12.0
Benchmark: visual-object-tracking-on-davis-2016
Method: R50-AOST (L'=2)
F-measure (Mean): 93.4
J&F: 92.0
Jaccard (Mean): 90.5
Speed (FPS): 24.3
Benchmark: visual-object-tracking-on-davis-2016
Method: SwinB-AOTv2-L (MS)
F-measure (Mean): 94.4
J&F: 93.0
Jaccard (Mean): 91.6
Speed (FPS): 1.3
Benchmark: visual-object-tracking-on-davis-2016
Method: R50-AOST (L'=1)
F-measure (Mean): 90.9
J&F: 90.3
Jaccard (Mean): 89.6
Speed (FPS): 37.4
Benchmark: visual-object-tracking-on-davis-2017
Method: SwinB-AOST (L'=3, MS)
F-measure (Mean): 89.5
J&F: 86.7
Jaccard (Mean): 83.8
Params(M): 65.6
Speed (FPS): 1.3
Benchmark: visual-object-tracking-on-davis-2017
Method: R50-AOST (L'=1)
F-measure (Mean): 86.1
J&F: 83.7
Jaccard (Mean): 81.2
Params(M): 12.5
Speed (FPS): 37.4
Benchmark: visual-object-tracking-on-davis-2017
Method: SwinB-AOTv2-L
F-measure (Mean): 89.4
J&F: 86.3
Jaccard (Mean): 83.1
Params(M): 65.6
Speed (FPS): 12.0
Benchmark: visual-object-tracking-on-davis-2017
Method: SwinB-AOTv2-L (MS)
F-measure (Mean): 89.8
J&F: 87.0
Jaccard (Mean): 84.2
Params(M): 65.6
Speed (FPS): 1.3
Benchmark: visual-object-tracking-on-davis-2017
Method: R50-AOST (L'=2)
F-measure (Mean): 88.0
J&F: 85.3
Jaccard (Mean): 82.5
Params(M): 13.9
Speed (FPS): 24.3
Benchmark: visual-object-tracking-on-davis-2017
Method: R50-AOST (L'=3)
F-measure (Mean): 88.5
J&F: 85.6
Jaccard (Mean): 82.6
Params(M): 15.4
Speed (FPS): 17.5
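
For readers comparing rows, the summary columns above can be recomputed from the per-metric values. The sketch below assumes the usual conventions, which the listed figures are consistent with: the DAVIS-style J&F score is the arithmetic mean of Jaccard (Mean) and F-measure (Mean), and the YouTube-VOS Overall score is the mean of the four seen/unseen Jaccard and F-measure values. The function names are hypothetical.

```python
def j_and_f(jaccard_mean: float, f_mean: float) -> float:
    """DAVIS-style J&F: mean of region similarity (J) and contour accuracy (F)."""
    return (jaccard_mean + f_mean) / 2.0


def youtube_vos_overall(j_seen: float, j_unseen: float,
                        f_seen: float, f_unseen: float) -> float:
    """YouTube-VOS Overall: mean of J and F over seen and unseen categories."""
    return (j_seen + j_unseen + f_seen + f_unseen) / 4.0


# SwinB-AOST (L'=3) on semi-supervised-video-object-segmentation-on-1
print(round(j_and_f(78.8, 86.6), 1))  # 82.7

# R50-AOST (L'=3) on semi-supervised-video-object-segmentation-on-18
print(round(youtube_vos_overall(83.8, 79.3, 88.7, 87.7), 1))  # 84.9
```

Small disagreements in the last digit for other rows are expected, since the published values are themselves rounded before averaging.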
