Associating Objects with Transformers for Video Object Segmentation

Zongxin Yang; Yunchao Wei; Yi Yang

Abstract

This paper investigates how to achieve better and more efficient embedding learning for semi-supervised video object segmentation in challenging multi-object scenarios. State-of-the-art methods learn to decode features with a single positive object, so in multi-object scenarios they must match and segment each target separately, consuming several times the computing resources. To solve this problem, we propose an Associating Objects with Transformers (AOT) approach that matches and decodes multiple objects uniformly. In detail, AOT employs an identification mechanism to associate multiple targets in the same high-dimensional embedding space. As a result, the matching and segmentation decoding of multiple objects can be processed simultaneously, as efficiently as for a single object. To model multi-object association sufficiently, a Long Short-Term Transformer is designed to construct hierarchical matching and propagation. We conduct extensive experiments on both multi-object and single-object benchmarks to examine AOT variant networks of different complexities. In particular, our R50-AOT-L outperforms all state-of-the-art competitors on three popular benchmarks, i.e., YouTube-VOS (84.1% J&F), DAVIS 2017 (84.9%), and DAVIS 2016 (91.1%), while running more than $3\times$ faster in multi-object scenarios. Meanwhile, our AOT-T maintains real-time multi-object speed on the above benchmarks. Based on AOT, we ranked 1st in the 3rd Large-scale VOS Challenge.
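
To make the identification idea concrete, here is a minimal sketch; it is not the authors' implementation. It assumes that each object in a frame is assigned a distinct vector from a small identity bank, and that a multi-object label mask is turned into a single per-pixel embedding map by lookup. The function names, the bank size, the embedding dimension, the zero vector for background, and the random (rather than learned) bank values are all illustrative assumptions.

```python
import random


def build_identity_bank(num_identities, dim, seed=0):
    """Hypothetical identity bank: one vector per slot (random here for illustration)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_identities)]


def embed_multi_object_mask(label_mask, identity_bank, slot_of_object):
    """Map an H x W label mask (0 = background, k = object k) to an H x W x dim map.

    Every object is written into the same embedding map, so a later matching
    step can process all objects in one pass.
    """
    dim = len(identity_bank[0])
    embedding = []
    for row in label_mask:
        out_row = []
        for label in row:
            if label == 0:
                # Assumption: background pixels get a zero vector in this sketch.
                out_row.append([0.0] * dim)
            else:
                out_row.append(list(identity_bank[slot_of_object[label]]))
        embedding.append(out_row)
    return embedding


# Toy frame with two objects (labels 1 and 2) on a 2 x 3 grid.
label_mask = [
    [0, 1, 1],
    [2, 2, 0],
]
bank = build_identity_bank(num_identities=10, dim=4)

# Assign each object a distinct, randomly chosen identity slot.
slots = random.Random(1).sample(range(len(bank)), 2)
slot_of_object = {1: slots[0], 2: slots[1]}

id_embedding = embed_multi_object_mask(label_mask, bank, slot_of_object)
```

Because every object lands in the same embedding map, a matching step that consumes `id_embedding` would run once per frame rather than once per object, which is the efficiency argument the abstract makes.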

Code Repositories

yoxu515/aot-benchmark (PyTorch; mentioned in GitHub)
z-x-yang/AOT (Official; Paddle; mentioned in GitHub)

Benchmarks

Benchmark | Methodology | Metrics
semi-supervised-video-object-segmentation-on-1 | SwinB-AOT-L | F-measure (Mean): 85.1; FPS: 12.1; J&F: 81.2; Jaccard (Mean): 77.3
semi-supervised-video-object-segmentation-on-1 | AOT-L | F-measure (Mean): 82.3; FPS: 18.7; J&F: 78.3; Jaccard (Mean): 74.3
semi-supervised-video-object-segmentation-on-1 | AOT-T | F-measure (Mean): 75.7; FPS: 51.4; J&F: 72.0; Jaccard (Mean): 68.3
semi-supervised-video-object-segmentation-on-1 | AOT-S | F-measure (Mean): 77.5; FPS: 40.0; J&F: 73.9; Jaccard (Mean): 70.3
semi-supervised-video-object-segmentation-on-1 | AOT-B | F-measure (Mean): 79.3; FPS: 29.6; J&F: 75.5; Jaccard (Mean): 71.6
semi-supervised-video-object-segmentation-on-1 | R50-AOT-L | F-measure (Mean): 83.3; FPS: 18.0; J&F: 79.6; Jaccard (Mean): 75.9
semi-supervised-video-object-segmentation-on-15 | SwinB-AOT-L | EAO: 0.586; EAO (real-time): 0.523
semi-supervised-video-object-segmentation-on-15 | AOT-S | EAO: 0.512; EAO (real-time): 0.499
semi-supervised-video-object-segmentation-on-15 | AOT-B | EAO: 0.541; EAO (real-time): 0.533
semi-supervised-video-object-segmentation-on-15 | AOT-L | EAO: 0.574; EAO (real-time): 0.560
semi-supervised-video-object-segmentation-on-15 | R50-AOT-L | EAO: 0.569; EAO (real-time): 0.540
semi-supervised-video-object-segmentation-on-15 | AOT-T | EAO: 0.435; EAO (real-time): 0.433
semi-supervised-video-object-segmentation-on-20 | AOT-S | D17 val (F): 82.0; D17 val (G): 79.2; D17 val (J): 76.4; FPS: 40.0
semi-supervised-video-object-segmentation-on-21 | AOT | F: 61.3; J: 53.1; J&F: 57.2
video-object-segmentation-on-davis-2017-test-1 | AOT | F-measure: 83.3; Jaccard: 75.9; Mean Jaccard & F-Measure: 79.6
video-object-segmentation-on-youtube-vos | AOT-T (all frames) | F-Measure (Seen): 84.7; F-Measure (Unseen): 83.5; Jaccard (Seen): 80.0; Jaccard (Unseen): 75.2; Overall: 80.9; Params (M): 5.3; Speed (FPS): 41.0
video-object-segmentation-on-youtube-vos | R50-AOT-L (all frames) | F-Measure (Seen): 89.5; F-Measure (Unseen): 88.2; Jaccard (Seen): 84.5; Jaccard (Unseen): 79.6; Overall: 85.5; Params (M): 14.9; Speed (FPS): 6.4
video-object-segmentation-on-youtube-vos | AOT-B (all frames) | F-Measure (Seen): 88.5; F-Measure (Unseen): 86.5; Jaccard (Seen): 83.6; Jaccard (Unseen): 78.0; Overall: 84.1; Params (M): 8.3; Speed (FPS): 20.5
video-object-segmentation-on-youtube-vos | AOT-B | F-Measure (Seen): 87.5; F-Measure (Unseen): 86.0; Jaccard (Seen): 82.6; Jaccard (Unseen): 77.7; Overall: 83.5; Params (M): 8.3; Speed (FPS): 20.5
video-object-segmentation-on-youtube-vos | AOT-S (all frames) | F-Measure (Seen): 87.0; F-Measure (Unseen): 85.7; Jaccard (Seen): 82.2; Jaccard (Unseen): 77.3; Overall: 83.0; Params (M): 7.9; Speed (FPS): 27.1
video-object-segmentation-on-youtube-vos | AOT-S | F-Measure (Seen): 86.7; F-Measure (Unseen): 85.0; Jaccard (Seen): 82.0; Jaccard (Unseen): 76.6; Overall: 82.6; Params (M): 7.9; Speed (FPS): 27.1
video-object-segmentation-on-youtube-vos | R50-AOT-L | F-Measure (Seen): 88.5; F-Measure (Unseen): 86.1; Jaccard (Seen): 83.7; Jaccard (Unseen): 78.1; Overall: 84.1; Params (M): 14.9; Speed (FPS): 14.9
video-object-segmentation-on-youtube-vos | SwinB-AOT-L | F-Measure (Seen): 89.3; F-Measure (Unseen): 86.4; Jaccard (Seen): 84.3; Jaccard (Unseen): 77.9; Overall: 84.5; Params (M): 65.4; Speed (FPS): 9.3
video-object-segmentation-on-youtube-vos | SwinB-AOT-L (all frames) | F-Measure (Seen): 90.1; F-Measure (Unseen): 86.9; Jaccard (Seen): 85.1; Jaccard (Unseen): 78.4; Overall: 85.1; Params (M): 65.4; Speed (FPS): 5.2
video-object-segmentation-on-youtube-vos | AOT-L (all frames) | F-Measure (Seen): 88.8; F-Measure (Unseen): 87.1; Jaccard (Seen): 83.7; Jaccard (Unseen): 78.4; Overall: 84.5; Params (M): 8.3; Speed (FPS): 6.5
video-object-segmentation-on-youtube-vos | AOT-T | F-Measure (Seen): 84.5; F-Measure (Unseen): 82.2; Jaccard (Seen): 80.1; Jaccard (Unseen): 74.0; Overall: 80.2; Params (M): 5.3; Speed (FPS): 41.0
video-object-segmentation-on-youtube-vos | AOT-L | F-Measure (Seen): 87.9; F-Measure (Unseen): 86.5; Jaccard (Seen): 82.9; Jaccard (Unseen): 77.7; Overall: 83.8; Params (M): 8.3; Speed (FPS): 16.0
video-object-segmentation-on-youtube-vos-2019-2 | AOT | F-Measure (Seen): 88.1; F-Measure (Unseen): 86.3; Jaccard (Seen): 83.5; Jaccard (Unseen): 78.4; Mean Jaccard & F-Measure: 84.1
visual-object-tracking-on-davis-2016 | SwinB-AOT-L | F-measure (Mean): 93.3; J&F: 92.0; Jaccard (Mean): 90.7; Speed (FPS): 12.1
visual-object-tracking-on-davis-2016 | AOT-L | F-measure (Mean): 91.1; J&F: 90.4; Jaccard (Mean): 89.6; Speed (FPS): 18.7
visual-object-tracking-on-davis-2016 | AOT-L | F-measure (Mean): 91.1; J&F: 89.9; Jaccard (Mean): 88.7; Speed (FPS): 29.6
visual-object-tracking-on-davis-2016 | R50-AOT-L | F-measure (Mean): 92.1; J&F: 91.1; Jaccard (Mean): 90.1; Speed (FPS): 18.0
visual-object-tracking-on-davis-2016 | AOT-S | F-measure (Mean): 90.2; J&F: 89.4; Jaccard (Mean): 88.6; Speed (FPS): 40.0
visual-object-tracking-on-davis-2016 | AOT-T | F-measure (Mean): 87.4; J&F: 86.8; Jaccard (Mean): 86.1; Speed (FPS): 51.4
visual-object-tracking-on-davis-2017 | AOT-S | F-measure (Mean): 83.9; J&F: 81.3; Jaccard (Mean): 78.7; Params (M): 7.0; Speed (FPS): 40.0
visual-object-tracking-on-davis-2017 | SwinB-AOT-L | F-measure (Mean): 88.4; J&F: 85.4; Jaccard (Mean): 82.4; Params (M): 65.4; Speed (FPS): 12.1
visual-object-tracking-on-davis-2017 | R50-AOT-L | F-measure (Mean): 87.5; J&F: 84.9; Jaccard (Mean): 82.3; Params (M): 14.9; Speed (FPS): 18.0
visual-object-tracking-on-davis-2017 | AOT-T | F-measure (Mean): 82.3; J&F: 79.9; Jaccard (Mean): 77.4; Params (M): 5.7; Speed (FPS): 51.4
visual-object-tracking-on-davis-2017 | AOT-L | F-measure (Mean): 86.4; J&F: 83.8; Jaccard (Mean): 81.1; Params (M): 8.3; Speed (FPS): 18.7
visual-object-tracking-on-davis-2017 | AOT-B | F-measure (Mean): 85.2; J&F: 82.5; Jaccard (Mean): 79.7; Params (M): 8.3; Speed (FPS): 29.6
visual-object-tracking-on-vot2022 | MS_AOT | EAO: 0.673
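
The aggregate scores in the listing follow from the per-metric values, and a short sketch makes the relationship explicit. The J&F score is the mean of the Jaccard (region similarity) and F-measure (boundary accuracy) means, and the YouTube-VOS overall score is the mean of the four seen/unseen J and F values. The helper names below are hypothetical, and the boundary F-measure itself is not computed here because it requires contour matching; only the Jaccard index is shown on a toy mask pair.

```python
def jaccard(pred, gt):
    """Region similarity J: intersection over union of two binary masks (lists of rows)."""
    pairs = [(p, g) for pred_row, gt_row in zip(pred, gt) for p, g in zip(pred_row, gt_row)]
    intersection = sum(1 for p, g in pairs if p and g)
    union = sum(1 for p, g in pairs if p or g)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union


def j_and_f(j_mean, f_mean):
    """J&F: arithmetic mean of the Jaccard mean and the F-measure mean."""
    return (j_mean + f_mean) / 2


def youtube_vos_overall(j_seen, j_unseen, f_seen, f_unseen):
    """YouTube-VOS overall score: mean of J and F over seen and unseen categories."""
    return (j_seen + j_unseen + f_seen + f_unseen) / 4


# Toy masks: 2 pixels overlap out of 4 pixels covered by either mask.
pred = [[1, 1, 0], [0, 1, 0]]
gt = [[1, 0, 0], [0, 1, 1]]
toy_j = jaccard(pred, gt)  # 0.5

# SwinB-AOT-L row of the first benchmark: J = 77.3, F = 85.1, listed J&F = 81.2.
swinb_jf = round(j_and_f(77.3, 85.1), 1)

# AOT-T row on YouTube-VOS: listed overall = 80.2.
aot_t_overall = round(youtube_vos_overall(80.1, 74.0, 84.5, 82.2), 1)
```

Both rounded values reproduce the figures in the listing, which is a quick consistency check when copying rows out of the table.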
