InstanceFormer: An Online Video Instance Segmentation Framework

Rajat Koner, Tanveer Hannan, Suprosanna Shit, Sahand Sharifzadeh, Matthias Schubert, Thomas Seidl, Volker Tresp

Abstract

Recent transformer-based offline video instance segmentation (VIS) approaches achieve encouraging results and significantly outperform online approaches. However, their reliance on the whole video and the immense computational complexity of full spatio-temporal attention limit their use in real-life applications such as processing lengthy videos. In this paper, we propose InstanceFormer, an efficient single-stage transformer-based online VIS framework that is especially suitable for long and challenging videos. We propose three novel components to model short-term and long-term dependency and temporal coherence. First, we propagate the representation, location, and semantic information of prior instances to model short-term changes. Second, we propose a novel memory cross-attention in the decoder, which allows the network to look into earlier instances within a certain temporal window. Finally, we employ a temporal contrastive loss to impose coherence in the representation of an instance across all frames. Memory attention and temporal coherence are particularly beneficial for modeling long-range dependencies, including in challenging scenarios such as occlusion. The proposed InstanceFormer outperforms previous online benchmark methods by a large margin across multiple datasets. Most importantly, InstanceFormer surpasses offline approaches on challenging and long datasets such as YouTube-VIS-2021 and OVIS. Code is available at https://github.com/rajatkoner08/InstanceFormer.
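
The abstract names two of the components only at a high level, so the following is a minimal sketch of the general ideas, not the authors' implementation. It assumes a fixed-size window of per-frame instance embeddings, single-head dot-product attention without learned projections, and an InfoNCE-style contrastive loss, which is a common formulation swapped in here because the abstract does not give the exact loss. All function names, the window size, and the temperature value are hypothetical.

```python
import math
from collections import deque


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def memory_cross_attention(query, memory):
    """Attend from one instance query to embeddings stored from earlier frames.

    Assumption: keys and values are the stored embeddings themselves
    (no learned projections, single head).
    """
    keys = [emb for frame in memory for emb in frame]
    if not keys:
        return list(query)  # first frame: nothing to attend to
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    return [sum(w * k[d] for w, k in zip(weights, keys))
            for d in range(len(query))]


def temporal_contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss (assumed form): the same instance in another frame
    is the positive, other instances are negatives."""
    def cos(u, v):
        return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

    logits = [cos(anchor, positive) / temperature]
    logits += [cos(anchor, n) / temperature for n in negatives]
    return -math.log(softmax(logits)[0])


# Hypothetical temporal window of 2 frames; the deque evicts the oldest frame.
memory = deque(maxlen=2)
memory.append([[1.0, 0.0], [0.0, 1.0]])  # oldest frame, evicted below
memory.append([[0.9, 0.1], [0.1, 0.9]])
memory.append([[0.8, 0.2], [0.2, 0.8]])
attended = memory_cross_attention([1.0, 0.0], memory)

# The loss is small when the positive really is the same instance,
# and large when the positive and negative are swapped.
same = temporal_contrastive_loss([1.0, 0.0], [0.9, 0.1], [[0.0, 1.0]])
swapped = temporal_contrastive_loss([1.0, 0.0], [0.0, 1.0], [[0.9, 0.1]])
```

In this toy setup the attended vector leans toward the stored embeddings that resemble the query, and bounding the memory with a fixed window keeps the per-frame cost independent of video length, which is the property the abstract highlights for long videos.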

Code Repositories

rajatkoner08/instanceformer (official, PyTorch)

Benchmarks

| Benchmark | Method | AP50 | AP75 | AR1 | AR10 | mask AP |
|---|---|---|---|---|---|---|
| video-instance-segmentation-on-ovis-1 | InstanceFormer (Swin-L) | 42.5 | 21.61 | 12.9 | 29.3 | 22.8 |
| video-instance-segmentation-on-ovis-1 | InstanceFormer (ResNet-50) | 40.7 | 18.1 | 12 | 27.1 | 20.0 |
| video-instance-segmentation-on-youtube-vis-1 | InstanceFormer (Swin-L) | 78.0 | 64.2 | 50.9 | 61.6 | 56.3 |
| video-instance-segmentation-on-youtube-vis-1 | InstanceFormer (ResNet-50) | 68.6 | 49.6 | 42.1 | 53.5 | 45.6 |
| video-instance-segmentation-on-youtube-vis-2 | InstanceFormer (Swin-L) | 73.7 | 56.9 | 42.8 | 56.0 | 51.0 |
| video-instance-segmentation-on-youtube-vis-2 | InstanceFormer (ResNet-50) | 62.4 | 43.7 | 36.1 | 48.1 | 40.8 |

| Benchmark | Method | AP50_L | AP75_L | AR1_L | AR10_L | mAP_L |
|---|---|---|---|---|---|---|
| video-instance-segmentation-on-youtube-vis-3 | InstanceFormer (Swin) | 44.6 | 27.3 | 25.0 | 29.2 | 26.3 |
| video-instance-segmentation-on-youtube-vis-3 | InstanceFormer (ResNet-50) | 49.5 | 26.7 | 23.9 | 30.1 | 24.8 |
