STEm-Seg: Spatio-temporal Embeddings for Instance Segmentation in Videos

Ali Athar, Sabarinath Mahadevan, Aljoša Ošep, Laura Leal-Taixé, Bastian Leibe

Abstract

Existing methods for instance segmentation in videos typically involve multi-stage pipelines that follow the tracking-by-detection paradigm and model a video clip as a sequence of images. Multiple networks are used to detect objects in individual frames and then associate these detections over time. Hence, these methods are often not end-to-end trainable and are highly tailored to specific tasks. In this paper, we propose a different approach that is well-suited to a variety of tasks involving instance segmentation in videos. In particular, we model a video clip as a single 3D spatio-temporal volume, and propose a novel approach that segments and tracks instances across space and time in a single stage. Our problem formulation is centered around the idea of spatio-temporal embeddings that are trained to cluster pixels belonging to a specific object instance over an entire video clip. To this end, we introduce (i) novel mixing functions that enhance the feature representation of spatio-temporal embeddings, and (ii) a single-stage, proposal-free network that can reason about temporal context. Our network is trained end-to-end to learn spatio-temporal embeddings as well as the parameters required to cluster these embeddings, thus simplifying inference. Our method achieves state-of-the-art results across multiple datasets and tasks. Code and models are available at https://github.com/sabarim/STEm-Seg.
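
To make the clustering idea concrete, the following is a minimal sketch of how per-pixel embeddings over a whole clip could be grouped into instances. It is an illustration under stated assumptions, not the released implementation: the Gaussian membership score, the hand-picked seed pixels, and the names instance_probability, cluster_clip, sigma and threshold are all hypothetical and are not taken from the abstract. In the method described above, the embeddings and the clustering parameters are predicted by the network; here they are filled in by hand for a toy clip.

```python
import numpy as np


def instance_probability(embeddings, center, sigma):
    """Gaussian membership score of every pixel in the clip (assumed form).

    embeddings: (T, H, W, E) array, one E-dimensional embedding per pixel
    center:     (E,) embedding taken as the instance center
    sigma:      (E,) per-dimension bandwidth (hand-set here; assumption)
    """
    diff = (embeddings - center) / sigma
    return np.exp(-0.5 * np.sum(diff ** 2, axis=-1))


def cluster_clip(embeddings, seeds, sigma, threshold=0.5):
    """Assign one instance id per seed across the whole T x H x W volume."""
    labels = np.zeros(embeddings.shape[:-1], dtype=np.int64)  # 0 = unassigned
    for instance_id, (t, y, x) in enumerate(seeds, start=1):
        center = embeddings[t, y, x]
        prob = instance_probability(embeddings, center, sigma)
        # Only claim pixels that no earlier instance has taken.
        labels[(prob > threshold) & (labels == 0)] = instance_id
    return labels


# Toy clip: 3 frames of 4 x 6 pixels with 2-D embeddings (all values made up).
T, H, W = 3, 4, 6
embeddings = np.full((T, H, W, 2), -5.0)      # background embedding
for t in range(T):
    embeddings[t, 0:2, t:t + 2] = 0.0         # instance 1 moves right each frame
    embeddings[t, 2:4, 4:6] = 5.0             # instance 2 stays in place

sigma = np.array([1.0, 1.0])
seeds = [(0, 0, 0), (0, 3, 5)]                # one (t, y, x) seed per instance
labels = cluster_clip(embeddings, seeds, sigma)
```

Because the clustering runs over the full volume rather than frame by frame, the moving object keeps the same id in every frame without a separate association step, which is the property the single-stage formulation relies on.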

Code Repositories

sabarim/STEm-Seg (official, PyTorch)

Benchmarks

Benchmark: unsupervised-video-object-segmentation-on-4
Methodology: STEm-Seg
Metrics:
F-measure (Mean): 67.8
F-measure (Recall): 75.5
J&F: 64.7
Jaccard (Mean): 61.5
Jaccard (Recall): 70.4

Benchmark: video-instance-segmentation-on-youtube-vis-1
Methodology: STEm-Seg (ResNet-101)
Metrics:
AP50: 55.8
AP75: 37.9
AR1: 34.4
AR10: 41.6
mask AP: 34.6

Benchmark: video-instance-segmentation-on-youtube-vis-1
Methodology: STEm-Seg (ResNet-50)
Metrics:
AP50: 50.7
AP75: 37.9
AR1: 34.4
AR10: 41.6
mask AP: 30.6
