5 months ago

Tracking Anything with Decoupled Video Segmentation

Ho Kei Cheng; Seoung Wug Oh; Brian Price; Alexander Schwing; Joon-Young Lee

Abstract

Training data for video segmentation are expensive to annotate. This impedes extensions of end-to-end algorithms to new video segmentation tasks, especially in large-vocabulary settings. To 'track anything' without training on video data for every individual task, we develop a decoupled video segmentation approach (DEVA), composed of task-specific image-level segmentation and class/task-agnostic bi-directional temporal propagation. Due to this design, we only need an image-level model for the target task (which is cheaper to train) and a universal temporal propagation model which is trained once and generalizes across tasks. To effectively combine these two modules, we use bi-directional propagation for (semi-)online fusion of segmentation hypotheses from different frames to generate a coherent segmentation. We show that this decoupled formulation compares favorably to end-to-end approaches in several data-scarce tasks including large-vocabulary video panoptic segmentation, open-world video segmentation, referring video segmentation, and unsupervised video object segmentation. Code is available at: https://hkchengrex.github.io/Tracking-Anything-with-DEVA
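The fusion step described in the abstract can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes masks are plain sets of pixel coordinates, that segments already propagated from earlier frames are matched greedily to new image-level detections by IoU, and that a threshold of 0.5 decides whether a detection refreshes an existing track or starts a new one. The names `iou`, `merge_hypotheses`, `next_id` and `thresh` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set, Tuple

Pixel = Tuple[int, int]
Mask = Set[Pixel]


def iou(a: Mask, b: Mask) -> float:
    """Intersection-over-union of two masks stored as pixel sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def merge_hypotheses(
    propagated: Dict[int, Mask],
    detected: List[Mask],
    next_id: int,
    thresh: float = 0.5,  # assumed threshold, for illustration only
) -> Tuple[Dict[int, Mask], int]:
    """Fuse temporally propagated tracks with fresh image-level detections.

    Each detection is matched to the unused propagated track with the
    highest IoU. A match at or above `thresh` refreshes that track's mask;
    otherwise the detection becomes a new track. Unmatched tracks are kept.
    """
    merged = dict(propagated)  # unmatched tracks survive unchanged
    used: Set[int] = set()
    for det in detected:
        best_id, best_iou = None, 0.0
        for track_id, mask in propagated.items():
            if track_id in used:
                continue
            score = iou(mask, det)
            if score > best_iou:
                best_id, best_iou = track_id, score
        if best_id is not None and best_iou >= thresh:
            merged[best_id] = det  # keep the identity, update the mask
            used.add(best_id)
        else:
            merged[next_id] = det  # previously unseen object
            next_id += 1
    return merged, next_id


# One track propagated from the past, two detections in the current frame.
propagated = {1: {(0, 0), (0, 1), (1, 0), (1, 1)}}
detected = [{(0, 0), (0, 1), (1, 0)}, {(5, 5), (5, 6)}]
merged, next_id = merge_hypotheses(propagated, detected, next_id=2)
print(sorted(merged), next_id)
```

In this toy setup the first detection overlaps track 1 with IoU 0.75 and keeps its identity, while the second detection overlaps nothing and is assigned a new identity. Only the matching logic is sketched here; the propagation model itself and the task-specific image model are left out.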

Code Repositories

hkchengrex/Tracking-Anything-with-DEVA (official, PyTorch, mentioned in GitHub)

Benchmarks

| Benchmark | Methodology | Metrics |
| --- | --- | --- |
| open-world-video-segmentation-on-burst-val | DEVA (Mask2Former) | OWTA (all): 69.9; OWTA (com): 75.2; OWTA (unc): 41.5 |
| open-world-video-segmentation-on-burst-val | DEVA (EntitySeg) | OWTA (all): 69.5; OWTA (com): 73.3; OWTA (unc): 50.5 |
| referring-expression-segmentation-on-davis | DEVA (ReferFormer) | J&F 1st frame: 66.3 |
| referring-expression-segmentation-on-refer-1 | DEVA (ReferFormer) | J&F: 66.0 |
| semi-supervised-video-object-segmentation-on-1 | DEVA | F-measure (Mean): 86.8; FPS: 25.3; J&F: 83.2; Jaccard (Mean): 79.6 |
| semi-supervised-video-object-segmentation-on-21 | DEVA (no OVIS) | F: 64.3; FPS: 25.3; J: 55.8; J&F: 60.0 |
| semi-supervised-video-object-segmentation-on-21 | DEVA (with OVIS) | F: 70.8; FPS: 25.3; J: 62.3; J&F: 66.5 |
| unsupervised-video-object-segmentation-on-10 | DEVA (DIS) | F: 90.2; G: 88.9; J: 87.6 |
| unsupervised-video-object-segmentation-on-4 | DEVA (EntitySeg) | F-measure (Mean): 76.4; J&F: 73.4; Jaccard (Mean): 70.4 |
| unsupervised-video-object-segmentation-on-5 | DEVA (EntitySeg) | J&F: 62.1 |
| video-panoptic-segmentation-on-vipseg | DEVA (Mask2Former - SwinB) | STQ: 52.2; VPQ: 55.0 |
| visual-object-tracking-on-davis-2017 | DEVA | F-measure (Mean): 91.0; J&F: 87.6; Jaccard (Mean): 84.2; Speed (FPS): 25.3 |
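
The J&F metric in the benchmarks above is conventionally the arithmetic mean of region similarity J (Jaccard) and contour accuracy F. As a quick sanity check, the short sketch below recomputes it for the rows that list all three values. The helper name `j_and_f` and the row labels are hypothetical, and the tolerance is an assumption chosen to allow for rounding to one decimal place.

```python
def j_and_f(j: float, f: float) -> float:
    """Mean of region similarity J and contour accuracy F."""
    return (j + f) / 2.0


# (label, J, F, reported J&F) copied from the benchmark rows above.
rows = [
    ("semi-supervised-on-1, DEVA", 79.6, 86.8, 83.2),
    ("semi-supervised-on-21, DEVA (no OVIS)", 55.8, 64.3, 60.0),
    ("semi-supervised-on-21, DEVA (with OVIS)", 62.3, 70.8, 66.5),
    ("unsupervised-on-4, DEVA (EntitySeg)", 70.4, 76.4, 73.4),
    ("visual-object-tracking-on-davis-2017, DEVA", 84.2, 91.0, 87.6),
]

# Largest gap between the recomputed mean and the reported value.
max_gap = max(abs(j_and_f(j, f) - reported) for _, j, f, reported in rows)

for label, j, f, reported in rows:
    print(f"{label}: recomputed {j_and_f(j, f):.2f}, reported {reported}")
```

The recomputed means agree with the reported values to within rounding. The G value in the unsupervised-on-10 row (88.9) is likewise the mean of its J (87.6) and F (90.2).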
