DVIS: Decoupled Video Instance Segmentation Framework

Tao Zhang; Xingye Tian; Yu Wu; Shunping Ji; Xuebo Wang; Yuan Zhang; Pengfei Wan


Abstract

Video instance segmentation (VIS) is a critical task with diverse applications, including autonomous driving and video editing. Existing methods often underperform on complex, long real-world videos, primarily for two reasons. First, offline methods are limited by a tightly coupled modeling paradigm that treats all frames equally and disregards the interdependencies between adjacent frames, introducing excessive noise during long-term temporal alignment. Second, online methods make inadequate use of temporal information. To tackle these challenges, we propose a decoupling strategy for VIS that divides it into three independent sub-tasks: segmentation, tracking, and refinement. The efficacy of this strategy rests on two crucial elements: 1) attaining accurate long-term alignment through frame-by-frame association during tracking, and 2) effectively exploiting temporal information on top of these accurate alignment results during refinement. We introduce a novel referring tracker and temporal refiner to construct the Decoupled VIS framework (DVIS). DVIS achieves new SOTA performance in both VIS and VPS, surpassing the current SOTA methods by 7.3 AP and 9.6 VPQ on the OVIS and VIPSeg datasets, the most challenging and realistic benchmarks. Moreover, thanks to the decoupling strategy, the referring tracker and temporal refiner are extremely lightweight (only 1.69% of the segmenter's FLOPs), allowing efficient training and inference on a single GPU with 11 GB of memory. The code is available at https://github.com/zhang-tao-whu/DVIS.
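The decoupled pipeline described in the abstract can be sketched in miniature. This is not the authors' code: the function names, the greedy nearest-neighbour matching, and the mean-over-time aggregation are illustrative stand-ins for the learned referring tracker and temporal refiner, chosen only to make the segment → track → refine structure concrete.

```python
# Hypothetical sketch (not the DVIS implementation): a segmenter yields
# id-agnostic per-frame instance features, a referring tracker aligns them
# frame by frame, and a temporal refiner aggregates the aligned features.

def sq_dist(a, b):
    """Squared L2 distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def referring_track(frames):
    """Frame-by-frame association: each track keeps a reference feature
    that is greedily matched against the next frame's instances."""
    refs = list(frames[0])          # initial references from frame 0
    tracks = [[r] for r in refs]
    for insts in frames[1:]:
        used = set()
        for i, ref in enumerate(refs):
            # nearest still-unmatched instance in the current frame
            j = min((k for k in range(len(insts)) if k not in used),
                    key=lambda k: sq_dist(ref, insts[k]))
            used.add(j)
            tracks[i].append(insts[j])
            refs[i] = insts[j]      # update the reference frame by frame
    return tracks

def temporal_refine(tracks):
    """Temporal-refiner stand-in: average each aligned track over time."""
    return [tuple(sum(c) / len(c) for c in zip(*tr)) for tr in tracks]

# Two frames with two instances each; the second frame lists them in
# swapped order, so correct tracking must re-associate them.
frames = [[(0.0,), (10.0,)], [(10.5,), (0.2,)]]
tracks = referring_track(frames)    # tracks[0] follows the ~0.0 object
refined = temporal_refine(tracks)
```

The point of the sketch is the division of labour: tracking only needs local, frame-to-frame matching, while refinement operates on already-aligned tracks, which is why the two modules can stay small relative to the segmenter.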

Code Repositories

zhang-tao-whu/DVIS (official, PyTorch)

Benchmarks

Benchmark — Method — Metrics

- video-instance-segmentation-on-ovis-1 — DVIS (Swin-L, Offline): mask AP 49.9, AP50 75.9, AP75 53.0, AR1 19.4, AR10 55.3
- video-instance-segmentation-on-ovis-1 — DVIS (Swin-L, Online): mask AP 47.1, AP50 71.9, AP75 49.2, AR1 19.4, AR10 52.5
- video-instance-segmentation-on-youtube-vis-1 — DVIS: mask AP 64.9, AP50 88.0, AP75 72.7, AR1 56.5, AR10 70.3
- video-instance-segmentation-on-youtube-vis-2 — DVIS (Swin-L): mask AP 60.1, AP50 83.0, AP75 68.4, AR1 47.7, AR10 65.7
- video-instance-segmentation-on-youtube-vis-3 — DVIS (Swin-L): mAP_L 45.9, AP50_L 69.0, AP75_L 48.8, AR1_L 37.2, AR10_L 51.8
- video-panoptic-segmentation-on-vipseg — DVIS (Swin-L): VPQ 57.6, STQ 55.3
