5 months ago

Video K-Net: A Simple, Strong, and Unified Baseline for Video Segmentation

Xiangtai Li; Wenwei Zhang; Jiangmiao Pang; Kai Chen; Guangliang Cheng; Yunhai Tong; Chen Change Loy

Abstract

This paper presents Video K-Net, a simple, strong, and unified framework for fully end-to-end video panoptic segmentation. The method builds on K-Net, which unifies image segmentation through a group of learnable kernels. We observe that these learnable kernels, which encode object appearance and context, can naturally associate identical instances across video frames. Motivated by this observation, Video K-Net learns to simultaneously segment and track "things" and "stuff" in a video with simple kernel-based appearance modeling and cross-temporal kernel interaction. Despite its simplicity, it achieves state-of-the-art video panoptic segmentation results on Cityscapes-VPS, KITTI-STEP, and VIPSeg without bells and whistles. In particular, on KITTI-STEP, the method yields a relative improvement of almost 12% over previous methods. On VIPSeg, Video K-Net yields a relative improvement of almost 15% and reaches 39.8% VPQ. We also validate its generalization on video semantic segmentation, where it improves various baselines by 2% on the VSPW dataset. Moreover, we extend K-Net into a clip-level video framework for video instance segmentation, obtaining 40.5% mAP with a ResNet50 backbone and 54.1% mAP with Swin-base on the YouTube-VIS 2019 validation set. We hope this simple yet effective method can serve as a new, flexible baseline for unified video segmentation design. Both code and models are released at https://github.com/lxtGH/Video-K-Net.
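
The observation that kernels can associate identical instances across frames can be illustrated with a small sketch. It is not the released implementation: it assumes each frame yields one embedding vector per kernel, and it swaps in plain cosine similarity with greedy one-to-one matching for the learned association described in the abstract. The function name `associate_kernels` and the `min_similarity` threshold are hypothetical.

```python
import numpy as np


def associate_kernels(prev_kernels, curr_kernels, min_similarity=0.5):
    """Greedily match per-frame kernel embeddings by cosine similarity.

    Illustrative stand-in only (an assumption, not the paper's method).
    prev_kernels: array of shape (num_prev, dim) from the previous frame.
    curr_kernels: array of shape (num_curr, dim) from the current frame.
    Returns a dict mapping current-frame index -> previous-frame index.
    """
    eps = 1e-12  # guards against division by zero for all-zero embeddings
    prev = prev_kernels / (np.linalg.norm(prev_kernels, axis=1, keepdims=True) + eps)
    curr = curr_kernels / (np.linalg.norm(curr_kernels, axis=1, keepdims=True) + eps)
    similarity = curr @ prev.T  # shape (num_curr, num_prev)

    matches = {}
    used_prev = set()
    # Visit candidate pairs from most to least similar.
    for flat_index in np.argsort(-similarity, axis=None):
        c, p = np.unravel_index(flat_index, similarity.shape)
        c, p = int(c), int(p)
        if similarity[c, p] < min_similarity:
            break  # every remaining pair is even less similar
        if c in matches or p in used_prev:
            continue  # keep the assignment one-to-one
        matches[c] = p
        used_prev.add(p)
    return matches


# Toy data: the current frame holds the same "instances" in shuffled order,
# with a small perturbation standing in for appearance change between frames.
rng = np.random.default_rng(0)
prev_kernels = rng.normal(size=(4, 16))
shuffled_order = [2, 0, 3, 1]
curr_kernels = prev_kernels[shuffled_order] + 0.01 * rng.normal(size=(4, 16))
matches = associate_kernels(prev_kernels, curr_kernels)
print(matches)
```

In this toy setup each current-frame kernel is matched back to the previous-frame kernel it was derived from. Unmatched kernels (those below the threshold) would correspond to instances that newly appear or disappear.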

Code Repositories

lxtgh/video-k-net (official, PyTorch)

Benchmarks

Benchmark | Methodology | Metrics
video-instance-segmentation-on-youtube-vis-1 | Video K-Net (Swin-Base) | AP50: 79.0; AP75: 59.6; AR1: 49.7; AR10: 59.9; mask AP: 54.1
video-panoptic-segmentation-on-cityscapes-vps | Video K-Net (Swin-B) | VPQ: 62.2; VPQ (stuff): 71.8; VPQ (thing): 49.8
video-panoptic-segmentation-on-kitti-step | Video K-Net (Swin-L) | AQ: 73.0; SQ: 75.0; STQ: 74.0
