Tim Meinhardt; Matt Feiszli; Yuchen Fan; Laura Leal-Taixe; Rakesh Ranjan

Abstract
Until recently, the Video Instance Segmentation (VIS) community operated under the common belief that offline methods are generally superior to frame-by-frame online processing. However, the recent success of online methods calls this belief into question, in particular for challenging and long video sequences. We understand this work as a rebuttal of those recent observations and an appeal to the community to focus on dedicated near-online VIS approaches. To support our argument, we present a detailed analysis of different processing paradigms and the new end-to-end trainable NOVIS (Near-Online Video Instance Segmentation) method. Our transformer-based model directly predicts spatio-temporal mask volumes for clips of frames and performs instance tracking between clips via overlap embeddings. NOVIS is the first near-online VIS approach that avoids any handcrafted tracking heuristics. We outperform all existing VIS methods by large margins and provide new state-of-the-art results on the YouTube-VIS (2019/2021) and OVIS benchmarks.
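The three processing paradigms contrasted in the abstract differ mainly in how many frames are processed at once. The sketch below is an illustration only, not the NOVIS implementation: `clip_windows` and `link_instances` are hypothetical helpers, and the greedy cosine-similarity matching is a simple stand-in for the learned overlap embeddings, which the abstract describes as free of handcrafted heuristics. A clip length of 1 corresponds to online processing, a clip spanning the whole video to offline processing, and anything in between to near-online processing with overlapping clips.

```python
import math


def clip_windows(num_frames, clip_len, stride):
    """Split a video into (start, end) frame ranges, end exclusive.

    Hypothetical helper: stride < clip_len yields overlapping clips.
    """
    if num_frames < 1 or clip_len < 1 or not 1 <= stride <= clip_len:
        raise ValueError("need num_frames >= 1 and 1 <= stride <= clip_len")
    windows = []
    start = 0
    while True:
        end = min(start + clip_len, num_frames)
        windows.append((start, end))
        if end == num_frames:
            break
        start += stride
    return windows


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def link_instances(prev_emb, curr_emb, min_sim=0.5):
    """Greedy one-to-one matching of instances between two clips.

    Assumption for illustration: each instance has one embedding per clip.
    Returns {current_index: previous_index}; unmatched current instances
    would start new tracks.
    """
    pairs = sorted(
        ((cosine(p, c), i, j)
         for i, p in enumerate(prev_emb)
         for j, c in enumerate(curr_emb)),
        reverse=True,
    )
    used_prev, links = set(), {}
    for sim, i, j in pairs:
        if sim < min_sim:
            break  # remaining pairs are even less similar
        if i in used_prev or j in links:
            continue
        links[j] = i
        used_prev.add(i)
    return links


online = clip_windows(10, clip_len=1, stride=1)        # one frame at a time
near_online = clip_windows(10, clip_len=4, stride=2)   # overlapping clips
offline = clip_windows(10, clip_len=10, stride=10)     # whole video at once
links = link_instances([[1, 0], [0, 1]], [[0.1, 0.9], [0.9, 0.1], [-1, -1]])
```

With these inputs, consecutive near-online clips share two frames, and the third instance in the current clip stays unmatched because no previous embedding is similar enough.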
Benchmarks
| Benchmark | Model | mask AP | AP50 | AP75 | AR1 | AR10 |
|---|---|---|---|---|---|---|
| video-instance-segmentation-on-ovis-1 | NOVIS (Swin-L) | 43.5 | 68.3 | 43.8 | 19.4 | 46.9 |
| video-instance-segmentation-on-ovis-1 | NOVIS (ResNet-50) | 32.7 | 56.2 | 32.6 | 15.7 | 37.1 |
| video-instance-segmentation-on-youtube-vis-1 | NOVIS (ResNet-50) | 52.8 | 75.7 | 56.9 | 50.3 | 60.6 |
| video-instance-segmentation-on-youtube-vis-2 | NOVIS (Swin-L) | 59.8 | 82.0 | 66.5 | 47.9 | 64.4 |
| video-instance-segmentation-on-youtube-vis-2 | NOVIS (ResNet-50) | 47.2 | 69.4 | 50.0 | 41.3 | 54.4 |