DeVIS: Making Deformable Transformers Work for Video Instance Segmentation
Adrià Caelles Tim Meinhardt Guillem Brasó Laura Leal-Taixé

Abstract
Video Instance Segmentation (VIS) jointly tackles multi-object detection, tracking, and segmentation in video sequences. In the past, VIS methods mirrored the fragmentation of these subtasks in their architectural design and hence missed out on a joint solution. Transformers recently made it possible to cast the entire VIS task as a single set-prediction problem. Nevertheless, the quadratic complexity of existing Transformer-based methods entails long training times and high memory requirements, and restricts processing to low-resolution, single-scale feature maps. Deformable attention provides a more efficient alternative, but its application to the temporal domain or the segmentation task has not yet been explored. In this work, we present Deformable VIS (DeVIS), a VIS method that capitalizes on the efficiency and performance of deformable Transformers. To reason about all VIS subtasks jointly over multiple frames, we present temporal multi-scale deformable attention with instance-aware object queries. We further introduce a new image and video instance mask head with multi-scale features, and perform near-online video processing with multi-cue clip tracking. DeVIS reduces memory as well as training time requirements, and achieves state-of-the-art results on YouTube-VIS 2021 as well as the challenging OVIS dataset. Code is available at https://github.com/acaelles97/DeVIS.
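To make the efficiency argument concrete, the following back-of-envelope sketch counts query-key interactions for dense attention over all tokens of a clip versus a deformable-style scheme in which each query attends to a fixed number of sampled locations. It is only an illustration: the function names and all default values (heads, levels, points, sampled frames, clip length, feature-map sizes) are assumptions chosen for the example, not the settings used by DeVIS, and the count ignores projection and sampling overheads.

```python
def full_attention_pairs(frames, height, width):
    """Query-key pairs when every clip token attends to every other token."""
    tokens = frames * height * width
    return tokens * tokens  # quadratic in the number of tokens


def deformable_attention_pairs(frames, height, width,
                               heads=8, levels=1, points=4, sampled_frames=1):
    """Query-key pairs when each token attends to a fixed set of samples.

    All defaults are illustrative assumptions, not values from the paper.
    """
    tokens = frames * height * width
    samples_per_query = heads * levels * points * sampled_frames
    return tokens * samples_per_query  # linear in the number of tokens


if __name__ == "__main__":
    clip_length = 6  # hypothetical number of frames per clip
    for side in (16, 32, 64):  # hypothetical feature-map side lengths
        dense = full_attention_pairs(clip_length, side, side)
        sparse = deformable_attention_pairs(
            clip_length, side, side, sampled_frames=clip_length
        )
        print(f"{side}x{side}: dense={dense:,} sparse={sparse:,}")
```

Doubling the feature-map side length multiplies the dense count by 16 but the sampled count only by 4, which is why a fixed sampling budget per query can make higher-resolution and multi-scale feature maps affordable.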
Benchmarks
| Benchmark | Method | AP50 | AP75 | AR1 | AR10 | mask AP |
|---|---|---|---|---|---|---|
| video-instance-segmentation-on-ovis-1 | DeVIS (Swin-L) | 59.3 | 38.3 | 16.6 | 39.8 | 35.5 |
| video-instance-segmentation-on-ovis-1 | DeVIS (ResNet-50) | 47.6 | 20.8 | 12.0 | 28.9 | 23.7 |
| video-instance-segmentation-on-youtube-vis-1 | DeVIS (ResNet-50) | 66.7 | 48.6 | 42.4 | 51.6 | 44.4 |
| video-instance-segmentation-on-youtube-vis-1 | DeVIS (Swin-L) | 80.8 | 66.3 | 50.8 | 61.0 | 57.1 |
| video-instance-segmentation-on-youtube-vis-2 | DeVIS (Swin-L) | 77.7 | 59.8 | 43.8 | 57.8 | 54.4 |
| video-instance-segmentation-on-youtube-vis-2 | DeVIS (ResNet-50) | 66.8 | 46.6 | 38.0 | 50.1 | 43.1 |