
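Abstract
The general domain of video segmentation is currently fragmented into different tasks spanning multiple benchmarks. Despite rapid progress in the state of the art, current methods are overwhelmingly task-specific and cannot conceptually generalize to other tasks. Inspired by recent approaches with multi-task capability, we propose TarViS: a novel, unified network architecture that can be applied to any task that requires segmenting a set of arbitrarily defined "targets" in video. Our approach is flexible with respect to how tasks define these targets, since it models the latter as abstract "queries" which are then used to predict pixel-precise target masks. A single TarViS model can be trained jointly on a collection of datasets spanning different tasks, and can switch between tasks during inference without any task-specific retraining. To demonstrate its effectiveness, we apply TarViS to four different tasks: Video Instance Segmentation (VIS), Video Panoptic Segmentation (VPS), Video Object Segmentation (VOS), and Point Exemplar-guided Tracking (PET). Our unified, jointly trained model achieves state-of-the-art performance on 5 of the 7 benchmarks spanning these four tasks, and competitive performance on the remaining two. Code and model weights are available at: https://github.com/Ali2500/TarViS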
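To make the idea of "queries that predict pixel-level target masks" concrete, here is a minimal sketch in plain Python. It assumes that each query is a feature vector and that a pixel belongs to a target when the dot product between the query and that pixel's feature is positive. This is a common design in query-based segmentation and is an assumption here, not something the abstract states. The function name `predict_masks`, the shapes, and the toy values are all hypothetical and are not taken from the TarViS code base.

```python
# Illustrative sketch only: a dot-product mask head is ASSUMED, not confirmed
# by the abstract. All names and values below are hypothetical.

def predict_masks(queries, features, threshold=0.0):
    """Turn target queries into binary masks.

    queries:  list of Q vectors, each of length C (one per target)
    features: H x W grid of per-pixel vectors, each of length C
    returns:  Q masks, each an H x W grid of 0/1 values
    """
    masks = []
    for query in queries:
        mask = []
        for row in features:
            mask_row = []
            for pixel in row:
                # Similarity between this target's query and the pixel feature.
                logit = sum(q * f for q, f in zip(query, pixel))
                mask_row.append(1 if logit > threshold else 0)
            mask.append(mask_row)
        masks.append(mask)
    return masks


# Toy 2x2 "frame" with 2-channel features: top row resembles target 0,
# bottom row resembles target 1.
features = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0]],
]
queries = [[1.0, -1.0], [-1.0, 1.0]]

masks = predict_masks(queries, features)
for target_id, mask in enumerate(masks):
    print(target_id, mask)
```

Under this assumption, the mask head never needs to know which task produced a query. A query could come from a semantic class, an object in the first frame, or a point exemplar, which is the flexibility the abstract describes.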
Code repository
Ali2500/TarViS (official, PyTorch)
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| video-instance-segmentation-on-ovis-1 | TarViS (ResNet-50) | AP50: 52.5 AP75: 30.4 AR1: 15.9 AR10: 39.9 mask AP: 31.1 |
| video-instance-segmentation-on-ovis-1 | TarViS (Swin-L) | AP50: 67.8 AP75: 44.6 AR1: 18.0 AR10: 50.4 mask AP: 43.2 |
| video-instance-segmentation-on-ovis-1 | TarViS (Swin-T) | AP50: 55.0 AP75: 34.4 AR1: 16.1 AR10: 40.9 mask AP: 34.0 |
| video-instance-segmentation-on-youtube-vis-2 | TarViS (Swin-L) | AP50: 81.4 AP75: 67.6 AR1: 47.6 AR10: 64.8 mask AP: 60.2 |
| video-instance-segmentation-on-youtube-vis-2 | TarViS (Swin-T) | AP50: 71.6 AP75: 56.6 AR1: 42.2 AR10: 57.2 mask AP: 50.9 |
| video-instance-segmentation-on-youtube-vis-2 | TarViS (ResNet-50) | AP50: 69.6 AP75: 53.2 AR1: 40.5 AR10: 55.9 mask AP: 48.3 |
| video-panoptic-segmentation-on-cityscapes-vps | TarViS (Swin-T) | VPQ: 58.0 VPQ (stuff): 69.0 VPQ (thing): 42.9 |
| video-panoptic-segmentation-on-cityscapes-vps | TarViS (ResNet-50) | VPQ: 53.3 VPQ (stuff): 66.0 VPQ (thing): 35.9 |
| video-panoptic-segmentation-on-cityscapes-vps | TarViS (Swin-L) | VPQ: 58.9 VPQ (stuff): 69.9 VPQ (thing): 43.7 |
| video-panoptic-segmentation-on-kitti-step | TarViS (Swin-T) | AQ: 71.2 SQ: 69.9 STQ: 70.6 |
| video-panoptic-segmentation-on-kitti-step | TarViS (Swin-L) | AQ: 72.0 SQ: 72.0 STQ: 73.0 |
| video-panoptic-segmentation-on-kitti-step | TarViS (ResNet-50) | AQ: 70.3 SQ: 68.8 STQ: 69.6 |
| video-panoptic-segmentation-on-vipseg | TarViS (ResNet-50) | STQ: 43.1 VPQ: 33.5 |
| video-panoptic-segmentation-on-vipseg | TarViS (Swin-L) | STQ: 52.9 VPQ: 48.0 |
| video-panoptic-segmentation-on-vipseg | TarViS (Swin-T) | STQ: 45.3 VPQ: 35.8 |
| visual-object-tracking-on-davis-2017 | TarViS | F-measure (Mean): 88.5 J&F: 85.3 Jaccard (Mean): 81.7 |