
Abstract
Training data for video segmentation is expensive to annotate. This impedes extending end-to-end algorithms to new video segmentation tasks, especially in large-vocabulary settings. To be able to 'track anything' without training on video data for every individual task, we develop a decoupled video segmentation approach (DEVA), composed of task-specific image-level segmentation and class/task-agnostic bi-directional temporal propagation. Due to this design, we only need an image-level model for the target task (which is cheaper to train) and a universal temporal propagation model, which is trained once and generalizes across tasks. To effectively combine these two modules, we use bi-directional propagation for (semi-)online fusion of segmentation hypotheses from different frames to generate a coherent segmentation. We show that this decoupled formulation outperforms end-to-end approaches in several data-scarce tasks, including large-vocabulary video panoptic segmentation, open-world video segmentation, referring video segmentation, and unsupervised video object segmentation. Code is available at: https://hkchengrex.github.io/Tracking-Anything-with-DEVA
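The abstract describes a two-module design: a task-specific image model proposes segments on individual frames, while a task-agnostic propagation module carries previously tracked segments forward, and the two hypothesis sets are fused into a coherent result. The sketch below is a minimal, hypothetical simplification of that loop (the `image_model`, `propagate`, and `fuse` functions are stand-ins, not DEVA's actual API): segments are toy sets of pixel indices, propagation is forward-only identity, and fusion keeps propagated segments while admitting only non-overlapping new proposals. Real DEVA operates on dense masks with learned bi-directional, (semi-)online propagation.

```python
def image_model(frame):
    """Stand-in for a task-specific image-level segmenter.
    Assumption: returns a list of proposed segments for one frame."""
    return frame["proposals"]


def propagate(memory, frame):
    """Stand-in for class/task-agnostic temporal propagation:
    carry segments remembered from earlier frames into this frame
    (identity propagation here; the real module warps masks)."""
    return list(memory)


def fuse(propagated, proposals, iou_thresh=0.5):
    """Simplified hypothesis fusion: keep all propagated segments,
    and add a new proposal only if it does not substantially overlap
    a segment we are already tracking."""
    def iou(a, b):
        union = len(a | b)
        return len(a & b) / union if union else 0.0

    merged = list(propagated)
    for p in proposals:
        if all(iou(p, q) < iou_thresh for q in merged):
            merged.append(p)
    return merged


def run(frames):
    """Process a clip: segment each frame, propagate tracked segments,
    and fuse the two hypothesis sets into the running memory."""
    memory, outputs = [], []
    for frame in frames:
        propagated = propagate(memory, frame)
        proposals = image_model(frame)
        memory = fuse(propagated, proposals)
        outputs.append(memory)
    return outputs
```

A proposal that mostly overlaps an already-tracked segment (e.g. the same object re-detected a frame later) is absorbed into the existing track, while a genuinely new object spawns a new segment; this is the intuition behind fusing image-level hypotheses with propagated ones.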
Code Repository
hkchengrex/Tracking-Anything-with-DEVA
Official · PyTorch
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| open-world-video-segmentation-on-burst-val | DEVA (Mask2Former) | OWTA (all): 69.9, OWTA (com): 75.2, OWTA (unc): 41.5 |
| open-world-video-segmentation-on-burst-val | DEVA (EntitySeg) | OWTA (all): 69.5, OWTA (com): 73.3, OWTA (unc): 50.5 |
| referring-expression-segmentation-on-davis | DEVA (ReferFormer) | J&F 1st frame: 66.3 |
| referring-expression-segmentation-on-refer-1 | DEVA (ReferFormer) | J&F: 66.0 |
| semi-supervised-video-object-segmentation-on-1 | DEVA | F-measure (Mean): 86.8, FPS: 25.3, J&F: 83.2, Jaccard (Mean): 79.6 |
| semi-supervised-video-object-segmentation-on-21 | DEVA (no OVIS) | F: 64.3, FPS: 25.3, J: 55.8, J&F: 60.0 |
| semi-supervised-video-object-segmentation-on-21 | DEVA (with OVIS) | F: 70.8, FPS: 25.3, J: 62.3, J&F: 66.5 |
| unsupervised-video-object-segmentation-on-10 | DEVA (DIS) | F: 90.2, G: 88.9, J: 87.6 |
| unsupervised-video-object-segmentation-on-4 | DEVA (EntitySeg) | F-measure (Mean): 76.4, J&F: 73.4, Jaccard (Mean): 70.4 |
| unsupervised-video-object-segmentation-on-5 | DEVA (EntitySeg) | J&F: 62.1 |
| video-panoptic-segmentation-on-vipseg | DEVA (Mask2Former - SwinB) | STQ: 52.2, VPQ: 55.0 |
| visual-object-tracking-on-davis-2017 | DEVA | F-measure (Mean): 91.0, J&F: 87.6, Jaccard (Mean): 84.2, Speed (FPS): 25.3 |