
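Abstract
We present Cutie, a video object segmentation (VOS) network with object-level memory reading, which puts the object representation stored in memory back into the video object segmentation result. Recent VOS work uses bottom-up pixel-level memory reading, which suffers from matching noise, especially in the presence of distractors, and therefore performs worse on more challenging data. In contrast, Cutie performs top-down object-level memory reading by adapting a small set of object queries. Through these queries, it interacts iteratively with the bottom-up pixel features using a query-based object transformer (qt, hence Cutie). The object queries act as a high-level summary of the target object, while high-resolution feature maps are retained for accurate segmentation. Together with foreground-background masked attention, Cutie cleanly separates the semantics of the foreground object from the background. On the challenging MOSE dataset, Cutie improves on XMem by 8.7 J&F at a similar running time, and improves on DeAOT by 4.2 J&F while running three times faster. Code is available at: https://hkchengrex.github.io/Cutie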
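
To make the idea of foreground-background masked attention more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the first half of the object queries may attend only to foreground pixels and the second half only to background pixels, and that a query whose allowed region is empty falls back to attending everywhere. The function name `masked_attention` and the list-based data layout are hypothetical, chosen only for illustration.

```python
import math


def masked_attention(queries, pixels, fg_mask):
    """Single-head cross-attention from object queries to pixel features.

    queries: list of N query vectors (each a list of C floats)
    pixels:  list of P pixel feature vectors (each a list of C floats)
    fg_mask: list of P booleans, True where the pixel is foreground

    Assumption (illustrative): queries in the first half see only
    foreground pixels, queries in the second half see only background.
    """
    n = len(queries)
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs = []
    for i, q in enumerate(queries):
        wants_fg = i < n // 2
        allowed = [j for j, is_fg in enumerate(fg_mask) if is_fg == wants_fg]
        if not allowed:
            # Degenerate mask: fall back to attending to every pixel.
            allowed = list(range(len(pixels)))
        # Scaled dot-product logits over the allowed pixels only.
        logits = [scale * sum(a * b for a, b in zip(q, pixels[j])) for j in allowed]
        peak = max(logits)
        weights = [math.exp(x - peak) for x in logits]  # stable softmax
        total = sum(weights)
        out = [
            sum(w * pixels[j][c] for w, j in zip(weights, allowed)) / total
            for c in range(len(q))
        ]
        outputs.append(out)
    return outputs


# Two queries, two pixels: pixel 0 is foreground, pixel 1 is background.
queries = [[1.0, 0.0], [1.0, 0.0]]
pixels = [[1.0, 0.0], [0.0, 1.0]]
summary = masked_attention(queries, pixels, [True, False])
```

In this toy setup the first query can only summarize the foreground pixel and the second only the background pixel, even though both queries are identical. This is the kind of separation the abstract attributes to the masked attention.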
Code repository
hkchengrex/Cutie (official implementation, PyTorch)
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| semi-supervised-video-object-segmentation-on-1 | Cutie (base, MEGA) | F-measure (Mean): 89.9; FPS: 36.4; J&F: 86.1; Jaccard (Mean): 82.4 |
| semi-supervised-video-object-segmentation-on-1 | Cutie+ (base) | F-measure (Mean): 89.2; FPS: 17.9; J&F: 85.9; Jaccard (Mean): 82.6 |
| semi-supervised-video-object-segmentation-on-1 | Cutie+ (base, MEGA) | F-measure (Mean): 91.4; FPS: 17.9; J&F: 88.1; Jaccard (Mean): 84.7 |
| semi-supervised-video-object-segmentation-on-18 | Cutie+ (base, MEGA) | F-Measure (Seen): 90.6; F-Measure (Unseen): 90.5; J&F: 17.9; Jaccard (Seen): 86.3; Jaccard (Unseen): 82.7; Overall: 87.5 |
| semi-supervised-video-object-segmentation-on-21 | Cutie (small, MEGA) | F: 72.9; FPS: 45.5; J: 64.3; J&F: 68.6 |
| semi-supervised-video-object-segmentation-on-21 | Cutie+ (base, MEGA) | F: 75.8; FPS: 17.9; J: 67.6; J&F: 71.7 |
| semi-supervised-video-object-segmentation-on-21 | Cutie (base) | F: 67.9; FPS: 36.4; J: 60.0; J&F: 64.0 |
| semi-supervised-video-object-segmentation-on-21 | Cutie+ (small, MEGA) | F: 74.5; FPS: 20.6; J: 66.0; J&F: 70.3 |
| semi-supervised-video-object-segmentation-on-21 | Cutie (small) | F: 66.2; FPS: 45.5; J: 58.2; J&F: 62.2 |
| semi-supervised-video-object-segmentation-on-21 | Cutie (base, with mose) | F: 72.3; FPS: 36.4; J: 64.2; J&F: 68.3 |
| semi-supervised-video-object-segmentation-on-21 | Cutie (base, MEGA) | F: 74.1; FPS: 36.4; J: 65.8; J&F: 69.9 |
| semi-supervised-video-object-segmentation-on-21 | Cutie (small, with mose) | F: 71.7; FPS: 45.5; J: 63.1; J&F: 67.4 |
| semi-supervised-video-object-segmentation-on-22 | Cutie (base, with mose, 600 pixels) | HOTA (all): 58.4; HOTA (common): 61.8; HOTA (uncommon): 57.5 |
| semi-supervised-video-object-segmentation-on-22 | Cutie (base, MEGA, 600 pixels) | HOTA (all): 61.2; HOTA (common): 65.0; HOTA (uncommon): 60.3 |
| semi-supervised-video-object-segmentation-on-23 | Cutie (base, MEGA, 600 pixels) | HOTA (all): 66.0; HOTA (common): 66.5; HOTA (uncommon): 65.9 |
| semi-supervised-video-object-segmentation-on-23 | Cutie (base, with mose, 600 pixels) | HOTA (all): 62.6; HOTA (common): 63.8; HOTA (uncommon): 62.3 |
| video-object-segmentation-on-mose | Cutie | J&F: 68.3 |
| video-object-segmentation-on-youtube-vos | Cutie+ (base, MEGA) | F-Measure (Seen): 91.0; F-Measure (Unseen): 90.1; Jaccard (Seen): 86.6; Jaccard (Unseen): 82.2; Overall: 87.5; Speed (FPS): 17.9 |
| visual-object-tracking-on-davis-2017 | Cutie+ (base, MEGA) | F-measure (Mean): 90.8; J&F: 88.1; Jaccard (Mean): 85.5; Speed (FPS): 17.9 |
| visual-object-tracking-on-davis-2017 | Cutie (base) | F-measure (Mean): 91.1; J&F: 87.9; Jaccard (Mean): 84.6; Params(M): 36.4 |
| visual-object-tracking-on-davis-2017 | Cutie+ (base) | F-measure (Mean): 93.4; J&F: 90.5; Jaccard (Mean): 87.5; Params(M): 17.9 |
| visual-object-tracking-on-didi | Cutie | Tracking quality: 0.575 |
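
For readers unfamiliar with the metrics in the table, here is a small sketch of how the combined score relates to its parts. J (Jaccard) is the intersection over union of the predicted and ground-truth masks, F is a boundary accuracy measure, and the combined score is the mean of J and F. The helper names `jaccard` and `j_and_f` are hypothetical and this is not the official evaluation code. The boundary measure F is taken as given rather than computed, and treating two empty masks as a perfect match is an assumption.

```python
def jaccard(pred, gt):
    """Region similarity J: intersection over union of two binary masks.

    pred, gt: equally sized 2D lists of 0/1 values.
    """
    pairs = [(p, g) for row_p, row_g in zip(pred, gt) for p, g in zip(row_p, row_g)]
    intersection = sum(1 for p, g in pairs if p and g)
    union = sum(1 for p, g in pairs if p or g)
    # Assumption: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


def j_and_f(j, f):
    """Combined score: the arithmetic mean of J and F."""
    return (j + f) / 2.0


# Toy masks: one of the two predicted pixels overlaps the ground truth.
iou = jaccard([[1, 1], [0, 0]], [[1, 0], [0, 0]])

# Values from the "Cutie (small, MEGA)" row: J = 64.3, F = 72.9.
combined = j_and_f(64.3, 72.9)
```

With the values from that row, the mean of 64.3 and 72.9 is 68.6, which matches the combined score the table reports for it.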