
Abstract
This paper studies the referring video object segmentation (RVOS) task by strengthening video-level visual-linguistic alignment. Existing methods typically model RVOS as a sequence prediction problem, performing multi-modal interaction and segmentation independently for each frame. However, lacking a holistic view of the video content, these methods struggle to effectively exploit inter-frame relationships and to understand textual descriptions of how the referred object changes over time. To address this, we propose Semantic-assisted Object Cluster (SOC), which aggregates video content and textual guidance for unified temporal modeling and cross-modal alignment. By associating a group of frame-level object embeddings with language tokens, SOC facilitates joint representation learning across modalities and time steps. Furthermore, we design multi-modal contrastive supervision to help construct a well-aligned joint representation space at the video level. We conduct extensive experiments on popular RVOS benchmarks, and our method outperforms state-of-the-art competitors on all of them by a notable margin. Moreover, the emphasis on temporal coherence markedly improves the model's segmentation stability and adaptability when processing text descriptions with temporal variations. Code will be released publicly.
Code Repository
RobertLuo1/NeurIPS2023_SOC — official PyTorch implementation
Benchmarks
| Benchmark | Method | AP | IoU (mean) | IoU (overall) | P@0.5 | P@0.6 | P@0.7 | P@0.8 | P@0.9 |
|---|---|---|---|---|---|---|---|---|---|
| referring-expression-segmentation-on-a2d | SOC (Video-Swin-B) | 0.573 | 0.725 | 0.807 | 0.851 | 0.827 | 0.765 | 0.607 | 0.252 |
| referring-expression-segmentation-on-a2d | SOC (Video-Swin-T) | 0.504 | 0.669 | 0.747 | 0.790 | 0.756 | 0.687 | 0.535 | 0.195 |
| referring-expression-segmentation-on-j-hmdb | SOC (Video-Swin-B) | 0.446 | 0.723 | 0.736 | 0.969 | 0.914 | 0.711 | 0.213 | 0.001 |
| referring-expression-segmentation-on-j-hmdb | SOC (Video-Swin-T) | 0.397 | 0.701 | 0.707 | 0.947 | 0.864 | 0.627 | 0.179 | 0.001 |

| Benchmark | Method | J | F | J&F |
|---|---|---|---|---|
| referring-expression-segmentation-on-refer-1 | SOC (Video-Swin-T) | 57.8 | 60.5 | 59.2 |
| referring-expression-segmentation-on-refer-1 | SOC (joint training, Video-Swin-B) | 65.3 | 69.3 | 67.3±0.5 |
| referring-video-object-segmentation-on-ref | SOC | 62.5 | 69.1 | 65.8 |
| referring-video-object-segmentation-on-refer | SOC | 64.1 | 67.9 | 66.0 |