
Abstract
In this work, instead of directly predicting pixel-level segmentation masks for referring image segmentation, we formulate the task as sequential polygon generation; the predicted polygons are subsequently converted into the final segmentation masks. This is enabled by a new sequence-to-sequence framework, the Polygon Transformer (PolyFormer), which takes a sequence of image patches and a sequence of text query tokens as input and autoregressively outputs a sequence of polygon vertices. For more accurate geometric localization, we propose a regression-based decoder that directly predicts precise floating-point coordinates, avoiding the error introduced by the coordinate quantization used in prior methods. Experiments show that PolyFormer outperforms prior methods by a clear margin, with absolute improvements of 5.40% and 4.52% on the challenging RefCOCO+ and RefCOCOg datasets, respectively. Without any fine-tuning, it also generalizes well to referring video segmentation, e.g., achieving a competitive 61.5% J&F (the combined region-similarity and contour-accuracy metric) on the Ref-DAVIS17 dataset.
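The final step described above, converting a predicted polygon into a segmentation mask, amounts to rasterizing the floating-point vertex sequence onto the pixel grid. The sketch below is an illustrative scanline (even-odd rule) rasterizer in pure NumPy, not PolyFormer's actual implementation; the function name and sampling at pixel centers are assumptions for the example.

```python
import numpy as np

def polygon_to_mask(vertices, height, width):
    """Rasterize a closed polygon, given as a list of (x, y) float vertices,
    into a binary mask using the even-odd (ray-casting) rule.

    Illustrative only: samples each pixel at its center (x + 0.5, y + 0.5)
    and fills spans between successive edge crossings on each scanline.
    """
    mask = np.zeros((height, width), dtype=np.uint8)
    n = len(vertices)
    for row in range(height):
        y = row + 0.5  # sample the scanline at the pixel-center height
        crossings = []
        for i in range(n):
            x1, y1 = vertices[i]
            x2, y2 = vertices[(i + 1) % n]  # wrap around to close the polygon
            # Does this edge cross the scanline? (half-open test skips
            # horizontal edges and avoids double-counting shared vertices)
            if (y1 <= y < y2) or (y2 <= y < y1):
                crossings.append(x1 + (y - y1) * (x2 - x1) / (y2 - y1))
        crossings.sort()
        # Fill pixels whose centers fall between each pair of crossings.
        for x_start, x_end in zip(crossings[0::2], crossings[1::2]):
            col_start = max(int(np.ceil(x_start - 0.5)), 0)
            col_end = min(int(np.floor(x_end - 0.5)), width - 1)
            mask[row, col_start:col_end + 1] = 1
    return mask

# An axis-aligned square from (1, 1) to (8, 8) in a 10x10 image covers the
# pixels whose centers lie inside it: rows 1..7, columns 1..7.
mask = polygon_to_mask([(1.0, 1.0), (8.0, 1.0), (8.0, 8.0), (1.0, 8.0)], 10, 10)
print(mask.sum())  # → 49
```

Because the vertices are kept as floats end to end, the rasterization step is the only place coordinates meet the pixel grid, which mirrors why the paper's regression decoder avoids quantizing coordinates earlier in the pipeline.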
Code Repositories
amazon-science/polygon-transformer
Official
pytorch
Mentioned in GitHub
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| referring-expression-segmentation-on-davis | PolyFormer-B | J&F (1st frame): 60.9; Zero-Shot Transfer: true |
| referring-expression-segmentation-on-refcoco | PolyFormer-L | Mean IoU: 76.94; Overall IoU: 75.96 |
| referring-expression-segmentation-on-refcoco | PolyFormer-B | Overall IoU: 74.82 |
| referring-expression-segmentation-on-refcoco-3 | PolyFormer-L | Mean IoU: 72.15; Overall IoU: 69.33 |
| referring-expression-segmentation-on-refcoco-3 | PolyFormer-B | Mean IoU: 70.65; Overall IoU: 67.64 |
| referring-expression-segmentation-on-refcoco-4 | PolyFormer-B | Mean IoU: 74.51; Overall IoU: 72.89 |
| referring-expression-segmentation-on-refcoco-4 | PolyFormer-L | Mean IoU: 75.71; Overall IoU: 74.56 |
| referring-expression-segmentation-on-refcoco-5 | PolyFormer-L | Mean IoU: 66.73; Overall IoU: 61.87 |
| referring-expression-segmentation-on-refcoco-5 | PolyFormer-B | Mean IoU: 64.64; Overall IoU: 59.33 |
| referring-expression-segmentation-on-refcocog | PolyFormer-L | Mean IoU: 71.15; Overall IoU: 69.2 |
| referring-expression-segmentation-on-refcocog | PolyFormer-B | Mean IoU: 69.36; Overall IoU: 67.76 |
| referring-expression-segmentation-on-refcocog-1 | PolyFormer-L | Mean IoU: 71.17; Overall IoU: 70.19 |
| referring-expression-segmentation-on-refcocog-1 | PolyFormer-B | Mean IoU: 69.88; Overall IoU: 69.05 |
| referring-expression-segmentation-on-referit | PolyFormer-L | Mean IoU: 67.22; Overall IoU: 72.6 |
| referring-expression-segmentation-on-referit | PolyFormer-B | Mean IoU: 65.98; Overall IoU: 71.91 |