
Abstract
We present the High-Resolution Transformer (HRFormer), which learns high-resolution representations for dense prediction tasks, in contrast to the original Vision Transformer, which produces low-resolution representations at high memory and computational cost. We adopt the multi-resolution parallel design introduced in the High-Resolution Convolutional Network (HRNet), together with local-window self-attention, which performs self-attention over small non-overlapping image windows to improve memory and computational efficiency. In addition, we introduce a convolution into the feed-forward network (FFN) to exchange information across the disconnected image windows. HRFormer performs strongly on both human pose estimation and semantic segmentation; for example, on COCO pose estimation it outperforms the Swin Transformer by 1.3 AP while using 50% fewer parameters and 30% fewer FLOPs. The code is available at https://github.com/HRNet/HRFormer.
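To make the two ideas in the abstract concrete, here is a minimal PyTorch sketch, not the official implementation: self-attention restricted to non-overlapping windows, followed by an FFN whose 3x3 depthwise convolution lets neighboring windows exchange information. The class names (`LocalWindowAttention`, `ConvFFN`, `HRFormerBlockSketch`) are illustrative, and details such as layer normalization, relative position bias, padding for non-divisible sizes, and the multi-resolution fusion are omitted.

```python
import torch
import torch.nn as nn


class LocalWindowAttention(nn.Module):
    """Multi-head self-attention restricted to non-overlapping windows."""

    def __init__(self, dim, num_heads=4, window_size=7):
        super().__init__()
        self.window_size = window_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):  # x: (B, C, H, W); H, W divisible by window_size
        B, C, H, W = x.shape
        ws = self.window_size
        # Partition the feature map into (H/ws * W/ws) windows of ws*ws tokens.
        x = x.view(B, C, H // ws, ws, W // ws, ws)
        x = x.permute(0, 2, 4, 3, 5, 1).reshape(-1, ws * ws, C)
        # Attention runs independently inside each window, so cost scales
        # with window area rather than full image area.
        x, _ = self.attn(x, x, x, need_weights=False)
        # Reverse the window partition back to (B, C, H, W).
        x = x.view(B, H // ws, W // ws, ws, ws, C)
        return x.permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)


class ConvFFN(nn.Module):
    """FFN with a 3x3 depthwise conv to pass information across windows."""

    def __init__(self, dim, expansion=4):
        super().__init__()
        hidden = dim * expansion
        self.net = nn.Sequential(
            nn.Conv2d(dim, hidden, 1),
            nn.GELU(),
            # The depthwise 3x3 mixes each pixel with a neighborhood that
            # straddles window borders, which window attention alone cannot.
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden),
            nn.GELU(),
            nn.Conv2d(hidden, dim, 1),
        )

    def forward(self, x):
        return self.net(x)


class HRFormerBlockSketch(nn.Module):
    """One residual block combining the two components above."""

    def __init__(self, dim, num_heads=4, window_size=7):
        super().__init__()
        self.attn = LocalWindowAttention(dim, num_heads, window_size)
        self.ffn = ConvFFN(dim)

    def forward(self, x):
        x = x + self.attn(x)
        return x + self.ffn(x)


if __name__ == "__main__":
    block = HRFormerBlockSketch(dim=32)
    out = block(torch.randn(1, 32, 28, 28))  # 28 is divisible by ws=7
    print(out.shape)  # torch.Size([1, 32, 28, 28])
```

In the full model this block is replicated on each branch of HRNet's multi-resolution parallel structure, with cross-resolution fusion between stages.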
Code Repositories
HRNet/HRFormer
Official
pytorch
Mentioned in GitHub
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| image-classification-on-imagenet | HRFormer-B | GFLOPs: 13.7; Number of params: 50.3M; Top-1 Accuracy: 82.8% |
| image-classification-on-imagenet | HRFormer-T | GFLOPs: 1.8; Number of params: 8.0M; Top-1 Accuracy: 78.5% |
| multi-person-pose-estimation-on-crowdpose | HRFormer-B | AP Easy: 80.0; AP Hard: 62.4; AP Medium: 73.5; mAP @0.5:0.95: 72.4 |
| multi-person-pose-estimation-on-ochuman | HRFormer-B | AP50: 81.4; AP75: 67.1; Validation AP: 62.1 |
| pose-estimation-on-aic | HRFormer-S | AP: 31.6; AP75: 20.9; AR: 35.8; AR50: 78.0 |
| pose-estimation-on-aic | HRFormer-B | AP: 34.4; AP50: 78.3; AP75: 24.8; AR: 38.7; AR50: 80.9 |
| pose-estimation-on-coco-test-dev | HRFormer-B | AP: 76.2; AP50: 92.7; AP75: 83.8; APL: 82.3; APM: 72.5; AR: 81.2 |