
Abstract
Convolution is arguably the most important feature transform in modern neural networks and has powered much of the progress in deep learning. In recent years, however, the rise of Transformer networks, which replace convolution layers with self-attention blocks, has exposed the limitations of stationary convolution kernels and opened a new era of dynamic feature transforms. Yet existing dynamic transforms, including self-attention, remain inadequate for video understanding, where spatio-temporal correspondences, i.e., motion information, are crucial for effective representation. To address this, this paper introduces a relational feature transform, dubbed Relational Self-Attention (RSA), which exploits the rich spatio-temporal relational structure of video by dynamically generating relational kernels and aggregating relational context. Experiments and ablation studies show that the RSA network substantially outperforms convolutional and self-attention counterparts, achieving state-of-the-art results on standard motion-centric action recognition benchmarks such as Something-Something V1 & V2, Diving48, and FineGym.
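To make "dynamically generating relational kernels and aggregating relational context" concrete, below is a minimal, simplified PyTorch sketch of a relational self-attention-style layer. Everything here (the `SimpleRSA` name, the local-window gathering, and the specific forms of the basic/relational kernels and contexts) is an illustrative assumption rather than the paper's exact formulation; the official implementation lives in the KimManjin/RSA repository.

```python
# A minimal, simplified sketch of a relational self-attention-style layer.
# All names and the exact kernel/context forms are illustrative assumptions,
# not the authors' code; see KimManjin/RSA for the official implementation.
import torch
import torch.nn as nn


class SimpleRSA(nn.Module):
    """Toy relational self-attention over a local window of a video volume.

    Input:  x of shape (B, N, C), where N is the number of positions in a
            flattened (T*H*W) feature volume and C is the channel dimension.
    Output: same shape, with each position transformed by a kernel that is
            generated dynamically from its content and neighborhood.
    """

    def __init__(self, dim: int, window: int = 9):
        super().__init__()
        self.window = window                       # local context size (assumed)
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        # Maps each query to a dynamic (relational) kernel over its window.
        self.rel_kernel = nn.Linear(dim, window, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        q = self.to_q(x)                           # (B, N, C) queries
        v = self.to_v(x)                           # (B, N, C) values

        # Gather a local context of `window` positions per query by indexing
        # along the flattened axis (circular wrap-around keeps shapes simple).
        idx = (torch.arange(N).unsqueeze(1) +
               torch.arange(self.window).unsqueeze(0) - self.window // 2) % N
        ctx = v[:, idx]                            # (B, N, window, C)

        # Basic kernel: query-content matching, as in ordinary attention.
        basic = torch.einsum('bnc,bnwc->bnw', q, ctx) / C ** 0.5

        # Relational kernel: generated from the query itself, so the kernel
        # depends on content instead of being a fixed convolution weight.
        relational = self.rel_kernel(q)            # (B, N, window)

        kernel = torch.softmax(basic + relational, dim=-1)

        # Relational context (toy stand-in): augment each neighbor with its
        # relation to the neighborhood (mean-centered offset) before pooling.
        rel_ctx = ctx + (ctx - ctx.mean(dim=2, keepdim=True))

        out = torch.einsum('bnw,bnwc->bnc', kernel, rel_ctx)
        return self.proj(out)


# Usage: one layer applied to a flattened 8-frame, 14x14 feature map.
if __name__ == "__main__":
    feats = torch.randn(2, 8 * 14 * 14, 64)       # (batch, T*H*W, channels)
    layer = SimpleRSA(dim=64)
    print(layer(feats).shape)                     # torch.Size([2, 1568, 64])
```

The key contrast with convolution is that `kernel` here is recomputed from the input at every position rather than being a fixed learned weight, which is what lets such a transform adapt to motion cues in the video.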
Code Repository
KimManjin/RSA (official, PyTorch)
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| action-recognition-in-videos-on-something | RSANet-R50 (8 frames, ImageNet pretrained, a single clip) | Top-1 Accuracy: 64.8, Top-5 Accuracy: 89.1 |
| action-recognition-in-videos-on-something | RSANet-R50 (16 frames, ImageNet pretrained, a single clip) | Top-1 Accuracy: 66.0, Top-5 Accuracy: 89.8 |
| action-recognition-in-videos-on-something | RSANet-R50 (8+16 frames, ImageNet pretrained, a single clip) | Top-1 Accuracy: 67.3, Top-5 Accuracy: 90.8 |
| action-recognition-in-videos-on-something | RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips) | Top-1 Accuracy: 67.7, Top-5 Accuracy: 91.1 |
| action-recognition-in-videos-on-something-1 | RSANet-R50 (8 frames, ImageNet pretrained, a single clip) | Top-1 Accuracy: 51.9, Top-5 Accuracy: 79.6 |
| action-recognition-in-videos-on-something-1 | RSANet-R50 (16 frames, ImageNet pretrained, a single clip) | Top-1 Accuracy: 54.0, Top-5 Accuracy: 81.1 |
| action-recognition-in-videos-on-something-1 | RSANet-R50 (8+16 frames, ImageNet pretrained, a single clip) | Top-1 Accuracy: 55.5, Top-5 Accuracy: 82.6 |
| action-recognition-in-videos-on-something-1 | RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips) | Top-1 Accuracy: 56.1, Top-5 Accuracy: 82.8 |
| action-recognition-on-diving-48 | RSANet-R50 (16 frames, ImageNet pretrained, a single clip) | Accuracy: 84.2 |