Relational Self-Attention: What's Missing in Attention for Video Understanding

Manjin Kim, Heeseung Kwon, Chunyu Wang, Suha Kwak, Minsu Cho

Abstract

Convolution has arguably been the most important feature transform for modern neural networks, leading to the advance of deep learning. The recent emergence of Transformer networks, which replace convolution layers with self-attention blocks, has revealed the limitations of stationary convolution kernels and opened the door to the era of dynamic feature transforms. The existing dynamic transforms, including self-attention, however, are all limited for video understanding, where correspondence relations in space and time, i.e., motion information, are crucial for effective representation. In this work, we introduce a relational feature transform, dubbed relational self-attention (RSA), that leverages the rich structure of spatio-temporal relations in videos by dynamically generating relational kernels and aggregating relational contexts. Our experiments and ablation studies show that the RSA network substantially outperforms its convolution and self-attention counterparts, achieving the state of the art on standard motion-centric benchmarks for video action recognition, such as Something-Something-V1 & V2, Diving48, and FineGym.
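The contrast the abstract draws between stationary convolution kernels and dynamic feature transforms can be illustrated with a toy sketch. The code below is not the RSA method and is not taken from the official repository. It only compares a fixed temporal kernel with a content-dependent one, assuming one scalar feature per frame, identity query/key/value projections, and a local window of three frames. The function names `static_conv1d` and `local_self_attention1d` are hypothetical.

```python
import math


def static_conv1d(frames, kernel):
    """Temporal convolution: the same fixed kernel is applied at every frame."""
    half = len(kernel) // 2
    out = []
    for t in range(len(frames)):
        acc = 0.0
        for k, w in enumerate(kernel):
            s = t + k - half
            if 0 <= s < len(frames):  # zero padding at the borders
                acc += w * frames[s]
        out.append(acc)
    return out


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def local_self_attention1d(frames, window=3):
    """Local self-attention: weights are recomputed from the content at each frame.

    Assumption: identity projections, so query, key and value are the raw feature.
    """
    half = window // 2
    out, kernels = [], []
    for t in range(len(frames)):
        idx = [s for s in range(t - half, t + half + 1) if 0 <= s < len(frames)]
        # Query-key similarity decides how much each neighbour contributes.
        weights = softmax([frames[t] * frames[s] for s in idx])
        kernels.append(weights)
        out.append(sum(w * frames[s] for w, s in zip(weights, idx)))
    return out, kernels


frames = [0.0, 1.0, 2.0, 1.0, 0.0]            # made-up per-frame features
conv_out = static_conv1d(frames, [0.25, 0.5, 0.25])
attn_out, attn_kernels = local_self_attention1d(frames)
```

In this sketch the convolution weights never change, while `attn_kernels` holds a different weight vector for each frame. Both transforms still aggregate features only; neither models the correspondence relations that the abstract says RSA adds.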

Code Repositories

KimManjin/RSA (official, PyTorch)

Benchmarks

Benchmark | Methodology | Metrics
action-recognition-in-videos-on-something | RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips) | Top-5 Accuracy: 91.1
action-recognition-in-videos-on-something | RSANet-R50 (8 frames, ImageNet pretrained, a single clip) | Top-1 Accuracy: 64.8; Top-5 Accuracy: 89.1
action-recognition-in-videos-on-something | RSANet-R50 (16 frames, ImageNet pretrained, a single clip) | Top-1 Accuracy: 66; Top-5 Accuracy: 89.8
action-recognition-in-videos-on-something | RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips) | Top-1 Accuracy: 67.7; Top-5 Accuracy: 91.1
action-recognition-in-videos-on-something | RSANet-R50 (8+16 frames, ImageNet pretrained, a single clip) | Top-1 Accuracy: 67.3; Top-5 Accuracy: 90.8
action-recognition-in-videos-on-something-1 | RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips) | Top-1 Accuracy: 56.1; Top-5 Accuracy: 82.8
action-recognition-in-videos-on-something-1 | RSANet-R50 (8 frames, ImageNet pretrained, a single clip) | Top-1 Accuracy: 51.9; Top-5 Accuracy: 79.6
action-recognition-in-videos-on-something-1 | RSANet-R50 (16 frames, ImageNet pretrained, a single clip) | Top-1 Accuracy: 54.0; Top-5 Accuracy: 81.1
action-recognition-in-videos-on-something-1 | RSANet-R50 (8+16 frames, ImageNet pretrained, a single clip) | Top-1 Accuracy: 55.5; Top-5 Accuracy: 82.6
action-recognition-on-diving-48 | RSANet-R50 (16 frames, ImageNet pretrained, a single clip) | Accuracy: 84.2
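
The Top-1 and Top-5 figures above are top-k accuracies: the percentage of test videos whose ground-truth class is among the k highest-scoring predictions. A minimal sketch of that metric follows, assuming per-class scores are already available for each video. The function name `top_k_accuracy` and the toy scores are made up for illustration, and the page does not describe the benchmarks' clip sampling or score averaging, so none is modeled here.

```python
def top_k_accuracy(scores, labels, k):
    """Percentage of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(labels)


# Toy data (hypothetical): three videos, three classes.
toy_scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
]
toy_labels = [1, 1, 0]

top1 = top_k_accuracy(toy_scores, toy_labels, k=1)
top2 = top_k_accuracy(toy_scores, toy_labels, k=2)
```

Top-k accuracy cannot decrease as k grows, which is why every Top-5 value in the listing is at least as large as the matching Top-1 value.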
