Manjin Kim, Paul Hongsuck Seo, Cordelia Schmid, Minsu Cho

Abstract
We introduce a new attention mechanism, dubbed structural self-attention (StructSA), that leverages rich correlation patterns naturally emerging in key-query interactions of attention. StructSA generates attention maps by recognizing space-time structures of key-query correlations via convolution and uses them to dynamically aggregate local contexts of value features. This effectively leverages rich structural patterns in images and videos such as scene layouts, object motion, and inter-object relations. Using StructSA as a main building block, we develop the structural vision transformer (StructViT) and evaluate its effectiveness on both image and video classification tasks, achieving state-of-the-art results on ImageNet-1K, Kinetics-400, Something-Something V1 & V2, Diving-48, and FineGym.
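The mechanism the abstract describes lends itself to a compact illustration. The sketch below is a minimal, single-head PyTorch rendition of the stated idea: form key-query correlation maps, recognize local structures in each map with a small convolution, and use the resulting attention to aggregate local contexts of value features. Every concrete choice here (the square token grid, kernel size, average pooling as the local value context, and the names `StructSASketch` and `struct_conv`) is an illustrative assumption, not the authors' implementation.

```python
# Minimal single-head sketch of the StructSA idea from the abstract:
# (1) compute key-query correlation maps, (2) detect local correlation
# structures with a small convolution, (3) aggregate local contexts of
# value features with the resulting attention. Shapes, kernel sizes, and
# module names are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StructSASketch(nn.Module):
    def __init__(self, dim, feat_hw=14, kernel=3):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.h = self.w = feat_hw          # assume a square token grid
        self.kernel = kernel
        # Convolution over each query's correlation map, treating the
        # N key positions as an h x w spatial grid (assumption).
        self.struct_conv = nn.Conv2d(1, 1, kernel, padding=kernel // 2)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                  # x: (B, N, C) with N = h * w
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Raw key-query correlation maps, one h x w map per query token.
        corr = (q @ k.transpose(-2, -1)) * C ** -0.5       # (B, N, N)
        corr = corr.reshape(B * N, 1, self.h, self.w)
        # Recognize local correlation structures via convolution.
        struct = self.struct_conv(corr).reshape(B, N, N)
        attn = struct.softmax(dim=-1)
        # Approximate "local contexts of value features" by giving each
        # key position a k x k neighborhood average of values (assumption).
        v_grid = v.transpose(1, 2).reshape(B, C, self.h, self.w)
        v_local = F.avg_pool2d(v_grid, self.kernel, stride=1,
                               padding=self.kernel // 2)
        v_local = v_local.flatten(2).transpose(1, 2)       # (B, N, C)
        return self.proj(attn @ v_local)

x = torch.randn(2, 14 * 14, 64)
print(StructSASketch(64)(x).shape)   # torch.Size([2, 196, 64])
```

Average pooling stands in for the paper's dynamic local aggregation purely to keep the sketch self-contained; the actual StructSA aggregation is richer than this fixed neighborhood mean.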
Benchmarks
| Benchmark | Model | Top-1 Accuracy (%) |
|---|---|---|
| Kinetics-400 (action classification) | StructViT-B-4-1 | 83.4 |
| Something-Something V2 (action recognition in videos) | StructViT-B-4-1 | 71.5 |
| Something-Something V1 (action recognition in videos) | StructViT-B-4-1 | 61.3 |
| Diving-48 (action recognition) | StructViT-B-4-1 | 88.3 |