EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition
Kazakos Evangelos ; Nagrani Arsha ; Zisserman Andrew ; Damen Dima

Abstract
We focus on multi-modal fusion for egocentric action recognition, and propose a novel architecture for multi-modal temporal binding, i.e. the combination of modalities within a range of temporal offsets. We train the architecture with three modalities (RGB, Flow and Audio) and combine them with mid-level fusion alongside sparse temporal sampling of fused representations. In contrast with previous works, modalities are fused before temporal aggregation, with shared modality and fusion weights over time. Our proposed architecture is trained end-to-end, outperforming individual modalities as well as late fusion of modalities. We demonstrate the importance of audio in egocentric vision, on a per-class basis, for identifying actions as well as interacting objects. Our method achieves state-of-the-art results on both the seen and unseen test sets of the largest egocentric dataset, EPIC-Kitchens, on all metrics using the public leaderboard.
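The fusion order described in the abstract (fuse the modalities within each sampled segment first, then aggregate over time, reusing the same weights for every segment) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the feature sizes, the single ReLU fusion layer, the averaging of per-segment scores, and all names (`fuse_segment`, `temporal_binding_forward`, `W_fuse`, `W_cls`) are assumptions made for illustration, and random vectors stand in for the per-modality network features.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes below are illustrative assumptions, not values from the paper.
NUM_SEGMENTS = 3   # sparse temporal samples per action clip
FEAT_DIM = 8       # per-modality feature size
FUSED_DIM = 16     # size of the fused mid-level representation
NUM_CLASSES = 5

# A single fusion layer and a single classifier, shared by every segment.
W_fuse = rng.standard_normal((3 * FEAT_DIM, FUSED_DIM)) * 0.1
b_fuse = np.zeros(FUSED_DIM)
W_cls = rng.standard_normal((FUSED_DIM, NUM_CLASSES)) * 0.1


def fuse_segment(rgb, flow, audio):
    """Mid-level fusion for one segment: concatenate, then a shared ReLU layer."""
    x = np.concatenate([rgb, flow, audio])
    return np.maximum(x @ W_fuse + b_fuse, 0.0)


def temporal_binding_forward(rgb_feats, flow_feats, audio_feats):
    """Fuse modalities per segment first, then aggregate over segments.

    Each input has shape (NUM_SEGMENTS, FEAT_DIM). Averaging the per-segment
    class scores is an assumed aggregation choice for this sketch.
    """
    fused = np.stack([
        fuse_segment(r, f, a)
        for r, f, a in zip(rgb_feats, flow_feats, audio_feats)
    ])
    segment_scores = fused @ W_cls        # (NUM_SEGMENTS, NUM_CLASSES)
    return segment_scores.mean(axis=0)    # (NUM_CLASSES,)


# Random stand-ins for features extracted from each modality.
rgb = rng.standard_normal((NUM_SEGMENTS, FEAT_DIM))
flow = rng.standard_normal((NUM_SEGMENTS, FEAT_DIM))
audio = rng.standard_normal((NUM_SEGMENTS, FEAT_DIM))

scores = temporal_binding_forward(rgb, flow, audio)
print(scores.shape)
```

A late-fusion baseline would instead aggregate each modality over time on its own and only combine the resulting per-modality scores at the end; in the sketch above, the fused representation of each segment sees all three modalities before any temporal aggregation takes place.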
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| egocentric-activity-recognition-on-epic-1 | TBN | Actions Top-1 (S1): 34.8; Actions Top-1 (S2): 19.06 |