Temporal and cross-modal attention for audio-visual zero-shot learning
Otniel-Bogdan Mercea, Thomas Hummel, A. Sophia Koepke, Zeynep Akata

Abstract
Audio-visual generalised zero-shot learning for video classification requires understanding the relations between the audio and visual information in order to recognise samples from novel, previously unseen classes at test time. The natural semantic and temporal alignment between audio and visual data in videos can be exploited to learn powerful representations that generalise to unseen classes at test time. We propose a multi-modal and Temporal Cross-attention Framework (TCaF) for audio-visual generalised zero-shot learning. Its inputs are temporally aligned audio and visual features obtained from pre-trained networks. Encouraging the framework to focus on cross-modal correspondence across time, instead of self-attention within each modality, boosts performance significantly. We show that our proposed framework, which ingests temporal features, yields state-of-the-art performance on the UCF-GZSL, VGGSound-GZSL, and ActivityNet-GZSL benchmarks for (generalised) zero-shot learning. Code for reproducing all results is available at https://github.com/ExplainableML/TCAF-GZSL.
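The core idea described above, queries from one modality attending over keys/values from the other, can be illustrated with a minimal NumPy sketch of scaled dot-product cross-attention. This is a simplified illustration, not the paper's actual architecture: it omits the learned projections, multiple heads, positional encodings, and layer stacking that TCaF uses, and all array names and sizes here are made up for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(q_feats, kv_feats):
    """Tokens of one modality (queries) attend over tokens of the
    other modality (keys/values).
    q_feats: (T_q, d), kv_feats: (T_kv, d) -> (T_q, d)."""
    d = q_feats.shape[-1]
    scores = q_feats @ kv_feats.T / np.sqrt(d)  # (T_q, T_kv) similarities
    attn = softmax(scores, axis=-1)             # each query row sums to 1
    return attn @ kv_feats                      # weighted mix of the other modality

# Hypothetical temporally aligned features from pre-trained networks.
rng = np.random.default_rng(0)
audio = rng.standard_normal((8, 64))    # 8 audio time steps, 64-dim features
visual = rng.standard_normal((10, 64))  # 10 visual time steps, 64-dim features

# Each modality is updated with information from the other.
audio_updated = cross_modal_attention(audio, visual)
visual_updated = cross_modal_attention(visual, audio)
```

In contrast, self-attention within a modality would call `cross_modal_attention(audio, audio)`; the paper's finding is that favouring the cross-modal direction across time is what helps generalisation to unseen classes.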
Benchmarks

HM is the harmonic mean of the seen-class and unseen-class accuracies in the generalised zero-shot setting; ZSL is the accuracy on unseen classes only. Both are reported in percent.

| Benchmark | Method | Metrics |
|---|---|---|
| gzsl-video-classification-on-activitynet-gzsl | TCaF | HM: 12.20 ZSL: 7.96 |
| gzsl-video-classification-on-activitynet-gzsl-1 | TCaF | HM: 10.71 ZSL: 7.91 |
| gzsl-video-classification-on-ucf-gzsl-cls | TCaF | HM: 50.78 ZSL: 44.64 |
| gzsl-video-classification-on-ucf-gzsl-main | TCaF | HM: 31.72 ZSL: 24.81 |
| gzsl-video-classification-on-vggsound-gzsl | TCaF | HM: 8.77 ZSL: 7.41 |
| gzsl-video-classification-on-vggsound-gzsl-1 | TCaF | HM: 7.33 ZSL: 6.06 |