Temporal and cross-modal attention for audio-visual zero-shot learning

Otniel-Bogdan Mercea, Thomas Hummel, A. Sophia Koepke, Zeynep Akata


Abstract

Audio-visual generalised zero-shot learning for video classification requires understanding the relations between the audio and visual information in order to recognise samples from novel, previously unseen classes at test time. The natural semantic and temporal alignment between audio and visual data in videos can be exploited to learn powerful representations that generalise to unseen classes at test time. We propose a multi-modal and Temporal Cross-attention Framework (TCaF) for audio-visual generalised zero-shot learning. Its inputs are temporally aligned audio and visual features that are obtained from pre-trained networks. Encouraging the framework to focus on cross-modal correspondence across time instead of self-attention within the modalities boosts the performance significantly. We show that our proposed framework, which ingests temporal features, yields state-of-the-art performance on the UCF-GZSL, VGGSound-GZSL, and ActivityNet-GZSL benchmarks for (generalised) zero-shot learning. Code for reproducing all results is available at https://github.com/ExplainableML/TCAF-GZSL.
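To make the core idea concrete, the sketch below shows cross-modal attention between temporally aligned audio and visual feature sequences, where each modality's queries attend to the other modality's keys and values rather than to themselves. This is a minimal illustrative example, not the authors' implementation (see the official repository linked below); the feature dimension, head count, and layer composition are assumptions.

```python
# Minimal sketch of cross-modal attention between temporally aligned
# audio and visual features, as described in the abstract. Hypothetical
# sizes; the official code is at https://github.com/ExplainableML/TCAF-GZSL.
import torch
import torch.nn as nn


class CrossModalAttentionBlock(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # Audio queries attend to visual keys/values, and vice versa,
        # instead of self-attention within each modality.
        self.audio_to_visual = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.visual_to_audio = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(d_model)
        self.norm_v = nn.LayerNorm(d_model)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        # audio, visual: (batch, time, d_model) features from pre-trained
        # audio/visual networks, aligned along the time axis.
        a_attn, _ = self.audio_to_visual(query=audio, key=visual, value=visual)
        v_attn, _ = self.visual_to_audio(query=visual, key=audio, value=audio)
        # Residual connections preserve each modality's own information.
        return self.norm_a(audio + a_attn), self.norm_v(visual + v_attn)


# Usage with dummy aligned features (batch=2, T=10 time steps):
audio_feats = torch.randn(2, 10, 512)
visual_feats = torch.randn(2, 10, 512)
block = CrossModalAttentionBlock()
audio_out, visual_out = block(audio_feats, visual_feats)
print(audio_out.shape, visual_out.shape)  # torch.Size([2, 10, 512]) each
```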

Code Repositories

explainableml/avdiff-gfsl (PyTorch)
explainableml/tcaf-gzsl (official implementation, PyTorch)
