5 months ago

Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Synchronicity

Pritam Sarkar; Ali Etemad

Abstract

We present CrissCross, a self-supervised framework for learning audio-visual representations. A novel notion is introduced in our framework whereby, in addition to learning the intra-modal and standard 'synchronous' cross-modal relations, CrissCross also learns 'asynchronous' cross-modal relationships. We perform in-depth studies showing that by relaxing the temporal synchronicity between the audio and visual modalities, the network learns strong generalized representations useful for a variety of downstream tasks. To pretrain our proposed solution, we use 3 different datasets with varying sizes: Kinetics-Sound, Kinetics400, and AudioSet. The learned representations are evaluated on a number of downstream tasks, namely action recognition, sound classification, and action retrieval. Our experiments show that CrissCross either outperforms or achieves performances on par with the current state-of-the-art self-supervised methods on action recognition and action retrieval with UCF101 and HMDB51, as well as sound classification with ESC50 and DCASE. Moreover, CrissCross outperforms fully-supervised pretraining while pretrained on Kinetics-Sound. The codes and pretrained models are available on the project website.
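
The three kinds of relations named in the abstract can be illustrated with a minimal Python sketch. It assumes two temporal segments are drawn from the same video, each yielding one visual and one audio view; the function names, the `max_gap` parameter, and the uniform sampling scheme are hypothetical choices made for illustration, not the authors' implementation, and the abstract does not specify the sampling strategy or the training objective.

```python
import random


def sample_segment_starts(duration, clip_len, max_gap, rng):
    """Pick start times (in seconds) for two clips from one video.

    Hypothetical helper: max_gap bounds how far apart the two clips may
    start. max_gap=0 forces both clips to share one start time (fully
    synchronous); larger values relax the temporal synchronicity.
    """
    if clip_len > duration:
        raise ValueError("clip_len must not exceed duration")
    last_start = duration - clip_len
    t1 = rng.uniform(0.0, last_start)
    lo = max(0.0, t1 - max_gap)
    hi = min(last_start, t1 + max_gap)
    t2 = rng.uniform(lo, hi)
    return t1, t2


def relation_pairs():
    """Group all pairs of (modality, segment) views by relation type."""
    views = [(m, s) for m in ("video", "audio") for s in (1, 2)]
    groups = {
        "intra_modal": [],        # same modality, different segments
        "sync_cross_modal": [],   # different modalities, same segment
        "async_cross_modal": [],  # different modalities, different segments
    }
    for i, a in enumerate(views):
        for b in views[i + 1:]:
            if a[0] == b[0]:
                groups["intra_modal"].append((a, b))
            elif a[1] == b[1]:
                groups["sync_cross_modal"].append((a, b))
            else:
                groups["async_cross_modal"].append((a, b))
    return groups


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_segment_starts(10.0, 2.0, 3.0, rng))
    for name, pairs in relation_pairs().items():
        print(name, pairs)
```

Under these assumptions, each relation type covers two of the six possible view pairs. A training objective would then compare the embeddings of each pair, with the asynchronous pairs being the ones a strictly synchronized method would leave out.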

Code Repositories

pritamqu/CrissCross (official, PyTorch; mentioned in GitHub)

Benchmarks

Benchmark                                       Methodology                   Pre-training dataset  Frozen  Metric
audio-classification-on-dcase                   CrissCross (Kinetics-400)     Kinetics-400          -       Top-1 Accuracy: 96
audio-classification-on-dcase                   CrissCross (AudioSet)         AudioSet              -       Top-1 Accuracy: 97
audio-classification-on-dcase                   CrissCross (Kinetics-Sound)   Kinetics-Sound        -       Top-1 Accuracy: 93
self-supervised-action-recognition-on-hmdb51    CrissCross (AudioSet)         AudioSet              false   Top-1 Accuracy: 66.8
self-supervised-action-recognition-on-hmdb51    CrissCross (Kinetics400)      Kinetics400           false   Top-1 Accuracy: 64.7
self-supervised-action-recognition-on-hmdb51    CrissCross (Kinetics-Sound)   Kinetics-Sound        false   Top-1 Accuracy: 60.5
self-supervised-action-recognition-on-ucf101    CrissCross (Kinetics400)      Kinetics400           false   3-fold Accuracy: 91.5
self-supervised-action-recognition-on-ucf101    CrissCross (Kinetics-Sound)   Kinetics-Sound        false   3-fold Accuracy: 88.3
self-supervised-action-recognition-on-ucf101    CrissCross (AudioSet)         AudioSet              false   3-fold Accuracy: 92.4
