Jean-Baptiste Alayrac; Adrià Recasens; Rosalia Schneider; Relja Arandjelović; Jason Ramapuram; Jeffrey De Fauw; Lucas Smaira; Sander Dieleman; Andrew Zisserman

Abstract
Videos are a rich source of multimodal supervision. In this work, we learn representations using self-supervision by leveraging three modalities naturally present in videos: visual, audio and language streams. To this end, we introduce the notion of a multimodal versatile network: a network that can ingest multiple modalities and whose representations enable downstream tasks in multiple modalities. In particular, we explore how best to combine the modalities, such that fine-grained representations of the visual and audio modalities can be maintained, whilst also integrating text into a common embedding. Driven by versatility, we also introduce a novel process of deflation, so that the networks can be effortlessly applied to visual data in the form of video or a static image. We demonstrate how such networks, trained on large collections of unlabelled video data, can be applied to video, video-text, image and audio tasks. Equipped with these representations, we obtain state-of-the-art performance on multiple challenging benchmarks including UCF101, HMDB51, Kinetics600, AudioSet and ESC-50 when compared to previous self-supervised work. Our models are publicly available.
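To make the idea of keeping a fine-grained visual-audio space while also sharing an embedding with text more concrete, the following is a minimal sketch of one possible arrangement. The abstract does not specify the architecture, so everything here is an assumption made for illustration: the two-level layout (a "fine" space for video and audio, a "coarse" space that also holds text), the feature dimensions, and all function and variable names are hypothetical, and the projections are random, untrained linear maps rather than learned networks.

```python
import math
import random


def linear(in_dim, out_dim, rng):
    """Random in_dim -> out_dim matrix (hypothetical stand-in for a learned projection head)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]


def project(weights, vec):
    """Apply a linear map stored as a list of rows to a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Assumed (illustrative) feature and embedding sizes.
VIDEO_DIM, AUDIO_DIM, TEXT_DIM = 8, 6, 5
FINE_DIM, COARSE_DIM = 4, 3

rng = random.Random(0)
video_to_fine = linear(VIDEO_DIM, FINE_DIM, rng)
audio_to_fine = linear(AUDIO_DIM, FINE_DIM, rng)
fine_to_coarse = linear(FINE_DIM, COARSE_DIM, rng)
text_to_coarse = linear(TEXT_DIM, COARSE_DIM, rng)


def embed(video_feat, audio_feat, text_feat):
    """Map per-modality features into the assumed fine and coarse spaces."""
    # Video and audio are compared in the higher-dimensional fine space.
    video_fine = project(video_to_fine, video_feat)
    audio_fine = project(audio_to_fine, audio_feat)
    # Text only enters at the coarse level, reached from the fine space.
    return {
        "video_fine": video_fine,
        "audio_fine": audio_fine,
        "video_coarse": project(fine_to_coarse, video_fine),
        "audio_coarse": project(fine_to_coarse, audio_fine),
        "text_coarse": project(text_to_coarse, text_feat),
    }


# Toy inputs standing in for backbone outputs.
video = [rng.gauss(0.0, 1.0) for _ in range(VIDEO_DIM)]
audio = [rng.gauss(0.0, 1.0) for _ in range(AUDIO_DIM)]
text = [rng.gauss(0.0, 1.0) for _ in range(TEXT_DIM)]

out = embed(video, audio, text)
video_audio_similarity = cosine(out["video_fine"], out["audio_fine"])
video_text_similarity = cosine(out["video_coarse"], out["text_coarse"])
```

In a trained system the similarities would be driven by a self-supervised objective over paired streams from the same video; here they are arbitrary because the weights are random. The point of the sketch is only the routing: video and audio can be compared without passing through the lower-dimensional space that text shares.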
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| audio-classification-on-audioset | MMV | Test mAP: 0.309 |
| self-supervised-action-recognition-on | MMV | Top-1 Accuracy: 55.5 |
| self-supervised-action-recognition-on-hmdb51-1 | MMV | Top-1 Accuracy: 70.1 |
| self-supervised-action-recognition-on-ucf101 | MMV TSM-50x2 | 3-fold Accuracy: 95.2; Frozen: false; Pre-Training Dataset: AudioSet + HowTo100M |
| self-supervised-action-recognition-on-ucf101-1 | MMV | 3-fold Accuracy: 91.5 |