Andrey Guzhov; Federico Raue; Jörn Hees; Andreas Dengel

Abstract
In the past, the rapidly evolving field of sound classification greatly benefited from the application of methods from other domains. Today, we observe the trend to fuse domain-specific tasks and approaches together, which provides the community with new outstanding models. In this work, we present an extension of the CLIP model that handles audio in addition to text and images. Our proposed model incorporates the ESResNeXt audio model into the CLIP framework using the AudioSet dataset. Such a combination enables the proposed model to perform bimodal and unimodal classification and querying, while keeping CLIP's ability to generalize to unseen datasets in a zero-shot inference fashion. AudioCLIP achieves new state-of-the-art results in the Environmental Sound Classification (ESC) task, outperforming other approaches by reaching accuracies of 90.07% on the UrbanSound8K and 97.15% on the ESC-50 datasets. Further, it sets new baselines in the zero-shot ESC task on the same datasets (68.78% and 69.40%, respectively). Finally, we also assess the cross-modal querying performance of the proposed model as well as the influence of full and partial training on the results. For the sake of reproducibility, our code is published.
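The zero-shot inference described above can be illustrated with a minimal sketch: an audio clip and a set of textual class prompts are mapped into CLIP's shared embedding space, and the predicted class is the prompt whose embedding has the highest cosine similarity to the audio embedding. Since loading the actual AudioCLIP checkpoint is out of scope here, random vectors stand in for the encoder outputs; the class names, embedding dimensionality, and encoder interface are all hypothetical.

```python
import numpy as np

# Hypothetical stand-ins for AudioCLIP's encoder outputs: in the real model,
# text prompts go through the CLIP text encoder and the waveform through the
# ESResNeXt audio head, both projected into one joint embedding space.
rng = np.random.default_rng(0)
embed_dim = 1024  # joint embedding dimensionality (assumption)

class_prompts = ["dog bark", "siren", "rain", "jackhammer"]
text_embeddings = rng.normal(size=(len(class_prompts), embed_dim))
audio_embedding = rng.normal(size=embed_dim)

# L2-normalize so that a plain dot product equals cosine similarity.
text_embeddings /= np.linalg.norm(text_embeddings, axis=1, keepdims=True)
audio_embedding /= np.linalg.norm(audio_embedding)

# Score every class prompt against the audio clip; the argmax is the
# zero-shot prediction -- no dataset-specific training involved.
similarities = text_embeddings @ audio_embedding
predicted = class_prompts[int(np.argmax(similarities))]
print(predicted)
```

The same similarity matrix also supports the cross-modal querying the paper evaluates: ranking audio clips by similarity to a text query, or vice versa, simply reads the scores along the other axis.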
Benchmarks
| Benchmark | Methodology | Accuracy (%) |
|---|---|---|
| environmental-sound-classification-on | AudioCLIP | 90.07 |
| environmental-sound-classification-on-esc-50 | AudioCLIP | 97.15 |