AudioCLIP: Extending CLIP to Image, Text and Audio

Andrey Guzhov, Federico Raue, Jörn Hees, Andreas Dengel

Abstract

In the past, the rapidly evolving field of sound classification greatly benefited from the application of methods from other domains. Today, we observe the trend to fuse domain-specific tasks and approaches together, which provides the community with new outstanding models. In this work, we present an extension of the CLIP model that handles audio in addition to text and images. Our proposed model incorporates the ESResNeXt audio model into the CLIP framework using the AudioSet dataset. Such a combination enables the proposed model to perform bimodal and unimodal classification and querying, while keeping CLIP's ability to generalize to unseen datasets in a zero-shot inference fashion. AudioCLIP achieves new state-of-the-art results in the Environmental Sound Classification (ESC) task, outperforming other approaches by reaching accuracies of 90.07% on the UrbanSound8K and 97.15% on the ESC-50 datasets. Further, it sets new baselines in the zero-shot ESC task on the same datasets (68.78% and 69.40%, respectively). Finally, we also assess the cross-modal querying performance of the proposed model as well as the influence of full and partial training on the results. For the sake of reproducibility, our code is published.
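
To make the zero-shot inference mentioned in the abstract more concrete, here is a minimal sketch of how classification can work in a shared embedding space: each audio clip is assigned the label whose text embedding is most similar to the clip's embedding. This is not AudioCLIP's actual API. The function names, the `temperature` parameter, and the toy embedding vectors are all assumptions made for illustration; in practice the embeddings would come from the trained audio and text encoders.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def zero_shot_classify(audio_emb, text_emb, temperature=1.0):
    """Pick, for each audio embedding, the most similar label embedding.

    audio_emb: (n_clips, dim) array, hypothetical output of an audio encoder.
    text_emb:  (n_labels, dim) array, hypothetical output of a text encoder.
    Returns predicted label indices and a softmax over the labels.
    """
    a = l2_normalize(np.asarray(audio_emb, dtype=np.float64))
    t = l2_normalize(np.asarray(text_emb, dtype=np.float64))
    logits = (a @ t.T) / temperature               # cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    return probs.argmax(axis=1), probs


# Toy stand-ins for encoder outputs (assumed values, not real model output).
labels = ["dog bark", "siren", "rain"]
text_emb = np.eye(3, 4)
audio_emb = np.array([[0.1, 0.9, 0.0, 0.2],
                      [0.8, 0.1, 0.1, 0.0]])

pred, probs = zero_shot_classify(audio_emb, text_emb)
print([labels[i] for i in pred])
```

The same similarity matrix also supports querying in the other direction: taking the argmax over clips instead of over labels returns the audio clip that best matches a given text query.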

Code Repositories

AndreyGuzhov/AudioCLIP (official; PyTorch; mentioned in GitHub)
asteroid-team/torch-audiomentations (PyTorch; mentioned in GitHub)
julirao/whisper_audio_classification (PyTorch; mentioned in GitHub)
iver56/audiomentations (PyTorch; mentioned in GitHub)

Benchmarks

Benchmark | Methodology | Metrics
environmental-sound-classification-on | AudioCLIP | Accuracy: 90.07
environmental-sound-classification-on-esc-50 | AudioCLIP | Accuracy: 97.15
