HyperAI
A vector quantized masked autoencoder for speech emotion recognition

Samir Sadok, Simon Leglaive, Renaud Séguier

Abstract

Recent years have seen remarkable progress in speech emotion recognition (SER), thanks to advances in deep learning techniques. However, the limited availability of labeled data remains a significant challenge in the field. Self-supervised learning has recently emerged as a promising solution to address this challenge. In this paper, we propose the vector quantized masked autoencoder for speech (VQ-MAE-S), a self-supervised model that is fine-tuned to recognize emotions from speech signals. The VQ-MAE-S model is based on a masked autoencoder (MAE) that operates in the discrete latent space of a vector-quantized variational autoencoder. Experimental results show that the proposed VQ-MAE-S model, pre-trained on the VoxCeleb2 dataset and fine-tuned on emotional speech data, outperforms an MAE working on the raw spectrogram representation and other state-of-the-art methods in SER.
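To make the idea concrete, the following is a minimal, hypothetical PyTorch sketch of masked autoencoding over discrete VQ-VAE tokens, the core mechanism the abstract describes: speech is first quantized into codebook indices, a fraction of those tokens is masked, and a Transformer is trained to recover the original indices at the masked positions. All class names, dimensions, and the masking ratio below are illustrative assumptions, not the authors' implementation (see the official repository for that).

```python
import torch
import torch.nn as nn

class MaskedTokenAutoencoder(nn.Module):
    """Illustrative MAE over discrete tokens (assumed sizes, not the paper's)."""

    def __init__(self, codebook_size=512, dim=256, num_layers=4, mask_ratio=0.5):
        super().__init__()
        self.mask_ratio = mask_ratio
        # Embed codebook indices produced by a (pre-trained) VQ-VAE encoder.
        self.embed = nn.Embedding(codebook_size, dim)
        # Learnable embedding that replaces masked positions.
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        # Classification head: predict the original codebook index per position.
        self.head = nn.Linear(dim, codebook_size)

    def forward(self, tokens):
        # tokens: (batch, seq) of integer codebook indices
        x = self.embed(tokens)
        # Randomly mask a fraction of positions and swap in the mask token.
        mask = torch.rand(tokens.shape, device=tokens.device) < self.mask_ratio
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        logits = self.head(self.encoder(x))
        # Cross-entropy reconstruction loss, computed on masked positions only.
        loss = nn.functional.cross_entropy(logits[mask], tokens[mask])
        return loss, logits
```

After self-supervised pre-training, the decoder head would be discarded and the encoder fine-tuned with an emotion classifier on labeled speech, which is the transfer step the paper evaluates.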

Code Repositories

samsad35/VQ-MAE-S-code (official, PyTorch)

Benchmarks

Benchmark                                Methodology                        Accuracy   F1
Speech emotion recognition on EmoDB      VQ-MAE-S-12 (Frame) + Query2Emo    90.2       0.891
Speech emotion recognition on RAVDESS    VQ-MAE-S-12 (Frame) + Query2Emo    84.1       0.844
