EmoNet-Voice: A Fine-Grained, Expert-Verified Benchmark for Speech Emotion Detection
Christoph Schuhmann, Robert Kaczmarczyk, Gollam Rabby, Felix Friedrich, Maurice Kraus, Kourosh Nadi, Huu Nguyen, Kristian Kersting, Sören Auer

Abstract
The advancement of text-to-speech and audio generation models necessitates robust benchmarks for evaluating the emotional understanding capabilities of AI systems. Current speech emotion recognition (SER) datasets often exhibit limitations in emotional granularity, privacy concerns, or reliance on acted portrayals. This paper introduces EmoNet-Voice, a new resource for speech emotion detection, which includes EmoNet-Voice Big, a large-scale pre-training dataset (featuring over 4,500 hours of speech across 11 voices, 40 emotions, and 4 languages), and EmoNet-Voice Bench, a novel benchmark dataset with human expert annotations. EmoNet-Voice is designed to evaluate SER models on a fine-grained spectrum of 40 emotion categories with different levels of intensity. Leveraging state-of-the-art voice generation, we curated synthetic audio snippets simulating actors portraying scenes designed to evoke specific emotions. Crucially, we conducted rigorous validation by psychology experts who assigned perceived intensity labels. This synthetic, privacy-preserving approach allows for the inclusion of sensitive emotional states often absent in existing datasets. Lastly, we introduce Empathic Insight Voice models that set a new standard in speech emotion recognition with high agreement with human experts. Our evaluations across the current model landscape yield valuable findings, such as high-arousal emotions like anger being much easier to detect than low-arousal states like concentration.
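To make the shape of the resource concrete, below is a minimal sketch of how one might load and inspect a speech emotion benchmark of this kind with the Hugging Face `datasets` library. The dataset identifier, split name, and field names ("audio", "emotion", "intensity") are assumptions for illustration only, not the schema published by the authors.

```python
# Hypothetical sketch of inspecting a fine-grained SER benchmark.
# The repository id "laion/EmoNet-Voice-Bench" and the column names below
# are assumed for illustration and may differ from the released dataset.
from collections import Counter

from datasets import load_dataset

# Load the (assumed) benchmark split.
bench = load_dataset("laion/EmoNet-Voice-Bench", split="test")

# Count how often each fine-grained emotion label occurs.
label_counts = Counter(example["emotion"] for example in bench)
for emotion, count in label_counts.most_common(10):
    print(f"{emotion:>20s}: {count}")

# Inspect one sample: decoded audio array, sampling rate, and expert intensity label.
sample = bench[0]
print(sample["audio"]["array"].shape, sample["audio"]["sampling_rate"])
print(sample["emotion"], sample["intensity"])
```

A per-emotion breakdown like this is what allows comparisons such as the one reported above, e.g. contrasting detection performance on high-arousal emotions (anger) against low-arousal states (concentration).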