HyperAI


CrowdSpeech and VoxDIY: Benchmark Datasets for Crowdsourced Audio Transcription

Nikita Pavlichenko, Ivan Stelmakh, Dmitry Ustalov

Abstract

Domain-specific data is the crux of the successful transfer of machine learning systems from benchmarks to real life. In simple problems such as image classification, crowdsourcing has become one of the standard tools for cheap and time-efficient data collection, thanks in large part to advances in research on aggregation methods. However, the applicability of crowdsourcing to more complex tasks (e.g., speech recognition) remains limited due to the lack of principled aggregation methods for these modalities. The main obstacle toward designing aggregation methods for more advanced applications is the absence of training data, and in this work, we focus on bridging this gap in speech recognition. For this, we collect and release CrowdSpeech -- the first publicly available large-scale dataset of crowdsourced audio transcriptions. Evaluation of existing and novel aggregation methods on our data shows room for improvement, suggesting that our work may entail the design of better algorithms. At a higher level, we also contribute to the more general challenge of developing the methodology for reliable data collection via crowdsourcing. Specifically, we design a principled pipeline for constructing datasets of crowdsourced audio transcriptions in any novel domain. We show its applicability on an under-resourced language by constructing VoxDIY -- a counterpart of CrowdSpeech for the Russian language. We also release the code that allows a full replication of our data collection pipeline and share various insights on best practices of data collection via crowdsourcing.
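The aggregation problem the abstract describes is to merge several noisy worker transcriptions of the same recording into one output. As a toy illustration of the voting idea behind methods such as ROVER, the sketch below votes word-by-word over transcriptions; it assumes the inputs are already aligned position-by-position, whereas real ROVER first builds a word transition network via iterative alignment. The function name is hypothetical, not from the paper's codebase.

```python
from collections import Counter

def vote_aggregate(transcriptions: list[str]) -> str:
    """Toy word-level majority voting over crowdsourced transcriptions.

    Assumes inputs are position-aligned; real ROVER-style methods align
    the transcriptions first, then vote per alignment slot."""
    tokenized = [t.split() for t in transcriptions]
    length = max(len(words) for words in tokenized)
    result = []
    for pos in range(length):
        # Missing positions vote for an empty "no word" token.
        votes = Counter(words[pos] if pos < len(words) else ""
                        for words in tokenized)
        word, _ = votes.most_common(1)[0]
        if word:
            result.append(word)
    return " ".join(result)
```

With three workers disagreeing on one word, the majority wins: `vote_aggregate(["the cat sat", "the cat sit", "the cat sat"])` yields `"the cat sat"`.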

Code Repositories

Toloka/CrowdSpeech (official, mentioned in GitHub)

Benchmarks

| Benchmark | Methodology | Word Error Rate (WER) |
| --- | --- | --- |
| crowdsourced-text-aggregation-on-crowdspeech | ROVER | 7.29 |
| crowdsourced-text-aggregation-on-crowdspeech | RASA | 8.6 |
| crowdsourced-text-aggregation-on-crowdspeech | HRRASA | 8.59 |
| crowdsourced-text-aggregation-on-crowdspeech-1 | ROVER | 13.41 |
| crowdsourced-text-aggregation-on-crowdspeech-1 | HRRASA | 15.66 |
| crowdsourced-text-aggregation-on-crowdspeech-1 | RASA | 15.67 |

