
VOXLINGUA107: A DATASET FOR SPOKEN LANGUAGE RECOGNITION

Tanel Alumae, Jorgen Valk

Abstract

This paper investigates the use of automatically collected web audio data for the task of spoken language recognition. We generate semi-random search phrases from language-specific Wikipedia data and use them to retrieve videos from YouTube for 107 languages. Speech activity detection and speaker diarization are used to extract the segments of the videos that contain speech. Post-filtering removes segments that are likely not in the given language, increasing the proportion of correctly labeled segments to 98%, based on crowd-sourced verification. The resulting training set (VoxLingua107) contains 6628 hours of speech (62 hours per language on average) and is accompanied by an evaluation set of 1609 verified utterances. We use the data to build language recognition models for several spoken language identification tasks. Experiments show that using the automatically retrieved training data gives results competitive with those obtained using hand-labeled proprietary datasets. The dataset is publicly available.
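The abstract does not say how the semi-random search phrases are built, so the following Python is only a minimal sketch of one plausible reading: pick random runs of consecutive words from language-specific text and use each run as a query. The function name `sample_search_phrases`, the phrase length of three words, the fixed seed, and the sample sentence are all assumptions made for illustration, not details from the paper.

```python
import random


def sample_search_phrases(text, n_phrases=5, phrase_len=3, seed=0):
    """Pick random runs of consecutive words to use as search queries.

    Hypothetical helper: the phrase length and sampling scheme are
    assumptions, not the authors' documented procedure.
    """
    words = text.split()
    if len(words) < phrase_len:
        return []
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    phrases = []
    for _ in range(n_phrases):
        # Choose a start index that leaves room for a full phrase.
        start = rng.randrange(len(words) - phrase_len + 1)
        phrases.append(" ".join(words[start:start + phrase_len]))
    return phrases


# Hypothetical stand-in for text taken from one language's Wikipedia.
sample_text = "Tallinn on Eesti pealinn ja suurim linn"
phrases = sample_search_phrases(sample_text, n_phrases=3)
print(phrases)
```

Each returned phrase would then be submitted as a video search query for that language. Sampling consecutive words rather than independent ones keeps the queries looking like natural text in the target language, which is the property the retrieval step relies on.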

Benchmarks

| Benchmark | Methodology | Metrics |
|---|---|---|
| spoken-language-identification-on | Cleaned | 0..5sec: 13.4; 5..20sec: 6.6; Average: 7.6 |
| spoken-language-identification-on | Noisy | 0..5sec: 12.3; 5..20sec: 6.1; Average: 7.1 |
| spoken-language-identification-on-kalaka-3 | Model on the automatically filtered (cleaned) data | EC: 0.022; EO: 0.058; PC: 0.041; PO: 0.056 |
| spoken-language-identification-on-kalaka-3 | Model on the noisy data | EC: 0.033; EO: 0.059; PC: 0.055; PO: 0.083 |
| spoken-language-identification-on-lre07 | Fusion of models | 3 sec: 15.29; 10 sec: 4.54; 30 sec: 1.30; Average: 7.04 |
| spoken-language-identification-on-lre07 | CNN-SAP | 3 sec: 8.59; 10 sec: 2.49; 30 sec: 1.09; Average: 4.06 |
| spoken-language-identification-on-lre07 | GMM-MMI | 3 sec: 17.28; 10 sec: 5.90; 30 sec: 2.10; Average: 8.42 |
| spoken-language-identification-on-lre07 | Phonotactic | 3 sec: 18.59; 10 sec: 6.28; 30 sec: 1.34; Average: 8.73 |
| spoken-language-identification-on-lre07 | Kaldi i-vector | 3 sec: 26.04; 10 sec: 11.93; 30 sec: 4.52; Average: 14.17 |
| spoken-language-identification-on-lre07 | Kaldi i-vector DNN | 3 sec: 19.67; 10 sec: 7.84; 30 sec: 3.31; Average: 10.27 |
| spoken-language-identification-on-lre07 | CNN-LDE | 3 sec: 8.25; 10 sec: 2.61; 30 sec: 1.16; Average: 4.00 |
| spoken-language-identification-on-lre07 | Resnet34 (cleaned data) | 3 sec: 9.39; 10 sec: 3.14; 30 sec: 1.90; Average: 4.81 |
| spoken-language-identification-on-lre07 | Resnet34 (noisy data) | 3 sec: 10.58; 10 sec: 3.33; 30 sec: 1.72; Average: 5.21 |
