Thai Binh Nguyen
Abstract
Our models are pre-trained on 13k hours of unlabeled Vietnamese YouTube audio and fine-tuned on 250 hours of labeled speech from the VLSP ASR dataset, sampled at 16 kHz. We use the wav2vec 2.0 architecture for the pre-trained model. In the fine-tuning phase, wav2vec 2.0 is trained with Connectionist Temporal Classification (CTC), an algorithm for training neural networks on sequence-to-sequence problems, used mainly in automatic speech recognition and handwriting recognition. On the VIVOS dataset, we achieved a WER score of 6.15.
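To make the CTC decoding step above concrete, here is a minimal sketch of greedy CTC decoding: take the most likely token per frame, collapse consecutive repeats, then drop the blank symbol. The tiny vocabulary and frame-level predictions are invented for illustration and are not taken from the model described here.

```python
BLANK_ID = 0  # CTC blank token id (position in the vocabulary is an assumption)

def ctc_greedy_decode(frame_ids, id_to_char, blank_id=BLANK_ID):
    """Collapse repeated frame predictions, then strip CTC blanks."""
    out = []
    prev = None
    for i in frame_ids:
        # Emit a character only when the frame id changes and is not blank;
        # a blank between two identical ids separates genuine repeats.
        if i != prev and i != blank_id:
            out.append(id_to_char[i])
        prev = i
    return "".join(out)

# Invented example: per-frame argmax ids spelling "xin"
id_to_char = {0: "_", 1: "x", 2: "i", 3: "n"}
frames = [0, 1, 1, 0, 2, 2, 2, 0, 0, 3, 3]
print(ctc_greedy_decode(frames, id_to_char))  # -> "xin"
```

In a real pipeline these frame ids would come from the argmax of the fine-tuned model's per-frame logits; beam-search decoders refine this greedy rule but follow the same collapse-then-strip principle.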
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| speech-recognition-on-common-voice-vi | Vietnamese end-to-end speech recognition using wav2vec 2.0 by VietAI | Test WER: 11.52 |
| speech-recognition-on-vivos | Vietnamese end-to-end speech recognition using wav2vec 2.0 by VietAI | Test WER: 6.15 |
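The WER figures above are word error rates: the word-level edit distance between the hypothesis and the reference transcript, divided by the number of reference words. A minimal sketch of the metric (the example sentences are invented, not drawn from either benchmark):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deleting i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dp[i - 1][j] + 1
            insertion = dp[i][j - 1] + 1
            dp[i][j] = min(substitution, deletion, insertion)
    return dp[len(ref)][len(hyp)] / len(ref)

# Invented example: one substitution over four reference words -> 0.25
print(wer("xin chao cac ban", "xin chao cac bann"))  # 0.25
```

A reported "Test WER: 6.15" therefore means roughly 6.15 word errors per 100 reference words on the test set.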