Thai Binh Nguyen
Abstract
Our models are pre-trained on 13k hours of unlabeled Vietnamese YouTube audio and fine-tuned on 250 hours of labeled speech from the VLSP ASR dataset, sampled at 16 kHz. We use the wav2vec 2.0 architecture for the pre-trained model. In the fine-tuning phase, wav2vec 2.0 is trained with Connectionist Temporal Classification (CTC), an algorithm for training neural networks on sequence-to-sequence problems, used mainly in automatic speech recognition and handwriting recognition. On the VIVOS dataset, we achieved a WER score of 6.15.
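To make the CTC decoding step above concrete, here is a minimal sketch of greedy CTC decoding: take the most likely token per frame, collapse consecutive repeats, then drop the blank symbol. The tiny vocabulary and frame-level predictions are invented for illustration and are not taken from the model described here.

```python
BLANK_ID = 0  # CTC blank token id (position in the vocabulary is an assumption)

def ctc_greedy_decode(frame_ids, id_to_char, blank_id=BLANK_ID):
    """Collapse repeated frame predictions, then strip CTC blanks."""
    out = []
    prev = None
    for i in frame_ids:
        # Emit a character only when the frame id changes and is not blank;
        # a blank between two identical ids separates genuine repeats.
        if i != prev and i != blank_id:
            out.append(id_to_char[i])
        prev = i
    return "".join(out)

# Invented example: per-frame argmax ids spelling "xin"
id_to_char = {0: "_", 1: "x", 2: "i", 3: "n"}
frames = [0, 1, 1, 0, 2, 2, 2, 0, 0, 3, 3]
print(ctc_greedy_decode(frames, id_to_char))  # -> "xin"
```

In a real pipeline these frame ids would come from the argmax of the fine-tuned model's per-frame logits; beam-search decoders refine this greedy rule but follow the same collapse-then-strip principle.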
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| speech-recognition-on-common-voice-vi | Vietnamese end-to-end speech recognition using wav2vec 2.0 by VietAI | Test WER: 11.52 |
| speech-recognition-on-vivos | Vietnamese end-to-end speech recognition using wav2vec 2.0 by VietAI | Test WER: 6.15 |
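The WER figures above are word error rates: the word-level edit distance between the hypothesis and the reference transcript, divided by the number of reference words. A minimal sketch of the metric (the example sentences are invented, not drawn from either benchmark):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deleting i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dp[i - 1][j] + 1
            insertion = dp[i][j - 1] + 1
            dp[i][j] = min(substitution, deletion, insertion)
    return dp[len(ref)][len(hyp)] / len(ref)

# Invented example: one substitution over four reference words -> 0.25
print(wer("xin chao cac ban", "xin chao cac bann"))  # 0.25
```

A reported "Test WER: 6.15" therefore means roughly 6.15 word errors per 100 reference words on the test set.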