
Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages

Felix Wu, Kwangyoun Kim, Shinji Watanabe, Kyu Han, Ryan McDonald, Kilian Q. Weinberger, Yoav Artzi


Abstract

We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data. We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task -- transcribing audio inputs into pseudo subword sequences. This process stands on its own, or can be applied as low-cost second-stage pre-training. We experiment with automatic speech recognition (ASR), spoken named entity recognition, and speech-to-text translation. We set new state-of-the-art results for end-to-end spoken named entity recognition, and show consistent improvements on 20 language pairs for speech-to-text translation, even when competing methods use additional text data for training. Finally, on ASR, our approach enables encoder-decoder methods to benefit from pre-training for all parts of the network, and shows comparable performance to highly optimized recent methods.
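The abstract describes inducing a pseudo language: audio is mapped to a compact discrete representation, and the encoder-decoder model is pre-trained to transcribe audio into pseudo subword sequences. The sketch below illustrates one plausible version of that discretization step under stated assumptions; it is not the paper's implementation (the official code is in asappresearch/wav2seq). It assumes frame-level features from a pre-trained speech encoder such as HuBERT, clusters them with k-means into discrete units, and collapses consecutive repeats; a subword model (e.g. BPE) trained on the resulting unit strings would then yield the pseudo-subword targets for the pseudo speech recognition task.

```python
# Minimal sketch of pseudo-language induction (assumptions noted above).
# Random features stand in for real encoder outputs so the example runs as-is.
import numpy as np
from sklearn.cluster import KMeans

def induce_pseudo_units(features, n_clusters=25, seed=0):
    """Cluster frame-level features into discrete unit IDs (one per frame)."""
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10)
    return km.fit_predict(features)

def collapse_repeats(units):
    """Run-length deduplication: [3, 3, 3, 7, 7, 1] -> [3, 7, 1]."""
    out = [units[0]]
    for u in units[1:]:
        if u != out[-1]:
            out.append(u)
    return out

# Toy example: 200 frames of 39-dim features standing in for encoder outputs.
rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 39))
units = induce_pseudo_units(feats)
pseudo_sentence = " ".join(str(u) for u in collapse_repeats(units))
print(pseudo_sentence)  # a "sentence" in the induced pseudo language
```

In a full pipeline, pairs of (audio, pseudo-subword sequence) built this way would serve as the training data for the self-supervised pseudo speech recognition task, either as stand-alone pre-training or as a low-cost second stage on top of an existing encoder.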

Code Repositories

asappresearch/wav2seq (official, PyTorch)

Benchmarks

Benchmark: named-entity-recognition-on-slue
Methodology: Wav2Seq (from HuBERT-large)
Metrics: F1 (%): 65.4
