Speech-Text Dialog Pre-training for Spoken Dialog Understanding with Explicit Cross-Modal Alignment
Tianshu Yu; Haoyu Gao; Ting-En Lin; Min Yang; Yuchuan Wu; Wentao Ma; Chao Wang; Fei Huang; Yongbin Li

Abstract
Speech-text pre-training methods have recently shown remarkable success on many speech and natural language processing tasks. However, most existing pre-trained models are tailored to one or two specific tasks and do not generalize to a wide range of speech-text tasks. Existing speech-text pre-training methods also do not exploit the contextual information within a dialog to enrich utterance representations. In this paper, we propose Speech-text dialog Pre-training for spoken dialog understanding with ExpliCiT cRoss-Modal Alignment (SPECTRA), the first speech-text dialog pre-training model. Concretely, to account for the temporal nature of the speech modality, we design a novel temporal position prediction task that captures speech-text alignment by predicting the start and end time of each textual word in the corresponding speech waveform. In addition, to learn the characteristics of spoken dialogs, we generalize the response selection task from textual dialog pre-training to the speech-text dialog pre-training setting. Experimental results on four downstream speech-text tasks demonstrate the superiority of SPECTRA in learning speech-text alignment and multi-turn dialog context.
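The temporal position prediction task described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the `WordSpan` type, the normalization of timestamps by utterance duration, and the mean squared error loss are all assumptions chosen for illustration, and the abstract does not specify how SPECTRA encodes targets or which loss it uses.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class WordSpan:
    """Hypothetical container for one word and its time span in the audio."""
    word: str
    start: float  # seconds from the beginning of the utterance
    end: float    # seconds from the beginning of the utterance


def temporal_targets(spans: List[WordSpan], duration: float) -> List[Tuple[float, float]]:
    """Turn word-level timestamps into (start, end) regression targets.

    Assumption: times are normalized to [0, 1] by the utterance duration.
    """
    if duration <= 0:
        raise ValueError("duration must be positive")
    return [(s.start / duration, s.end / duration) for s in spans]


def temporal_position_loss(pred: List[Tuple[float, float]],
                           target: List[Tuple[float, float]]) -> float:
    """Mean squared error over all predicted start and end positions.

    Assumption: a plain regression loss stands in for the real objective.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    if not target:
        return 0.0
    total = 0.0
    for (p_start, p_end), (t_start, t_end) in zip(pred, target):
        total += (p_start - t_start) ** 2 + (p_end - t_end) ** 2
    return total / (2 * len(target))


# Usage: two words in a 2-second utterance.
spans = [WordSpan("hello", 0.0, 0.5), WordSpan("world", 0.5, 1.0)]
targets = temporal_targets(spans, duration=2.0)
loss = temporal_position_loss([(0.0, 0.25), (0.25, 1.0)], targets)
```

In a real pre-training setup the word-level timestamps would typically come from a forced aligner, and the predictions would come from a head on top of the fused speech-text representation of each word; both are outside the scope of this sketch.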
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| emotion-recognition-in-conversation-on | SPECTRA | Accuracy: 67.94 |
| multimodal-intent-recognition-on-mintrec | SPECTRA | Accuracy (20 classes): 73.48 |
| multimodal-sentiment-analysis-on-cmu-mosei-1 | SPECTRA | Accuracy: 87.34 |
| multimodal-sentiment-analysis-on-cmu-mosi | SPECTRA | Acc-2: 87.5 |
| multimodal-sentiment-analysis-on-mosi | SPECTRA | Accuracy: 87.50 |