PaCE: Unified Multi-modal Dialogue Pre-training with Progressive and Compositional Experts
Yunshui Li; Binyuan Hui; ZhiChao Yin; Min Yang; Fei Huang; Yongbin Li

Abstract
Perceiving multi-modal information and conducting dialogues with humans is a long-term goal of artificial intelligence. Pre-training is commonly regarded as an effective approach for multi-modal dialogue. However, because multi-modal dialogue data is scarce, there has been little research on multi-modal dialogue pre-training. A further challenge arises from the broad scope of multi-modal dialogue, which involves various modalities and tasks. Moreover, new forms of tasks may arise at unpredictable points in the future. Hence, multi-modal dialogue models must be flexible enough to adapt to such scenarios. This paper proposes PaCE, a unified, structured, compositional multi-modal dialogue pre-training framework. It combines several fundamental experts to accommodate multiple dialogue-related tasks and can be pre-trained using limited dialogue data and extensive non-dialogue multi-modal data. Furthermore, we propose a progressive training method in which previously trained experts assist new experts, facilitating the expansion of their capabilities. Experimental results demonstrate that PaCE achieves state-of-the-art results on eight multi-modal dialogue benchmarks.
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| dialogue-state-tracking-on-mmconv | PaCE | Categorical Accuracy: 92.2; Non-Categorical Accuracy: 43.4; Overall: 39.2 |
| dialogue-state-tracking-on-simmc2-0 | PaCE | Act F1: 97.1; Slot F1: 87.0 |
| image-retrieval-on-photochat | PaCE | R@1: 15.2; R@5: 36.7; R@10: 49.6; Sum(R@1,5,10): 101.5 |
| multimodal-intent-recognition-on-mmdialog | PaCE | F1: 77.6 |
| multimodal-intent-recognition-on-photochat | PaCE | F1: 63.8; Precision: 63.3; Recall: 68 |
| response-generation-on-mmconv | PaCE | BLEU: 22; Comb.: 44.7; Inform: 34.5; Success: 13.9 |
| response-generation-on-simmc2-0 | PaCE | BLEU: 34.1 |
| text-retrieval-on-image-chat | PaCE | R@1: 51.9; R@5: 76.8; Sum(R@1,5): 128.7 |
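
The Sum(...) entries in the retrieval rows appear to be the plain sum of the individual recall values listed in the same row. The short Python check below illustrates that reading using only numbers from the table. The dictionary names and the `recall_sum` helper are hypothetical, introduced here for illustration, and rounding to one decimal place is an assumption made to match the precision shown in the table.

```python
# Recall values copied from the benchmark table above.
# The variable names are illustrative and do not come from the paper.
photochat_image_retrieval = {"R@1": 15.2, "R@5": 36.7, "R@10": 49.6}
image_chat_text_retrieval = {"R@1": 51.9, "R@5": 76.8}


def recall_sum(recalls):
    """Add up the recall values, rounded to one decimal place (assumed precision)."""
    return round(sum(recalls.values()), 1)


print("PhotoChat Sum(R@1,5,10):", recall_sum(photochat_image_retrieval))
print("Image-Chat Sum(R@1,5):", recall_sum(image_chat_text_retrieval))
```

Both results agree with the Sum values reported in the table (101.5 and 128.7).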