Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models
Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, Jingren Zhou

Abstract
Recently, instruction-following audio-language models have received broad attention for audio interaction with humans. However, the absence of pre-trained audio models capable of handling diverse audio types and tasks has hindered progress in this field. Consequently, most existing works have only been able to support a limited range of interaction capabilities. In this paper, we develop the Qwen-Audio model and address this limitation by scaling up audio-language pre-training to cover over 30 tasks and various audio types, such as human speech, natural sounds, music, and songs, to facilitate universal audio understanding abilities. However, directly co-training all tasks and datasets can lead to interference issues, as the textual labels associated with different datasets exhibit considerable variation due to differences in task focus, language, granularity of annotation, and text structure. To overcome this one-to-many interference, we carefully design a multi-task training framework that conditions the decoder on a sequence of hierarchical tags, encouraging knowledge sharing through shared tags and avoiding interference through task-specific tags. Remarkably, Qwen-Audio achieves impressive performance across diverse benchmark tasks without requiring any task-specific fine-tuning, surpassing its counterparts. Building upon the capabilities of Qwen-Audio, we further develop Qwen-Audio-Chat, which accepts diverse audio and text inputs, enabling multi-turn dialogues and supporting various audio-centric scenarios.
Code Repositories
https://github.com/QwenLM/Qwen-Audio (official)
Benchmarks
| Benchmark | Model | Metrics |
|---|---|---|
| Acoustic Scene Classification on CochlScene | Qwen-Audio | 1:1 Accuracy: 0.795 |
| Acoustic Scene Classification on TUT Acoustic Scenes | Qwen-Audio | 1:1 Accuracy: 0.649 |
| Audio Captioning on Clotho | Qwen-Audio | CIDEr: 0.441, SPICE: 0.136, SPIDEr: 0.288 |
| Audio Classification on VocalSound | Qwen-Audio | Accuracy: 92.89 |
| Emotion Recognition in Conversation on MELD | Qwen-Audio | Accuracy: 55.70 |
| Speech Recognition on AISHELL-1 | Qwen-Audio | Word Error Rate (WER): 1.29 |
| Speech Recognition on AISHELL-2 (test-android) | Qwen-Audio | Word Error Rate (WER): 3.3 |
| Speech Recognition on AISHELL-2 (test-ios) | Qwen-Audio | Word Error Rate (WER): 3.1 |
| Speech Recognition on AISHELL-2 (test-mic) | Qwen-Audio | Word Error Rate (WER): 3.3 |
| Speech Recognition on LibriSpeech (test-clean) | Qwen-Audio | Word Error Rate (WER): 2.0 |
| Speech Recognition on LibriSpeech (test-other) | Qwen-Audio | Word Error Rate (WER): 4.2 |