Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models
Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, Jingren Zhou

Abstract
Recently, instruction-following audio-language models have received broad attention for audio interaction with humans. However, the absence of pre-trained audio models capable of handling diverse audio types and tasks has hindered progress in this field. Consequently, most existing works have only been able to support a limited range of interaction capabilities. In this paper, we develop the Qwen-Audio model and address this limitation by scaling up audio-language pre-training to cover over 30 tasks and various audio types, such as human speech, natural sounds, music, and songs, to facilitate universal audio understanding abilities. However, directly co-training all tasks and datasets can lead to interference issues, as the textual labels associated with different datasets exhibit considerable variation due to differences in task focus, language, granularity of annotation, and text structure. To overcome this one-to-many interference, we carefully design a multi-task training framework that conditions the decoder on a sequence of hierarchical tags, encouraging knowledge sharing through shared tags and avoiding interference through task-specific tags. Remarkably, Qwen-Audio achieves impressive performance across diverse benchmark tasks without requiring any task-specific fine-tuning, surpassing its counterparts. Building upon the capabilities of Qwen-Audio, we further develop Qwen-Audio-Chat, which accepts diverse audio and text inputs, enabling multi-turn dialogues and supporting various audio-centric scenarios.
Code Repositories
https://github.com/QwenLM/Qwen-Audio (official)
Benchmarks
| Benchmark | Model | Metrics |
|---|---|---|
| Acoustic Scene Classification on CochlScene | Qwen-Audio | 1:1 Accuracy: 0.795 |
| Acoustic Scene Classification on TUT Acoustic Scenes | Qwen-Audio | 1:1 Accuracy: 0.649 |
| Audio Captioning on Clotho | Qwen-Audio | CIDEr: 0.441, SPICE: 0.136, SPIDEr: 0.288 |
| Audio Classification on VocalSound | Qwen-Audio | Accuracy: 92.89 |
| Emotion Recognition in Conversation on MELD | Qwen-Audio | Accuracy: 55.70 |
| Speech Recognition on AISHELL-1 | Qwen-Audio | Word Error Rate (WER): 1.29 |
| Speech Recognition on AISHELL-2 (test-android) | Qwen-Audio | Word Error Rate (WER): 3.3 |
| Speech Recognition on AISHELL-2 (test-ios) | Qwen-Audio | Word Error Rate (WER): 3.1 |
| Speech Recognition on AISHELL-2 (test-mic) | Qwen-Audio | Word Error Rate (WER): 3.3 |
| Speech Recognition on LibriSpeech (test-clean) | Qwen-Audio | Word Error Rate (WER): 2.0 |
| Speech Recognition on LibriSpeech (test-other) | Qwen-Audio | Word Error Rate (WER): 4.2 |