EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs
Yuhao Zhang Yuhao Du Zhanchen Dai Xiangnan Ma Kaiqi Kou Benyou Wang Haizhou Li

Abstract
Speech-to-speech large language models (SLLMs) are attracting increasing attention. Derived from text-based large language models (LLMs), SLLMs often exhibit degradation in knowledge and reasoning capabilities. We hypothesize that this limitation arises because current training paradigms for SLLMs fail to bridge the acoustic-semantic gap in the feature representation space. To address this issue, we propose EchoX, which leverages semantic representations and dynamically generates speech training targets. This approach integrates both acoustic and semantic learning, enabling EchoX to preserve strong reasoning abilities as a speech LLM. Experimental results demonstrate that EchoX, with about six thousand hours of training data, achieves advanced performance on multiple knowledge-based question-answering benchmarks. The project is available at https://github.com/FreedomIntelligence/EchoX.