VStyle: A Benchmark for Voice Style Adaptation with Spoken Instructions

Abstract
Spoken language models (SLMs) have emerged as a unified paradigm for speech understanding and generation, enabling natural human-machine interaction. However, while most progress has focused on semantic accuracy and instruction following, the ability of SLMs to adapt their speaking style based on spoken instructions has received limited attention. We introduce Voice Style Adaptation (VSA), a new task that examines whether SLMs can modify their speaking style, such as timbre, prosody, or persona, following natural-language spoken commands. To study this task, we present VStyle, a bilingual (Chinese & English) benchmark covering four categories of speech generation: acoustic attributes, natural language instruction, role play, and implicit empathy. We also introduce the Large Audio Language Model as a Judge (LALM as a Judge) framework, which progressively evaluates outputs along textual faithfulness, style adherence, and naturalness, ensuring reproducible and objective assessment. Experiments on commercial systems and open-source SLMs demonstrate that current models face clear limitations in controllable style adaptation, highlighting both the novelty and challenge of this task. By releasing VStyle and its evaluation toolkit, we aim to provide the community with a foundation for advancing human-centered spoken interaction. The dataset and code are publicly available at the project's homepage: https://junzhan2000.github.io/VStyle.github.io/.
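The progressive evaluation described for LALM as a Judge can be pictured as a gated pipeline: each criterion is scored in turn, and later criteria are judged only once earlier ones pass. The Python sketch below illustrates that control flow under stated assumptions; `StageResult`, `progressive_judge`, and the dummy stage functions are hypothetical illustrations, not the released VStyle toolkit API.

```python
# Minimal sketch of a progressive "LALM as a Judge" evaluation loop.
# All names below are hypothetical; the actual VStyle toolkit may differ.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class StageResult:
    stage: str       # e.g. "textual_faithfulness"
    passed: bool     # whether this stage's criterion is met
    score: float     # e.g. a 1-5 rating from the judge model
    rationale: str   # the judge's free-text explanation

# Each stage maps (generated audio, spoken instruction) to a StageResult;
# in practice it would wrap a call to a large audio-language model with a
# stage-specific rubric prompt.
JudgeStage = Callable[[bytes, str], StageResult]

def progressive_judge(audio: bytes, instruction: str,
                      stages: List[JudgeStage]) -> List[StageResult]:
    """Run judge stages in a fixed order, stopping at the first failure,
    so later criteria (style adherence, naturalness) are scored only
    once earlier ones (textual faithfulness) are satisfied."""
    results: List[StageResult] = []
    for stage in stages:
        result = stage(audio, instruction)
        results.append(result)
        if not result.passed:
            break  # skip later stages once a criterion fails
    return results

# Dummy stages standing in for real LALM calls.
def textual_faithfulness(audio: bytes, instruction: str) -> StageResult:
    return StageResult("textual_faithfulness", True, 4.5, "content matches")

def style_adherence(audio: bytes, instruction: str) -> StageResult:
    return StageResult("style_adherence", False, 2.0, "prosody unchanged")

def naturalness(audio: bytes, instruction: str) -> StageResult:
    return StageResult("naturalness", True, 4.0, "sounds fluent")

if __name__ == "__main__":
    verdicts = progressive_judge(b"...", "Whisper this sentence sadly.",
                                 [textual_faithfulness, style_adherence,
                                  naturalness])
    for v in verdicts:
        print(v.stage, v.passed, v.score, v.rationale)
    # Only the first two stages print: naturalness is skipped because
    # style adherence failed.
```

The early-exit design mirrors the paper's ordering of criteria: an output that is textually unfaithful is not worth scoring for style or naturalness, which keeps the judge's verdicts reproducible and comparable across systems.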