VStyle: A Benchmark for Voice Style Adaptation with Spoken Instructions

Abstract

Spoken language models (SLMs) have emerged as a unified paradigm for speech understanding and generation, enabling natural human-machine interaction. However, while most progress has focused on semantic accuracy and instruction following, the ability of SLMs to adapt their speaking style based on spoken instructions has received limited attention. We introduce Voice Style Adaptation (VSA), a new task that examines whether SLMs can modify their speaking style, such as timbre, prosody, or persona, following natural-language spoken commands. To study this task, we present VStyle, a bilingual (Chinese & English) benchmark covering four categories of speech generation: acoustic attributes, natural language instruction, role play, and implicit empathy. We also introduce the Large Audio Language Model as a Judge (LALM as a Judge) framework, which progressively evaluates outputs along textual faithfulness, style adherence, and naturalness, ensuring reproducible and objective assessment. Experiments on commercial systems and open-source SLMs demonstrate that current models face clear limitations in controllable style adaptation, highlighting both the novelty and challenge of this task. By releasing VStyle and its evaluation toolkit, we aim to provide the community with a foundation for advancing human-centered spoken interaction. The dataset and code are publicly available at the project's homepage: https://junzhan2000.github.io/VStyle.github.io/