4 months ago

SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation

Youliang Zhang, Zhaoyang Li, Duomin Wang, Jiahe Zhang, Deyu Zhou, Zixin Yin, Xili Dai, Gang Yu, Xiu Li

Abstract

The rapid development of large-scale models has catalyzed significant breakthroughs in the digital human domain. These advanced methodologies offer high-fidelity solutions for avatar driving and rendering, leading academia to focus on the next major challenge: audio-visual dyadic interactive virtual human. To facilitate research in this emerging area, we present SpeakerVid-5M dataset, the first large-scale, high-quality dataset designed for audio-visual dyadic interactive virtual human generation. Totaling over 8,743 hours, SpeakerVid-5M contains more than 5.2 million video clips of human portraits. It covers diverse scales and interaction types, including monadic talking, listening, and dyadic conversations. Crucially, the dataset is structured along two key dimensions: interaction type and data quality. First, it is categorized into four types (dialogue branch, single branch, listening branch and multi-turn branch) based on the interaction scenario. Second, it is stratified into a large-scale pre-training subset and a curated, high-quality subset for Supervised Fine-Tuning (SFT). This dual structure accommodates a wide array of 2D virtual human tasks. In addition, we provide an autoregressive (AR)-based video chat baseline trained on this data, accompanied by a dedicated set of metrics and test data to serve as a benchmark VidChatBench for future work. Both the dataset and the corresponding data processing code will be publicly released. Project page: https://dorniwang.github.io/SpeakerVid-5M/
