7 months ago

Abstract

The rapid development of large-scale models has catalyzed significant breakthroughs in the digital human domain. These advanced methodologies offer high-fidelity solutions for avatar driving and rendering, leading academia to focus on the next major challenge: audio-visual dyadic interactive virtual human. To facilitate research in this emerging area, we present SpeakerVid-5M dataset, the first large-scale, high-quality dataset designed for audio-visual dyadic interactive virtual human generation. Totaling over 8,743 hours, SpeakerVid-5M contains more than 5.2 million video clips of human portraits. It covers diverse scales and interaction types, including monadic talking, listening, and dyadic conversations. Crucially, the dataset is structured along two key dimensions: interaction type and data quality. First, it is categorized into four types (dialogue branch, single branch, listening branch and multi-turn branch) based on the interaction scenario. Second, it is stratified into a large-scale pre-training subset and a curated, high-quality subset for Supervised Fine-Tuning (SFT). This dual structure accommodates a wide array of 2D virtual human tasks. In addition, we provide an autoregressive (AR)-based video chat baseline trained on this data, accompanied by a dedicated set of metrics and test data to serve as a benchmark VidChatBench for future work. Both the dataset and the corresponding data processing code will be publicly released. Project page: https://dorniwang.github.io/SpeakerVid-5M/
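The two-dimensional organization described above (interaction type crossed with data quality) can be pictured with a small sketch. Everything named here is hypothetical and for illustration only: the `Branch`, `Tier`, and `ClipRecord` types, the field names, and the sample records are assumptions, not the released SpeakerVid-5M format. The sketch also treats the quality tier as a per-clip label, which the abstract does not confirm. The last helper only does the arithmetic implied by the headline figures, so its result is a rough estimate, since both figures are stated as lower bounds.

```python
from dataclasses import dataclass
from enum import Enum


class Branch(Enum):
    """The four interaction types named in the abstract (labels are assumed)."""
    DIALOGUE = "dialogue"
    SINGLE = "single"
    LISTENING = "listening"
    MULTI_TURN = "multi_turn"


class Tier(Enum):
    """The two quality strata named in the abstract (labels are assumed)."""
    PRETRAIN = "pretrain"
    SFT = "sft"


@dataclass(frozen=True)
class ClipRecord:
    """Hypothetical manifest entry; not the dataset's actual schema."""
    clip_id: str
    branch: Branch
    tier: Tier
    duration_s: float


def select(records, branch, tier):
    """Return the clips that fall in one (interaction type, quality) cell."""
    return [r for r in records if r.branch is branch and r.tier is tier]


def mean_clip_seconds(total_hours, num_clips):
    """Average clip length in seconds implied by total hours and clip count."""
    return total_hours * 3600.0 / num_clips


# Made-up sample records, used only to exercise the filter.
records = [
    ClipRecord("a", Branch.DIALOGUE, Tier.SFT, 5.0),
    ClipRecord("b", Branch.DIALOGUE, Tier.PRETRAIN, 7.5),
    ClipRecord("c", Branch.LISTENING, Tier.PRETRAIN, 4.0),
]

dialogue_sft = select(records, Branch.DIALOGUE, Tier.SFT)
approx_mean = mean_clip_seconds(8743, 5_200_000)
```

With the figures quoted in the abstract, `approx_mean` comes out at roughly six seconds per clip. A task such as dyadic video chat fine-tuning would then draw from a single cell, here the dialogue branch of the SFT tier.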

Youliang Zhang Zhaoyang Li Duomin Wang Jiahe Zhang Deyu Zhou Zixin Yin Xili Dai Gang Yu Xiu Li

SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation
