WenetSpeech Yue Cantonese Corpus Dataset

Date: 2 months ago
Size: 1.46 GB
Organization: AISHELL, China Telecom, Northwestern Polytechnical University
Paper URL: 2509.03959
License: Non-Commercial

WenetSpeech-Yue is a large-scale Cantonese speech corpus with multi-dimensional annotation, intended for automatic speech recognition (ASR) and text-to-speech synthesis (TTS). It was released in 2025 by Northwestern Polytechnical University, the China Telecom Artificial Intelligence Research Institute, Beijing Hill Shell Technology Co., Ltd. (AISHELL), and other institutions. The accompanying paper, "WenetSpeech-Yue: A Large-scale Cantonese Speech Corpus with Multi-dimensional Annotation", presents the corpus as a response to the shortage of Cantonese speech resources and as a basis for training and evaluating high-quality Cantonese models.

The dataset contains approximately 21,800 hours of Cantonese recordings across 10 domains: storytelling, entertainment, drama, culture, vlogs, commentary, education, podcasts, news, and others. It is suited to training and evaluating Cantonese ASR and TTS models on the varied domains and speaking styles found in real-world speech, and it also supports evaluating cross-domain generalization.

Data composition:

  • Transcribed text: automatic speech recognition results;
  • Confidence scores: for example, text confidence and Cantonese pinyin confidence;
  • Speaker attributes: gender, age, and speaker ID;
  • Speech quality metrics: for example, SNR and DNSMOS;
  • Time annotations: duration and character-level timestamps;
  • Extended metadata: program name, region, link, and register information.
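
The annotation types listed above lend themselves to quality-based subsetting, for example keeping only utterances with high transcript confidence and clean audio before training. The following is a minimal sketch of that idea, not the dataset's actual schema or tooling: the `Utterance` class, all of its field names, the threshold values, and the sample records are hypothetical and would need to be mapped to the real metadata format described in the dataset's README.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Utterance:
    # Hypothetical field names; the real metadata keys may differ.
    utt_id: str
    text: str               # ASR transcript
    text_confidence: float  # assumed to lie in [0, 1]
    snr_db: float           # signal-to-noise ratio in dB
    dnsmos: float           # DNSMOS quality score
    duration_s: float       # duration in seconds
    domain: str             # e.g. "news", "podcasts"


def filter_utterances(
    utts: Iterable[Utterance],
    min_confidence: float = 0.9,  # illustrative threshold
    min_snr_db: float = 10.0,     # illustrative threshold
    min_dnsmos: float = 3.0,      # illustrative threshold
) -> List[Utterance]:
    """Keep utterances that meet all three quality thresholds."""
    return [
        u for u in utts
        if u.text_confidence >= min_confidence
        and u.snr_db >= min_snr_db
        and u.dnsmos >= min_dnsmos
    ]


def hours_by_domain(utts: Iterable[Utterance]) -> Dict[str, float]:
    """Sum utterance durations per domain, in hours."""
    totals: Dict[str, float] = {}
    for u in utts:
        totals[u.domain] = totals.get(u.domain, 0.0) + u.duration_s / 3600.0
    return totals


# Made-up records for illustration only.
sample = [
    Utterance("a", "placeholder transcript", 0.95, 20.0, 3.4, 3600.0, "news"),
    Utterance("b", "placeholder transcript", 0.60, 25.0, 3.8, 1800.0, "podcasts"),
    Utterance("c", "placeholder transcript", 0.97, 5.0, 2.1, 7200.0, "news"),
]

kept = filter_utterances(sample)
print([u.utt_id for u in kept])
print(hours_by_domain(kept))
```

A stricter filter of this kind is commonly used to pick a TTS training subset, while a looser one may be acceptable for ASR; suitable thresholds depend on the task and should be chosen empirically.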

WenetSpeech-Yue.torrent
  • WenetSpeech-Yue/
    • README.md (2.12 KB)
    • README.txt (4.23 KB)
    • data/
      • WenetSpeech-Yue.zip (1.46 GB)
