7 months ago

Rosie Jones, Ben Carterette, Jussi Karlgren, Gareth Jones, Maria Eskevich, Hamed Bonab, Rezvaneh Rezapour, Aasish Pappu, Yongze Yu, Sravana Reddy

Abstract

Podcasts are a large and growing repository of spoken audio. As an audio format, podcasts are more varied in style and production type than broadcast news, contain more genres than are typically studied in video data, and are more varied in style and format than previous corpora of conversations. When transcribed with automatic speech recognition, they represent a noisy but fascinating collection of documents which can be studied through the lens of natural language processing, information retrieval, and linguistics. Paired with the audio files, they are also a resource for speech processing and the study of paralinguistic, sociolinguistic, and acoustic aspects of the domain. We introduce the Spotify Podcast Dataset, a new corpus of 100,000 podcasts. We demonstrate the complexity of the domain with a case study of two tasks: (1) passage search and (2) summarization. The corpus is orders of magnitude larger than previous speech corpora used for search and summarization. Our results show that the size and variability of this corpus open up new avenues for research.

100,000 Podcasts: A Spoken English Document Corpus
