7 months ago

Alfons Juan Albert Sanchis Jorge Civera Alejandro Pérez-González-de-Martos Nahuel Roselló Pau Baquero-Arnal Javier Iranzo-Sánchez Adrià Giménez Pastor Javier Jorge Joan-Albert Silvestre-Cerdà

Abstract

We introduce Europarl-ASR, a large speech and text corpus of parliamentary debates including 1 300 hours of transcribed speeches and 70 million tokens of text in English extracted from European Parliament sessions. The training set is labelled with the Parliament’s non-fully-verbatim official transcripts, time-aligned. As verbatimness is critical for acoustic model training, we also provide automatically noise-filtered and automatically verbatimized transcripts of all speeches based on speech data filtering and verbatimization techniques. Additionally, 18 hours of transcribed speeches were manually verbatimized to build reliable speaker-dependent and speaker-independent development/test sets for streaming ASR benchmarking. The availability of manual non-verbatim and verbatim transcripts for dev/test speeches makes this corpus useful for the assessment of automatic filtering and verbatimization techniques. This paper describes the corpus and its creation, and provides off-line and streaming ASR baselines for both the speaker-dependent and speaker-independent tasks using the three training transcription sets. The corpus is publicly released under an open licence.

Europarl-ASR: A Large Corpus of Parliamentary Debates for Streaming ASR Benchmarking and Speech Data Filtering/Verbatimization
