DEPLAIN: A German Parallel Corpus with Intralingual Translations into Plain Language for Sentence and Document Simplification

Regina Stodden Omar Momen Laura Kallmeyer

Abstract

Text simplification is an intralingual translation task in which documents or sentences of a complex source text are simplified for a target audience. The success of automatic text simplification systems is highly dependent on the quality of parallel data used for training and evaluation. To advance sentence simplification and document simplification in German, this paper presents DEplain, a new dataset of parallel, professionally written and manually aligned simplifications in plain German ("plain DE" or in German: "Einfache Sprache"). DEplain consists of a news-domain corpus (approx. 500 document pairs, approx. 13k sentence pairs) and a web-domain corpus (approx. 150 aligned documents, approx. 2k aligned sentence pairs). In addition, we are building a web harvester and experimenting with automatic alignment methods to facilitate the integration of non-aligned and yet-to-be-published parallel documents. Using this approach, we are dynamically increasing the web-domain corpus, which currently extends to approx. 750 document pairs and approx. 3.5k aligned sentence pairs. We show that using DEplain to train a transformer-based seq2seq text simplification model can achieve promising results. We make available the corpus, the adapted alignment methods for German, the web harvester and the trained models here: https://github.com/rstodden/DEPlain.

Code Repositories

rstodden/deplain (official; mentioned in GitHub)

Benchmarks

| Benchmark | Methodology | BLEU | BertScore (Precision) | FRE (Flesch Reading Ease) | SARI (EASSE>=0.2.1) |
|---|---|---|---|---|---|
| text-simplification-on-deplain-apa-doc | long-mBART (trained on DEplain-web-doc) | 12.913 | 0.475 | 59.55 | 35.02 |
| text-simplification-on-deplain-apa-doc | long-mBART (trained on DEplain-APA-doc & DEplain-web-doc) | 36.449 | 0.589 | 65.4 | 42.862 |
| text-simplification-on-deplain-apa-doc | long-mBART (trained on DEplain-APA-doc) | 38.136 | 0.598 | 65.4 | 44.56 |
| text-simplification-on-deplain-apa-sent | mBART (trained on DEplain-APA-sent & DEplain-web-sent) | 28.506 | 0.64 | 62.669 | 34.904 |
| text-simplification-on-deplain-apa-sent | mBART (trained on DEplain-APA-sent) | 28.25 | 0.639 | 63.072 | 34.818 |
| text-simplification-on-deplain-web-doc | long-mBART (trained on DEplain-APA-doc) | 21.9 | 0.377 | 64.7 | 43.087 |
| text-simplification-on-deplain-web-doc | long-mBART (trained on DEplain-web-doc) | 23.282 | 0.462 | 63.5 | 49.584 |
| text-simplification-on-deplain-web-doc | long-mBART (trained on DEplain-APA-doc & DEplain-web-doc) | 23.37 | 0.445 | 57.95 | 49.745 |
| text-simplification-on-deplain-web-sent | mBART (trained on DEplain-APA-sent & DEplain-web-sent) | 17.88 | 0.436 | 65.249 | 34.828 |
| text-simplification-on-deplain-web-sent | mBART (trained on DEplain-APA-sent) | 15.727 | 0.413 | 64.516 | 30.867 |
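
The FRE column reports Flesch Reading Ease, where higher values indicate easier text. As a rough illustration of what such a score measures, the sketch below computes the German adaptation of the formula (Amstad's variant, 180 - ASL - 58.5 * ASW, with ASL the average sentence length in words and ASW the average syllables per word). It is an assumption that this variant matches the one behind the numbers above, and the vowel-group syllable counter is a crude heuristic introduced here for illustration only; the function names are hypothetical and not taken from the DEplain repository.

```python
import re

# Heuristic (assumption): one syllable per run of consecutive vowels.
# Real German syllabification is more involved; this is only an approximation.
VOWEL_GROUP = re.compile(r"[aeiouyäöü]+", re.IGNORECASE)
WORD = re.compile(r"[A-Za-zÄÖÜäöüß]+")
SENTENCE_END = re.compile(r"[.!?]+")


def count_syllables(word: str) -> int:
    """Approximate the syllable count of a word, with a minimum of one."""
    return max(1, len(VOWEL_GROUP.findall(word)))


def flesch_reading_ease_de(text: str) -> float:
    """Amstad's German Flesch Reading Ease: 180 - ASL - 58.5 * ASW."""
    sentences = [s for s in SENTENCE_END.split(text) if s.strip()]
    words = WORD.findall(text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    asl = len(words) / len(sentences)  # average sentence length in words
    asw = sum(count_syllables(w) for w in words) / len(words)  # syllables per word
    return 180.0 - asl - 58.5 * asw


if __name__ == "__main__":
    plain = "Das ist ein Haus. Es ist alt."
    complex_text = "Die Bundesregierung verabschiedete umfangreiche Gesetzesänderungen."
    print(flesch_reading_ease_de(plain))
    print(flesch_reading_ease_de(complex_text))
```

Short sentences made of short words score higher than a single sentence packed with long compounds, which is the direction a simplification system is expected to move its output. Scores from this sketch are not comparable with the table, since tokenization and syllable counting differ between implementations.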
