Neural Grammatical Error Correction Systems with Unsupervised Pre-training on Synthetic Data
Marcin Junczys-Dowmunt, Roman Grundkiewicz, Kenneth Heafield

Abstract
Considerable effort has been made to address the data sparsity problem in neural grammatical error correction. In this work, we propose a simple and surprisingly effective unsupervised synthetic error generation method based on confusion sets extracted from a spellchecker to increase the amount of training data. Synthetic data is used to pre-train a Transformer sequence-to-sequence model, which not only improves over a strong baseline trained on authentic error-annotated data, but also enables the development of a practical GEC system in a scenario where little genuine error-annotated data is available. The developed systems placed first in the BEA19 shared task, achieving 69.47 and 64.24 F$_{0.5}$ in the restricted and low-resource tracks respectively, both on the W&I+LOCNESS test set. On the popular CoNLL 2014 test set, we report state-of-the-art results of 64.16 M$^2$ for the submitted system, and 61.30 M$^2$ for the constrained system trained on the NUCLE and Lang-8 data.
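To make the noising idea concrete, the sketch below illustrates confusion-set-based synthetic error generation of the kind the abstract describes. It is a minimal illustration, not the authors' implementation: the paper derives confusion sets from a spellchecker (e.g., Aspell suggestions), whereas the toy `CONFUSION_SETS` dictionary, the function names, and the error probability here are all illustrative assumptions.

```python
# Minimal sketch of confusion-set-based synthetic error generation.
# The confusion sets below are a toy stand-in; in the paper they are
# extracted from a spellchecker. Names and probabilities are assumptions.
import random

# Toy confusion sets; in practice each word would map to its
# spellchecker-suggested alternatives.
CONFUSION_SETS = {
    "their": ["there", "they're"],
    "then": ["than"],
    "affect": ["effect"],
    "to": ["too", "two"],
}

def add_noise(tokens, p_err=0.15, seed=None):
    """Corrupt a clean sentence into a plausibly erroneous source.

    With probability p_err per token, apply one of: substitute the token
    with a member of its confusion set, delete it, swap it with the next
    token, or duplicate it (as a fallback when no other edit applies).
    """
    rng = random.Random(seed)
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if rng.random() < p_err:
            op = rng.choice(["substitute", "delete", "swap", "duplicate"])
            if op == "substitute" and tok.lower() in CONFUSION_SETS:
                out.append(rng.choice(CONFUSION_SETS[tok.lower()]))
            elif op == "delete":
                pass  # drop the token entirely
            elif op == "swap" and i + 1 < len(tokens):
                out.extend([tokens[i + 1], tok])
                i += 1  # the swapped neighbour is consumed too
            else:
                out.extend([tok, tok])  # duplicate as a fallback
        else:
            out.append(tok)
        i += 1
    return out

if __name__ == "__main__":
    clean = "then they went to their house".split()
    print(" ".join(add_noise(clean, seed=1)))
```

Each noised sentence is paired with its clean original to form a (source, target) training example, so arbitrarily large pre-training corpora can be produced from monolingual text before fine-tuning on authentic error-annotated data.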
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| grammatical-error-correction-on-bea-2019-test | Transformer | F$_{0.5}$: 69.5 |