Diacritics Restoration using BERT with Analysis on Czech language

Jakub Náplava; Milan Straka; Jana Straková

Abstract

We propose a new architecture for diacritics restoration based on contextualized embeddings, namely BERT, and we evaluate it on 12 languages with diacritics. Furthermore, we conduct a detailed error analysis on Czech, a morphologically rich language with a high level of diacritization. Notably, we manually annotate all mispredictions, showing that roughly 44% of them are actually not errors, but either plausible variants (19%), or the system corrections of erroneous data (25%). Finally, we categorize the real errors in detail. We release the code at https://github.com/ufal/bert-diacritics-restoration.
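
To make the task concrete, here is a minimal sketch of the inverse operation: stripping diacritics from correctly written text. This is a common way to produce input for a restoration system, but it is an illustration only and is not taken from the authors' repository. The function name `strip_diacritics` is hypothetical, and the approach assumes that every diacritic of interest decomposes under Unicode NFD normalization.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks from text, keeping the base characters.

    Illustrative helper (not from the paper's code). Letters that have no
    canonical decomposition, such as 'đ', 'ł' or the Turkish dotless 'ı',
    pass through unchanged and would need an explicit mapping.
    """
    # NFD splits a precomposed letter into base character + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every combining mark (non-zero canonical combining class).
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever remains into the usual precomposed form.
    return unicodedata.normalize("NFC", stripped)


if __name__ == "__main__":
    print(strip_diacritics("Příliš žluťoučký kůň"))  # prints "Prilis zlutoucky kun"
```

A restoration model is then asked to map the stripped string back to the original, which is ambiguous whenever several valid words share the same stripped form.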

Benchmarks

Benchmark                                         Methodology  Metrics
croatian-text-diacritization-on-multilingual      BERT         Alpha-Word accuracy: 99.73
czech-text-diacritization-on-multilingual         BERT         Alpha-Word accuracy: 99.22
french-text-diacritization-on-multilingual        BERT         Alpha-Word accuracy: 99.71
hungarian-text-diacritization-on-multilingual     BERT         Alpha-Word accuracy: 99.41
irish-text-diacritization-on-multilingual         BERT         Alpha-Word accuracy: 98.88
latvian-text-diacritization-on-multilingual       BERT         Alpha-Word accuracy: 98.63
romanian-text-diacritization-on-multilingual      BERT         Alpha-Word accuracy: 98.64
slovak-text-diacritization-on-multilingual        BERT         Alpha-Word accuracy: 99.32
spanish-text-diacritization-on-multilingual       BERT         Alpha-Word accuracy: 99.62
turkish-text-diacritization-on-multilingual       BERT         Alpha-Word accuracy: 98.95
vietnamese-text-diacritization-on                 BERT         Alpha-Word accuracy: 98.53
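
The page does not define the metric reported above. The sketch below assumes the definition commonly used in the diacritization literature: accuracy computed only over words that contain at least one alphabetic character, so that numbers and punctuation tokens do not inflate the score. The function name `alpha_word_accuracy` and the whitespace tokenization are assumptions made for illustration, not the authors' evaluation script.

```python
def alpha_word_accuracy(gold: str, predicted: str) -> float:
    """Percentage of alphabetic words restored exactly.

    Assumes both strings tokenize into the same number of
    whitespace-separated words (diacritization does not change word count).
    """
    gold_words = gold.split()
    pred_words = predicted.split()
    if len(gold_words) != len(pred_words):
        raise ValueError("gold and predicted text must have the same number of words")

    # Keep only positions whose gold word has at least one alphabetic character.
    pairs = [
        (g, p)
        for g, p in zip(gold_words, pred_words)
        if any(ch.isalpha() for ch in g)
    ]
    if not pairs:
        raise ValueError("no alphabetic words to evaluate")

    correct = sum(1 for g, p in pairs if g == p)
    return 100.0 * correct / len(pairs)


if __name__ == "__main__":
    gold = "Malý kůň běží 5 km ."
    pred = "Malý kun běží 5 km ."
    # Four alphabetic words, three restored correctly.
    print(alpha_word_accuracy(gold, pred))  # prints 75.0
```

Under this definition, the tokens "5" and "." in the example are ignored, so a system is scored only on words where diacritics could matter.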
