Jakub Náplava; Milan Straka; Jana Straková

Abstract
We propose a new architecture for diacritics restoration based on contextualized embeddings, namely BERT, and evaluate it on 12 languages with diacritics. Furthermore, we conduct a detailed error analysis on Czech, a morphologically rich language with a high level of diacritization. Notably, we manually annotate all mispredictions, showing that roughly 44% of them are not actual errors but either plausible variants (19%) or system corrections of erroneous data (25%). Finally, we categorize the real errors in detail. We release the code at https://github.com/ufal/bert-diacritics-restoration.
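To make the task concrete, the following minimal sketch shows one common way to produce the undiacritized input that a restoration system must invert. It is an illustration only, not the authors' preprocessing: the helper name `strip_diacritics` is hypothetical, and the approach (Unicode decomposition followed by dropping combining marks) is an assumption that does not cover letters without a canonical decomposition, such as `đ` or `ł`.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks after canonical decomposition (hypothetical helper)."""
    # NFD splits a character such as "č" into the base letter "c" plus U+030C.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only characters that are not combining marks.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever is left into canonical composed form.
    return unicodedata.normalize("NFC", stripped)


if __name__ == "__main__":
    print(strip_diacritics("Příliš žluťoučký kůň"))
```

A restoration model is then trained or evaluated on pairs of stripped and original text, where it has to recover the original from the stripped form.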
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| croatian-text-diacritization-on-multilingual | BERT | Alpha-Word accuracy: 99.73 |
| czech-text-diacritization-on-multilingual | BERT | Alpha-Word accuracy: 99.22 |
| french-text-diacritization-on-multilingual | BERT | Alpha-Word accuracy: 99.71 |
| hungarian-text-diacritization-on-multilingual | BERT | Alpha-Word accuracy: 99.41 |
| irish-text-diacritization-on-multilingual | BERT | Alpha-Word accuracy: 98.88 |
| latvian-text-diacritization-on-multilingual | BERT | Alpha-Word accuracy: 98.63 |
| romanian-text-diacritization-on-multilingual | BERT | Alpha-Word accuracy: 98.64 |
| slovak-text-diacritization-on-multilingual | BERT | Alpha-Word accuracy: 99.32 |
| spanish-text-diacritization-on-multilingual | BERT | Alpha-Word accuracy: 99.62 |
| turkish-text-diacritization-on-multilingual | BERT | Alpha-Word accuracy: 98.95 |
| vietnamese-text-diacritization-on | BERT | Alpha-Word accuracy: 98.53 |
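The metric in the table is not defined on this page. The sketch below assumes that alpha-word accuracy is the percentage of correctly restored words among those containing at least one alphabetic character, so that punctuation and pure numbers do not inflate the score. The function name `alpha_word_accuracy`, the whitespace tokenization, and the handling of empty input are all assumptions made for illustration.

```python
def alpha_word_accuracy(gold: str, predicted: str) -> float:
    """Percentage of alphabetic words restored exactly (assumed definition)."""
    gold_words = gold.split()
    pred_words = predicted.split()
    # Restoration changes characters only, so the token counts should match.
    if len(gold_words) != len(pred_words):
        raise ValueError("gold and predicted text must have the same number of words")
    # Score only words that contain at least one alphabetic character.
    pairs = [
        (g, p)
        for g, p in zip(gold_words, pred_words)
        if any(ch.isalpha() for ch in g)
    ]
    if not pairs:
        return 0.0  # assumption: no scorable words yields 0.0
    correct = sum(1 for g, p in pairs if g == p)
    return 100.0 * correct / len(pairs)


if __name__ == "__main__":
    # "3" and "." are ignored; one of the two alphabetic words is wrong.
    print(alpha_word_accuracy("Máme 3 kočky .", "Máme 3 kocky ."))
```

Under this assumed definition, a score of 99.22 would mean that fewer than one in a hundred alphabetic words is restored incorrectly.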