
Abstract
We propose a new architecture for diacritics restoration based on contextualized embeddings, namely BERT, and evaluate it on 12 languages with diacritics. Furthermore, we conduct a detailed error analysis on Czech, a morphologically rich language with a high frequency of diacritics. Notably, we manually annotated all mispredictions and found that roughly 44% of them are not actual errors, but either plausible variants (19%) or the system's corrections of erroneous data (25%). Finally, we categorize the real errors in detail. We release the code at https://github.com/ufal/bert-diacritics-restoration.
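The sketch below illustrates the general idea of using contextualized (multilingual BERT) embeddings for diacritics restoration as a token-level classification task. It is not the authors' exact architecture: the backbone name, label set, and classification head are illustrative assumptions only.

```python
# Minimal sketch: diacritics restoration as per-token classification over
# contextualized BERT embeddings. Illustrative only, not the paper's model.
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-multilingual-cased"  # assumed backbone

# Hypothetical label set: for each diacritics-stripped subword, predict which
# diacritics to re-insert. The paper operates at a finer granularity; this is
# kept schematic here.
LABELS = ["KEEP", "ADD_ACUTE", "ADD_CARON", "ADD_RING"]  # illustrative only


class DiacriticsRestorer(nn.Module):
    """BERT encoder + linear classifier over per-token diacritization labels."""

    def __init__(self, num_labels: int = len(LABELS)):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(MODEL_NAME)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state  # (batch, seq_len, hidden_size)
        return self.classifier(hidden)  # per-token label logits


if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = DiacriticsRestorer()
    # Czech input with diacritics stripped: "příliš žluťoučký kůň"
    batch = tokenizer("prilis zlutoucky kun", return_tensors="pt")
    logits = model(batch["input_ids"], batch["attention_mask"])
    print(logits.shape)  # (1, seq_len, num_labels)
```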
Code Repositories
ufal/bert-diacritics-restoration
Official
pytorch
Benchmarks
| Benchmark | Method | Metric |
|---|---|---|
| croatian-text-diacritization-on-multilingual | BERT | Alpha-Word accuracy: 99.73 |
| czech-text-diacritization-on-multilingual | BERT | Alpha-Word accuracy: 99.22 |
| french-text-diacritization-on-multilingual | BERT | Alpha-Word accuracy: 99.71 |
| hungarian-text-diacritization-on-multilingual | BERT | Alpha-Word accuracy: 99.41 |
| irish-text-diacritization-on-multilingual | BERT | Alpha-Word accuracy: 98.88 |
| latvian-text-diacritization-on-multilingual | BERT | Alpha-Word accuracy: 98.63 |
| romanian-text-diacritization-on-multilingual | BERT | Alpha-Word accuracy: 98.64 |
| slovak-text-diacritization-on-multilingual | BERT | Alpha-Word accuracy: 99.32 |
| spanish-text-diacritization-on-multilingual | BERT | Alpha-Word accuracy: 99.62 |
| turkish-text-diacritization-on-multilingual | BERT | Alpha-Word accuracy: 98.95 |
| vietnamese-text-diacritization-on | BERT | Alpha-Word accuracy: 98.53 |
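The metric reported above, alpha-word accuracy, is word-level accuracy restricted to words containing alphabetic characters. Below is a rough sketch of how such a metric could be computed, assuming whitespace tokenization and exact-match comparison; the official evaluation script for these numbers may differ.

```python
def alpha_word_accuracy(predicted: str, reference: str) -> float:
    """Word-level accuracy over 'alpha words' only, i.e. whitespace tokens
    containing at least one alphabetic character.

    Assumed reading of the metric (whitespace tokenization, exact match);
    not necessarily the evaluation used for the benchmark numbers above.
    """
    pred_tokens = predicted.split()
    ref_tokens = reference.split()
    correct = 0
    total = 0
    for pred, ref in zip(pred_tokens, ref_tokens):
        if not any(ch.isalpha() for ch in ref):
            continue  # skip punctuation- or number-only tokens
        total += 1
        if pred == ref:
            correct += 1
    return correct / total if total else 0.0


# Example: one of three alphabetic words is restored incorrectly -> 0.666...
print(alpha_word_accuracy("příliš žluťoučký kůň .", "příliš žlutoučký kůň ."))
```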