ThaiLMCut: Unsupervised Pretraining for Thai Word Segmentation
Suteera Seeha, Ivan Bilan, Liliana Mamani Sanchez, Johannes Huber, Michael Matuschek, Hinrich Schütze

Abstract
We propose ThaiLMCut, a semi-supervised approach for Thai word segmentation which utilizes a bi-directional character language model (LM) to leverage useful linguistic knowledge from unlabeled data. After the language model is trained on substantial unlabeled corpora, the weights of its embedding and recurrent layers are transferred to a supervised word segmentation model, which continues fine-tuning them on the word segmentation task. Our experimental results demonstrate that applying the LM always leads to a performance gain, especially when the amount of labeled data is small. In such cases, the F1 score increases by up to 2.02%. Even on a large labeled dataset, a small improvement can still be obtained. The approach also proves very beneficial in out-of-domain settings, with a gain in F1 score of up to 3.13%. Finally, we show that ThaiLMCut outperforms other open-source state-of-the-art models, achieving an F1 score of 98.78% on the standard benchmark, InterBEST2009.
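The transfer step described above can be sketched as follows. This is a minimal, illustrative Python sketch of copying a pretrained LM's embedding and recurrent weights into a segmentation model before fine-tuning; the layer names and dict-based model representation are assumptions for illustration, not the authors' actual implementation.

```python
def transfer_weights(lm_model, seg_model, layers=("char_embedding", "recurrent")):
    """Copy the named layers' weights from the pretrained character LM into
    the word segmentation model; its task-specific layers keep their
    original initialization and everything is fine-tuned afterwards."""
    for name in layers:
        # Deep-copy each weight matrix so later fine-tuning of the
        # segmentation model does not mutate the stored LM weights.
        seg_model[name] = [row[:] for row in lm_model[name]]
    return seg_model

# Toy example with tiny nested-list "weight matrices" (hypothetical values).
lm = {
    "char_embedding": [[0.1, 0.2], [0.3, 0.4]],
    "recurrent": [[0.5, 0.6], [0.7, 0.8]],
    "lm_head": [[9.0, 9.0]],        # LM-specific output layer: not transferred
}
seg = {
    "char_embedding": [[0.0, 0.0], [0.0, 0.0]],
    "recurrent": [[0.0, 0.0], [0.0, 0.0]],
    "seg_output": [[1.0, 1.0]],     # segmentation-specific layer: kept as-is
}

seg = transfer_weights(lm, seg)
```

After the copy, `seg` shares the LM's learned character representations while its segmentation-specific output layer is untouched, which is the starting point for supervised fine-tuning.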
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| thai-word-tokenization-on-best-2010 | ThaiLMCut | F1-Score: 0.9878 |