Marcelo Magalhães Silva de Sousa, Teófilo Emidio de Campos, Pedro Henrique Luz de Araujo
Abstract
Official Gazettes are a rich source of information relevant to the public. Their careful examination may lead to the detection of fraud and irregularities, which may help prevent mismanagement of public funds. This paper presents a dataset composed of documents from the Official Gazette of the Federal District, containing both samples annotated with their document source and unlabeled samples. We train, evaluate and compare a transfer-learning-based model that uses ULMFiT with traditional bag-of-words models that use SVM and Naive Bayes as classifiers. We find the SVM to be competitive: it performs only marginally worse than ULMFiT while training and running inference much faster and at lower computational cost. Finally, we conduct an ablation analysis to assess how each ULMFiT component affects performance.
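To make the bag-of-words baseline concrete, the following is a minimal sketch of a multinomial Naive Bayes classifier over word counts, written in plain Python. It is not the authors' implementation: the `NaiveBayes` class, the whitespace tokenizer, the smoothing value `alpha`, and the toy English documents and source labels are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes over bag-of-words counts (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing constant (assumed value)

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # token counts per class
        self.vocab = set()
        for doc, label in zip(docs, labels):
            tokens = tokenize(doc)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        self.total_docs = len(labels)
        return self

    def predict(self, doc):
        # Bag of words: only token counts matter, not token order.
        counts = Counter(t for t in tokenize(doc) if t in self.vocab)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            score = math.log(n_docs / self.total_docs)  # log prior
            total = (sum(self.word_counts[label].values())
                     + self.alpha * len(self.vocab))
            for token, c in counts.items():
                smoothed = self.word_counts[label][token] + self.alpha
                score += c * math.log(smoothed / total)  # log likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data: invented texts and source labels, not taken from the dataset.
docs = [
    "decree appoints new secretary of health",
    "hospital staff vacation schedule approved",
    "school teachers hired for public schools",
    "education budget for school meals",
]
labels = ["health", "health", "education", "education"]

model = NaiveBayes().fit(docs, labels)
prediction = model.predict("new hospital health decree")
```

On this toy data, `model.predict` assigns a document to the source whose training vocabulary it overlaps with most. A real pipeline would need a proper tokenizer and a held-out evaluation split.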
Benchmarks
| Benchmark | Methodology | Average F1 | Weighted F1 |
|---|---|---|---|
| text-classification-on-dodf-data | SVM + tf-idf (no pre-trained vocab) | 0.8755 | 0.8917 |
| text-classification-on-dodf-data | ULMFiT (pre-trained vocab, no gradual unfreezing) | 0.8918 | 0.9257 |
| text-classification-on-dodf-data | SVM + word counts (pre-trained vocab) | 0.8782 | 0.9049 |
| text-classification-on-dodf-data | ULMFiT (pre-trained vocab) | 0.8374 | 0.9088 |
| text-classification-on-dodf-data | ULMFiT (no pre-trained vocab) | 0.8469 | 0.8974 |
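The two metrics reported above can differ noticeably when classes are imbalanced. The sketch below shows one common way to compute them, assuming that "Average F1" denotes the unweighted (macro) mean of per-class F1 scores and "Weighted F1" the mean weighted by class support; the page itself does not define them. The function names and the toy labels are hypothetical.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return a dict mapping each label to its F1 score."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    # Every class counts equally, regardless of how many samples it has.
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    # Each class contributes in proportion to its number of true samples.
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[c] * support[c] for c in scores) / len(y_true)


# Toy, imbalanced example (invented labels, not from the dataset).
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 8 + ["a", "b"]

macro = macro_f1(y_true, y_pred)
weighted = weighted_f1(y_true, y_pred)
```

In this toy case the frequent class is classified well and the rare class poorly, so the weighted score exceeds the macro score. Under the stated assumption, that is the same direction as the gap in every row of the table.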