Teófilo Emídio de Campos, Pedro Henrique Luz de Araujo, Nilton Correia da Silva, Fabricio Ataides Braz

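Abstract
This paper introduces VICTOR, a novel dataset built from digitized legal documents of Brazil's Supreme Court. It comprises more than 45 thousand appeals, covering roughly 692 thousand documents and about 4.6 million pages in total. The dataset contains labeled text data and supports two tasks: document type classification and theme assignment, a multi-label classification problem. We report baseline experiments using bag-of-words models, convolutional neural networks, recurrent neural networks, and boosting algorithms. We also experiment with linear-chain Conditional Random Fields to exploit the sequential nature of lawsuit files, and find that this improves performance on document type classification. Finally, we compare two theme classification strategies: one that uses domain knowledge to filter out the less informative document pages, and the default one that uses all pages. Contrary to the Court experts' expectations, we find that using all available data performs better. To encourage the exploration of better models and techniques, we release the dataset publicly in three versions of different sizes and contents.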
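To illustrate how a linear-chain model can exploit the page order of a lawsuit file, here is a minimal sketch of Viterbi decoding over per-page label scores. It covers only the decoding step and is not the authors' implementation. The function name `viterbi`, the two-label setup, and all scores below are hypothetical values chosen for illustration.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring label sequence under a linear-chain model.

    emissions:   one list per page, holding a log-score for each label
                 (for example, produced by a per-page classifier).
    transitions: transitions[i][j] is the log-score of label i on one page
                 being followed by label j on the next page.
    """
    n_labels = len(emissions[0])
    score = list(emissions[0])   # best score of any path ending in each label
    backpointers = []            # best previous label, per page and label

    for page_scores in emissions[1:]:
        new_score, pointers = [], []
        for j in range(n_labels):
            best_i = max(range(n_labels),
                         key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + page_scores[j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)

    # Walk the backpointers from the best final label to recover the path.
    path = [max(range(n_labels), key=lambda j: score[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical scores: three pages, two document types (0 and 1).
# Taken alone, the middle page slightly prefers label 1.
emissions = [[0.0, -2.0], [-1.0, -0.8], [0.0, -2.0]]
# Hypothetical transitions: consecutive pages tend to share a label.
transitions = [[0.0, -1.5], [-1.5, 0.0]]

independent = [max(range(2), key=lambda j: page[j]) for page in emissions]
print(independent)                      # [0, 1, 0]
print(viterbi(emissions, transitions))  # [0, 0, 0]
```

In this toy case the transition scores override the weak per-page preference on the middle page, which is the kind of sequential evidence a per-page classifier cannot use on its own.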
Benchmarks
| Benchmark | Method | Average F1 | Weighted F1 |
|---|---|---|---|
| multi-label-text-classification-on-bvictor | XGBoost | 0.8843 | 0.8957 |
| multi-label-text-classification-on-bvictor | SVM | 0.7761 | 0.8235 |
| multi-label-text-classification-on-bvictor | NB | 0.6335 | 0.6955 |
| multi-label-text-classification-on-mvictor | SVM | 0.6642 | 0.8137 |
| multi-label-text-classification-on-mvictor | NB | 0.3797 | 0.6062 |
| multi-label-text-classification-on-mvictor | XGBoost | 0.8882 | 0.9072 |
| multi-label-text-classification-on-svictor | SVM | 0.8246 | 0.8231 |
| multi-label-text-classification-on-svictor | NB | 0.5121 | 0.4875 |
| multi-label-text-classification-on-svictor | XGBoost | 0.8887 | 0.8634 |
| text-classification-on-mvictor-type | BiLSTM | 0.7092 | 0.9433 |
| text-classification-on-mvictor-type | CNN | 0.7061 | 0.9464 |
| text-classification-on-mvictor-type | SVM | 0.6792 | 0.9288 |
| text-classification-on-mvictor-type | CNN + CRF | 0.7505 | 0.9537 |
| text-classification-on-mvictor-type | NB | 0.4772 | 0.8477 |
| text-classification-on-svictor-type | SVM | 0.7632 | 0.9425 |
| text-classification-on-svictor-type | BiLSTM | 0.7281 | 0.9465 |
| text-classification-on-svictor-type | NB | 0.5979 | 0.8893 |
| text-classification-on-svictor-type | CNN + CRF | 0.7740 | 0.9533 |
| text-classification-on-svictor-type | CNN | 0.7584 | 0.9472 |
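
The two metrics in the table can diverge considerably, for instance 0.7505 against 0.9537 for CNN + CRF on text-classification-on-mvictor-type. The sketch below shows one way such a gap can arise. It assumes that "Average F1" is the unweighted mean of per-class F1 scores and that "Weighted F1" weights each class by its number of true examples; the page does not define either metric, so both readings are assumptions. The label names and predictions are hypothetical.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} for single-label predictions."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def average_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed meaning of 'Average F1')."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Per-class F1 weighted by class support (assumed meaning of 'Weighted F1')."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] / len(y_true) for label in scores)


# Hypothetical imbalanced sample: 8 pages of a frequent type, 2 of a rare one.
y_true = ["other"] * 8 + ["ruling"] * 2
# The classifier gets every frequent page right but misses one rare page.
y_pred = ["other"] * 8 + ["other", "ruling"]

print(round(average_f1(y_true, y_pred), 3))   # 0.804
print(round(weighted_f1(y_true, y_pred), 3))  # 0.886
```

Under these assumptions, a weighted score well above the average score suggests that errors are concentrated in the less frequent classes.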