
Abstract

Analyzing the sentiment expressed in consumer reviews provides rich insight into product quality. Although sentiment analysis has been studied extensively for many major languages, research on Bangla remains comparatively scarce, largely due to a lack of data and poor cross-domain adaptability. To address this limitation, this paper introduces BanglaBook, a large-scale dataset of Bangla book reviews comprising 158,065 samples labeled as positive, negative, or neutral. We present a detailed statistical analysis of the dataset and establish baselines with a range of machine-learning models, including support vector machines (SVM), long short-term memory networks (LSTM), and the pretrained language model Bangla-BERT. Our results show that pretrained models significantly outperform models that rely on manual feature engineering, underscoring the need for further training resources in this domain. In addition, we conduct an in-depth error analysis of sentiment unigrams, offering possible explanations for common misclassifications in low-resource languages such as Bangla. The code and dataset are publicly available at https://github.com/mohsinulkabir14/BanglaBook.
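To make the classical baselines concrete, here is a minimal pure-Python sketch of one of the simplest models benchmarked below, Multinomial Naive Bayes over bag-of-words features with Laplace smoothing. The toy English reviews and labels are invented for illustration only; the actual baselines are trained on the Bangla reviews in the dataset.

```python
import math
from collections import Counter, defaultdict

def train_mnb(docs, labels, alpha=1.0):
    """Train Multinomial Naive Bayes over bag-of-words counts.

    alpha is the Laplace (add-alpha) smoothing constant.
    """
    vocab = set()
    word_counts = defaultdict(Counter)   # class -> token counts
    class_counts = Counter(labels)       # class -> number of documents
    for doc, label in zip(docs, labels):
        tokens = doc.split()
        vocab.update(tokens)
        word_counts[label].update(tokens)
    n_docs = len(docs)
    return {
        "vocab": vocab,
        "log_prior": {c: math.log(n / n_docs) for c, n in class_counts.items()},
        "word_counts": word_counts,
        "total": {c: sum(word_counts[c].values()) for c in class_counts},
        "alpha": alpha,
    }

def predict_mnb(model, doc):
    """Return the class with the highest posterior log-probability."""
    vocab_size = len(model["vocab"])
    best, best_lp = None, float("-inf")
    for cls, log_prior in model["log_prior"].items():
        score = log_prior
        for tok in doc.split():
            if tok not in model["vocab"]:
                continue  # ignore out-of-vocabulary tokens
            num = model["word_counts"][cls][tok] + model["alpha"]
            den = model["total"][cls] + model["alpha"] * vocab_size
            score += math.log(num / den)
        if score > best_lp:
            best, best_lp = cls, score
    return best

docs = ["loved this book", "great story loved it",
        "boring waste of time", "boring plot"]
labels = ["positive", "positive", "negative", "negative"]
model = train_mnb(docs, labels)
print(predict_mnb(model, "loved the story"))  # -> "positive"
```

The n-gram variants in the table replace the whitespace tokens above with word or character n-gram features, but the training and scoring logic is the same.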
Code Repository

mohsinulkabir14/banglabook (official)
Benchmarks

| Benchmark | Method | Metric |
|---|---|---|
| sentiment-analysis-on-banglabook | Logistic Regression (word 2-gram + word 3-gram) | Weighted Average F1-score: 0.8964 |
| sentiment-analysis-on-banglabook | Random Forest (word 1-gram) | Weighted Average F1-score: 0.9043 |
| sentiment-analysis-on-banglabook | Bangla-BERT (base-uncased) | Weighted Average F1-score: 0.9064 |
| sentiment-analysis-on-banglabook | XGBoost (word 2-gram + word 3-gram) | Weighted Average F1-score: 0.8651 |
| sentiment-analysis-on-banglabook | Random Forest (word 2-gram + word 3-gram) | Weighted Average F1-score: 0.9106 |
| sentiment-analysis-on-banglabook | LSTM (GloVe) | Weighted Average F1-score: 0.0991 |
| sentiment-analysis-on-banglabook | Multinomial NB (word 2-gram + word 3-gram) | Weighted Average F1-score: 0.8663 |
| sentiment-analysis-on-banglabook | Multinomial NB (BoW) | Weighted Average F1-score: 0.8564 |
| sentiment-analysis-on-banglabook | Bangla-BERT (large) | Weighted Average F1-score: 0.9331 |
| sentiment-analysis-on-banglabook | Logistic Regression (char 2-gram + char 3-gram) | Weighted Average F1-score: 0.8978 |
| sentiment-analysis-on-banglabook | SVM (word 1-gram) | Weighted Average F1-score: 0.8519 |
| sentiment-analysis-on-banglabook | SVM (word 2-gram + word 3-gram) | Weighted Average F1-score: 0.9053 |
| sentiment-analysis-on-banglabook | XGBoost (char 2-gram + char 3-gram) | Weighted Average F1-score: 0.8723 |
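Every entry above reports the weighted average F1-score, i.e. the per-class F1 weighted by each class's support (its number of true samples), which is appropriate for the dataset's imbalanced positive/negative/neutral distribution. A minimal sketch of that metric, with invented toy labels:

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Weighted-average F1: per-class F1 weighted by true-class support."""
    support = Counter(y_true)
    total = 0.0
    for cls, n_cls in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n_cls - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        total += f1 * n_cls  # weight each class F1 by its support
    return total / len(y_true)

y_true = ["pos", "pos", "neg", "neu"]
y_pred = ["pos", "neg", "neg", "neu"]
print(weighted_f1(y_true, y_pred))  # -> 0.75 (up to float rounding)
```

This matches scikit-learn's `f1_score(..., average="weighted")`, which is the usual way such leaderboard numbers are computed.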