
Abstract
Recent powerful pre-trained language models have achieved remarkable performance on most popular reading comprehension datasets. It is time to introduce more challenging datasets to push the field toward more comprehensive reasoning over text. This paper introduces a new reading comprehension dataset (ReClor) extracted from standardized graduate admission examinations, which requires logical reasoning. As earlier studies suggest, human-annotated datasets usually contain biases, which models often exploit to achieve high accuracy without truly understanding the text. To comprehensively evaluate the logical reasoning ability of models on ReClor, we propose to identify biased data points and separate them into an EASY set, with the rest forming a HARD set. Empirical results show that state-of-the-art models have an outstanding ability to capture the biases contained in the dataset, achieving high accuracy on the EASY set. However, they perform poorly on the HARD set, with accuracy close to random guessing, indicating that more research is needed to substantially improve the logical reasoning ability of current models.
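The EASY/HARD partition described above can be sketched as follows. This is a minimal illustration, not the authors' exact pipeline: it assumes an ablated "option-only" probe model (one that never sees the context or question), and treats data points the probe still answers correctly as biased (EASY). The probe below is a hypothetical stand-in that exploits a superficial cue (longest option).

```python
def split_easy_hard(dataset, option_only_predict):
    """Partition a multiple-choice dataset using an option-only probe model.

    dataset: list of dicts with keys 'options' (list of str) and 'label' (int).
    option_only_predict: callable mapping the option list to a predicted index.
    Examples the probe gets right (without context/question) go to EASY.
    """
    easy, hard = [], []
    for example in dataset:
        pred = option_only_predict(example["options"])
        (easy if pred == example["label"] else hard).append(example)
    return easy, hard


def longest_option_probe(options):
    # Toy stand-in for a trained option-only model: always picks the
    # longest option, a superficial bias some datasets are known to carry.
    return max(range(len(options)), key=lambda i: len(options[i]))


if __name__ == "__main__":
    toy = [
        # Guessable from option length alone -> EASY
        {"options": ["yes", "a much longer correct answer"], "label": 1},
        # Length cue points at the wrong option -> HARD
        {"options": ["no", "a long distractor option here"], "label": 0},
    ]
    easy, hard = split_easy_hard(toy, longest_option_probe)
    print(len(easy), len(hard))  # → 1 1
```

In the paper the probe is a trained neural model rather than a heuristic, but the partition logic is the same: the HARD set is whatever survives ablation of the context and question.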
Code Repositories
yuweihao/reclor
Official
pytorch
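As a minimal sketch of working with the released data: the ReClor files are distributed as JSON, and the field names below ("context", "question", "answers", "label") follow the commonly described release format, but should be checked against the actual download from the repository — treat this schema as an assumption. The sample item is invented for illustration, not a real ReClor question.

```python
import json

# Hypothetical item mimicking the assumed ReClor JSON schema.
SAMPLE = json.dumps([
    {
        "id_string": "demo_0",  # made-up id, not a real ReClor item
        "context": "All birds can fly. Penguins are birds.",
        "question": "Which conclusion is most strongly supported?",
        "answers": ["Penguins can fly.", "Penguins cannot fly.",
                    "Birds are penguins.", "Nothing follows."],
        "label": 0,
    }
])


def load_reclor(json_text):
    """Parse ReClor-style JSON into (context, question, options, label) tuples."""
    return [
        (ex["context"], ex["question"], ex["answers"], ex["label"])
        for ex in json.loads(json_text)
    ]


if __name__ == "__main__":
    examples = load_reclor(SAMPLE)
    context, question, options, label = examples[0]
    print(len(options), label)  # → 4 0
```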
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| logical-reasoning-question-ansering-on-reclor | XLNet-large | Accuracy: 56.0, Accuracy (easy): 75.7, Accuracy (hard): 40.5 |
| logical-reasoning-question-ansering-on-reclor | RoBERTa-large | Accuracy: 55.6, Accuracy (easy): 75.5, Accuracy (hard): 40.0 |
| logical-reasoning-question-ansering-on-reclor | BERT-large | Accuracy: 49.8, Accuracy (easy): 72.0, Accuracy (hard): 32.3 |
| machine-reading-comprehension-on-reclor | BERT-large | Accuracy: 49.8, Accuracy (easy): 72.0, Accuracy (hard): 32.3 |
| machine-reading-comprehension-on-reclor | RoBERTa-large | Accuracy: 55.6, Accuracy (easy): 75.5, Accuracy (hard): 40.0 |
| machine-reading-comprehension-on-reclor | XLNet-large | Accuracy: 56.0, Accuracy (easy): 75.7, Accuracy (hard): 40.5 |
| question-answering-on-reclor | XLNet-large | Accuracy: 56.0, Accuracy (easy): 75.7, Accuracy (hard): 40.5 |
| question-answering-on-reclor | RoBERTa-large | Accuracy: 55.6, Accuracy (easy): 75.5, Accuracy (hard): 40.0 |
| question-answering-on-reclor | BERT-large | Accuracy: 49.8, Accuracy (easy): 72.0, Accuracy (hard): 32.3 |
| reading-comprehension-on-reclor | XLNet-base | Test: 50.4 |
| reading-comprehension-on-reclor | BERT-base | Test: 47.3 |
| reading-comprehension-on-reclor | RoBERTa-base | Test: 48.5 |
| reading-comprehension-on-reclor | XLNet-large | Test: 56.0 |