
Abstract
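Leaderboards such as SuperGLUE are seen as an important incentive for the continued development of natural language processing (NLP), because they provide standard benchmarks for fair comparison of modern language models. They have driven the world's best engineering teams, and their resources, to work together on a set of tasks designed to evaluate general language understanding. The performance scores of their models are often claimed to be close to, or even above, human level. These claims have prompted closer analysis of whether the benchmark datasets contain statistical cues that machine-learning-based language models can exploit. For English datasets, studies have shown that they often contain annotation artifacts, which make it possible to solve certain tasks with very simple rules and still achieve competitive rankings. In this paper, we carry out a similar analysis of the recently published Russian SuperGLUE (RSG), a benchmark set and leaderboard for Russian natural language understanding. We show that its test datasets are highly vulnerable to shallow heuristics. Approaches based on simple rules often come close to well-known pre-trained language models such as GPT-3 and BERT, and outperform them on some tasks. The simplest explanation is that a large part of the performance of the state-of-the-art (SOTA) models on the RSG leaderboard likely comes from exploiting these shallow heuristic cues, that is, statistical patterns in the data, rather than from genuine language understanding such as semantics and reasoning. Based on these findings, we offer a set of recommendations for improving the design of the RSG datasets and reducing annotation artifacts and statistical bias, so that the leaderboard better reflects real progress in Russian natural language understanding.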
Benchmarks
| Benchmark | Method | Metric |
|---|---|---|
| common-sense-reasoning-on-parus | majority_class | Accuracy: 0.498 | 
| common-sense-reasoning-on-parus | heuristic majority | Accuracy: 0.478 | 
| common-sense-reasoning-on-parus | Random weighted | Accuracy: 0.48 | 
| common-sense-reasoning-on-rucos | majority_class | Average F1: 0.25, EM: 0.247 |
| common-sense-reasoning-on-rucos | heuristic majority | Average F1: 0.26, EM: 0.257 |
| common-sense-reasoning-on-rucos | Random weighted | Average F1: 0.25, EM: 0.247 |
| common-sense-reasoning-on-rwsd | heuristic majority | Accuracy: 0.669 | 
| common-sense-reasoning-on-rwsd | Random weighted | Accuracy: 0.597 | 
| common-sense-reasoning-on-rwsd | majority_class | Accuracy: 0.669 | 
| natural-language-inference-on-lidirus | majority_class | MCC: 0 | 
| natural-language-inference-on-lidirus | Random weighted | MCC: 0 | 
| natural-language-inference-on-lidirus | heuristic majority | MCC: 0.147 | 
| natural-language-inference-on-rcb | heuristic majority | Accuracy: 0.438, Average F1: 0.4 |
| natural-language-inference-on-rcb | Random weighted | Accuracy: 0.374, Average F1: 0.319 |
| natural-language-inference-on-rcb | majority_class | Accuracy: 0.484, Average F1: 0.217 |
| natural-language-inference-on-terra | Random weighted | Accuracy: 0.483 | 
| natural-language-inference-on-terra | heuristic majority | Accuracy: 0.549 | 
| natural-language-inference-on-terra | majority_class | Accuracy: 0.513 | 
| question-answering-on-danetqa | majority_class | Accuracy: 0.503 | 
| question-answering-on-danetqa | Random weighted | Accuracy: 0.52 | 
| question-answering-on-danetqa | heuristic majority | Accuracy: 0.642 | 
| reading-comprehension-on-muserc | Random weighted | Average F1: 0.45, EM: 0.071 |
| reading-comprehension-on-muserc | heuristic majority | Average F1: 0.671, EM: 0.237 |
| reading-comprehension-on-muserc | majority_class | Average F1: 0.0, EM: 0.0 |
| word-sense-disambiguation-on-russe | heuristic majority | Accuracy: 0.595 | 
| word-sense-disambiguation-on-russe | majority_class | Accuracy: 0.587 | 
| word-sense-disambiguation-on-russe | Random weighted | Accuracy: 0.528 |
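
The table does not define its method names. As a rough illustration only, the sketch below assumes that `majority_class` always predicts the most frequent training label and that `Random weighted` samples labels in proportion to their training frequency. These are common readings of such baselines, not a confirmed description of the authors' implementation. The function names and the toy labels are hypothetical and are not RSG data. The `heuristic majority` method is not sketched, because the heuristics themselves are not described here.

```python
import random
from collections import Counter


def majority_class_baseline(train_labels, n_test):
    """Predict the most frequent training label for every test item (assumed reading)."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    return [most_common_label] * n_test


def random_weighted_baseline(train_labels, n_test, seed=0):
    """Sample each prediction from the empirical training label distribution (assumed reading)."""
    counts = Counter(train_labels)
    labels = sorted(counts)
    weights = [counts[label] for label in labels]
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return rng.choices(labels, weights=weights, k=n_test)


def accuracy(predictions, gold):
    """Fraction of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


# Toy labels invented for illustration; they are not taken from any RSG task.
train = ["entailment"] * 6 + ["not_entailment"] * 4
gold = ["entailment", "not_entailment", "entailment", "entailment"]

majority_preds = majority_class_baseline(train, len(gold))
weighted_preds = random_weighted_baseline(train, len(gold))

print(accuracy(majority_preds, gold))  # 0.75
```

Neither baseline reads the input text, so any learned model that scores close to them on a task has shown little evidence of language understanding on that task.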