
Abstract
This paper introduces a novel modification of the Transformer architecture for data-efficient pretraining of language models. The method was evaluated through participation in the BabyLM Challenge, where it won both the Strict and Strict-Small tracks. Our approach allows each Transformer layer to select which previous layers' outputs to process. The empirical results demonstrate the potential of this simple modification and show that not all layers are equally important.
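One plausible reading of "each layer selects which previous layers' outputs to process" is a learned, normalized weighting over all earlier layer outputs, which then forms that layer's input. The sketch below illustrates this idea in PyTorch; the class name, the softmax normalization, and the zero initialization of the mixing weights are illustrative assumptions made here, not details taken from this summary.

```python
import torch
import torch.nn as nn


class LayerSelectiveEncoder(nn.Module):
    """Minimal sketch (assumption): every layer mixes the outputs of all
    previous layers through learned scalar weights instead of reading only
    the output of the layer directly below it."""

    def __init__(self, num_layers: int, hidden_size: int, num_heads: int = 8):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(hidden_size, num_heads, batch_first=True)
            for _ in range(num_layers)
        )
        # One weight per (layer, earlier output) pair; layer i sees outputs 0..i.
        # Zero initialization is an illustrative choice (uniform mix after softmax).
        self.mix_weights = nn.ParameterList(
            nn.Parameter(torch.zeros(i + 1)) for i in range(num_layers)
        )

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        outputs = [embeddings]  # output 0 is the embedding layer
        for layer, weights in zip(self.layers, self.mix_weights):
            alpha = torch.softmax(weights, dim=0)          # normalized selection
            mixed = sum(a * h for a, h in zip(alpha, outputs))
            outputs.append(layer(mixed))
        return outputs[-1]


# Usage: a tiny 4-layer encoder over a batch of 2 sequences of length 16.
model = LayerSelectiveEncoder(num_layers=4, hidden_size=64, num_heads=4)
x = torch.randn(2, 16, 64)
print(model(x).shape)  # torch.Size([2, 16, 64])
```

Because the mixing weights are ordinary parameters, inspecting them after training shows which earlier layers each layer actually relies on, which is one way the claim that "not all layers are equally important" could be examined.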
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| linguistic-acceptability-on-cola | LTG-BERT-base 98M | Accuracy: 82.7 |
| linguistic-acceptability-on-cola | ELC-BERT-base 98M | Accuracy: 82.6 |
| linguistic-acceptability-on-cola | LTG-BERT-small 24M | Accuracy: 77.6 |
| linguistic-acceptability-on-cola | ELC-BERT-small 24M | Accuracy: 76.1 |
| natural-language-inference-on-multinli | ELC-BERT-base 98M (zero init) | Matched: 84.4 Mismatched: 84.5 |
| natural-language-inference-on-multinli | ELC-BERT-small 24M | Matched: 79.2 Mismatched: 79.9 |
| natural-language-inference-on-multinli | LTG-BERT-small 24M | Matched: 78.0 Mismatched: 78.8 |
| natural-language-inference-on-multinli | LTG-BERT-base 98M | Matched: 83.0 Mismatched: 83.4 |
| natural-language-inference-on-rte | LTG-BERT-small 24M | Accuracy: 53.7 |
| natural-language-inference-on-rte | ELC-BERT-small 24M | Accuracy: 55.4 |
| natural-language-inference-on-rte | ELC-BERT-base 98M (zero init) | Accuracy: 63.0 |
| natural-language-inference-on-rte | LTG-BERT-base 98M | Accuracy: 54.7 |