
Abstract
Long short-term memory networks (LSTMs) and other recurrent neural network (RNN) variants perform strongly on character-level language modeling. These models are typically trained with truncated backpropagation through time, and their success is commonly attributed to their ability to remember long-term contexts. In this paper, we show that a deep (64-layer) transformer model with a fixed context substantially outperforms RNN variants, achieving state of the art on two popular benchmarks: 1.13 bits per character on text8 and 1.06 on enwik8. To get good results at this depth, we show it is important to add auxiliary losses, both at intermediate network layers and at intermediate sequence positions.
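The auxiliary-loss idea from the abstract is easy to sketch in PyTorch (the framework listed under the code repository below). The following is a minimal, illustrative sketch, not the paper's released implementation: names such as CharTransformerLM, training_loss, the shared prediction head, and the aux_weight value are assumptions, and the paper additionally schedules and weights its auxiliary losses in ways not shown here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CharTransformerLM(nn.Module):
    """Causal character-level transformer that returns logits from every layer,
    so auxiliary losses can be attached to intermediate layers (a sketch)."""

    def __init__(self, vocab_size=256, d_model=512, n_heads=8, n_layers=12, context=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Parameter(torch.zeros(1, context, d_model))  # learned positions
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, batch_first=True)
             for _ in range(n_layers)]
        )
        self.head = nn.Linear(d_model, vocab_size)  # shared next-character head

    def forward(self, x):
        t = x.size(1)
        # Boolean causal mask: True marks future positions that may not be attended to.
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.embed(x) + self.pos[:, :t]
        per_layer_logits = []
        for layer in self.layers:
            h = layer(h, src_mask=mask)
            per_layer_logits.append(self.head(h))  # predictions from this layer
        return per_layer_logits


def training_loss(per_layer_logits, targets, aux_weight=0.5):
    """Main loss on the final layer plus down-weighted auxiliary losses on
    intermediate layers; every sequence position contributes a target."""
    loss = F.cross_entropy(per_layer_logits[-1].flatten(0, 1), targets.flatten())
    for logits in per_layer_logits[:-1]:
        loss = loss + aux_weight * F.cross_entropy(logits.flatten(0, 1), targets.flatten())
    return loss
```

Given a batch of character ids of shape (batch, context) and next-character targets of the same shape, the model returns one logits tensor per layer and training_loss combines them, so gradients reach the lower layers directly rather than only through the final prediction.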
Code Repositories
facebookresearch/code-prediction-transformer
pytorch
Mentioned in GitHub
Benchmarks
| Benchmark | Method | Bit per Character (BPC) | Number of params |
|---|---|---|---|
| language-modelling-on-enwiki8 | 12-layer Character Transformer Model | 1.11 | 44M |
| language-modelling-on-enwiki8 | Transformer (64 layers) | 1.06 | 235M |
| language-modelling-on-hutter-prize | 64-layer Character Transformer Model | 1.06 | 235M |
| language-modelling-on-hutter-prize | 12-layer Character Transformer Model | 1.11 | 44M |
| language-modelling-on-text8 | 12-layer Character Transformer Model | 1.18 | 44M |
| language-modelling-on-text8 | 64-layer Character Transformer Model | 1.13 | 235M |
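For reference, the BPC metric in the table is the model's average next-character cross-entropy expressed in bits, i.e. the loss in nats divided by ln 2. A minimal conversion sketch (the 0.735 nats/char figure is only an illustrative value, not a number reported by the paper):

```python
import math

def bits_per_character(mean_cross_entropy_nats: float) -> float:
    # Convert an average per-character cross-entropy from nats to bits.
    return mean_cross_entropy_nats / math.log(2)

print(bits_per_character(0.735))  # ~1.06 BPC
```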