| Model | Number of params | Test perplexity | Validation perplexity | Paper / Source | Code |
| --- | --- | --- | --- | --- | --- |
| Grave et al. (2016) - LSTM | - | 99.3 | - | Improving Neural Language Models with a Continuous Cache | |
| Inan et al. (2016) - Variational LSTM (tied) (h=650) | - | 87.7 | 92.3 | Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling | |
| Inan et al. (2016) - Variational LSTM (tied) (h=650) + augmented loss | - | 87.0 | 91.5 | Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling | |
| Grave et al. (2016) - LSTM + continuous cache pointer | - | 68.9 | - | Improving Neural Language Models with a Continuous Cache | |
| Melis et al. (2017) - 1-layer LSTM (tied) | 24M | 65.9 | 69.3 | On the State of the Art of Evaluation in Neural Language Models | |
| Zolna et al. (2018) - AWD-LSTM 3-layer with Fraternal dropout | 34M | 64.1 | 66.8 | Fraternal Dropout | |
| Schlag et al. (2020) - AWD-FWM | 37M | 61.65 | 54.48 | Learning Associative Inference Using Fast Weight Memory | |
| Press (2019) - AWD-LSTM-MoS + Partial Shuffle | 35M | 59.98 | 62.38 | Partially Shuffling the Training Data to Improve Language Models | |
| Press (2019) - AWD-LSTM-DOC + Partial Shuffle | 37M | 57.85 | 60.16 | Partially Shuffling the Training Data to Improve Language Models | |
| Merity et al. (2017) - AWD-LSTM + continuous cache pointer | 33M | 52.0 | 53.8 | Regularizing and Optimizing LSTM Language Models | |
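All scores above are word-level perplexities: the exponential of a model's average per-token negative log-likelihood on the held-out data, so lower is better. A minimal sketch of that conversion (the `perplexity` helper and the example numbers are illustrative, not taken from any of the cited codebases):

```python
import math

def perplexity(mean_nll: float) -> float:
    """Convert a mean per-token negative log-likelihood (in nats) to perplexity."""
    return math.exp(mean_nll)

# Example: an average cross-entropy of ~3.95 nats per word corresponds to
# a perplexity of ~52, roughly the best test score in the table above.
print(perplexity(3.95))  # ~51.9
```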