| Zaremba et al. (2014) - LSTM (medium) | - | 82.7 | 86.2 | Recurrent Neural Network Regularization | |
| Gal & Ghahramani (2016) - Variational LSTM (medium) | - | 79.7 | 81.9 | A Theoretically Grounded Application of Dropout in Recurrent Neural Networks | |
| Zaremba et al. (2014) - LSTM (large) | - | 78.4 | 82.2 | Recurrent Neural Network Regularization | |
| Gal & Ghahramani (2016) - Variational LSTM (large) | - | 75.2 | 77.9 | A Theoretically Grounded Application of Dropout in Recurrent Neural Networks | |
| Inan et al. (2016) - Variational RHN | - | 66.0 | 68.1 | Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling | |
| Zilly et al. (2016) - Recurrent highway networks | 23M | 65.4 | 67.9 | Recurrent Highway Networks | |
| Zolna et al. (2017) - AWD-LSTM 3-layer with Fraternal dropout | 24M | 56.8 | 58.9 | Fraternal Dropout | |
| Liu et al. (2018) - Differentiable NAS | 23M | 56.1 | 58.3 | DARTS: Differentiable Architecture Search | |
| Melis et al. (2018) - 2-layer skip-LSTM + dropout tuning | 24M | 55.3 | 57.1 | Pushing the bounds of dropout | |
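
Several entries above (the Variational LSTM and Variational RHN rows, and the dropout-tuned AWD-LSTM variants) build on the variational dropout scheme of Gal & Ghahramani (2016), which samples one dropout mask per sequence and reuses it at every timestep, rather than resampling a fresh mask per step as standard dropout does. As a rough illustration only (this is not code from any of the cited papers; the function name and the `(seq_len, batch, hidden)` tensor layout are assumptions), a minimal PyTorch sketch:

```python
import torch

def variational_dropout(x, p=0.5, training=True):
    """Apply the same dropout mask at every timestep.

    x: tensor of shape (seq_len, batch, hidden). Standard dropout samples
    a fresh mask per timestep; the variational scheme of Gal & Ghahramani
    (2016) samples one mask per sequence and broadcasts it over time.
    """
    if not training or p == 0.0:
        return x
    # One Bernoulli keep-mask per (batch, hidden) position, scaled by
    # 1/(1-p) (inverted dropout) and broadcast across the time dimension.
    mask = x.new_empty(1, x.size(1), x.size(2)).bernoulli_(1 - p) / (1 - p)
    return x * mask

if __name__ == "__main__":
    h = torch.randn(35, 20, 650)            # (seq_len, batch, hidden)
    out = variational_dropout(h, p=0.5)
    # Because the mask is shared across time, a dropped unit is zero at
    # every timestep of the sequence, not just at isolated steps.
```

In the papers above this is typically applied between stacked recurrent layers (and, with per-row variations, to recurrent and embedding weights); the sketch shows only the shared-mask idea itself.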