| Model | Bit per Character (BPC) | Number of params | Paper / Source | Code |
| ----- | :---------------------: | :--------------: | -------------- | ---- |
| SHA-LSTM (4 layers, h=1024, no attention head) | 1.33 | 51M | Single Headed Attention RNN: Stop Thinking With Your Head | |
| Recurrent Highway Networks | 1.27 | 46M | Recurrent Highway Networks | |
| Large FS-LSTM-4 | 1.25 | 47M | Fast-Slow Recurrent Neural Networks | |
| 12-layer Character Transformer Model | 1.11 | 44M | Character-Level Language Modeling with Deeper Self-Attention | |
| SHA-RNN (4 layers, h=1024, single attention head) | 1.076 | 52M | Single Headed Attention RNN: Stop Thinking With Your Head | |
| SHA-RNN (4 layers, h=1024, attention head per layer) | 1.068 | 54M | Single Headed Attention RNN: Stop Thinking With Your Head | |
| Skip Cross-Head Transformer-XL | 1.033 | 41M | Memory-efficient Stochastic methods for Memory-based Transformers | |
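
For comparing entries, the BPC column is the standard bits-per-character test metric: the model's average negative log-likelihood per character expressed in base 2. A minimal sketch of the conversion is below (the helper name and example numbers are illustrative, not taken from any of the papers above):

```python
import math

def bits_per_character(total_nll_nats: float, num_characters: int) -> float:
    """Convert a summed negative log-likelihood (in nats) over a test set
    into bits per character: divide by the character count and by ln(2)."""
    return total_nll_nats / (num_characters * math.log(2))

# Illustrative example: a mean cross-entropy of ~0.75 nats/char corresponds
# to ~1.08 BPC, roughly the range of the SHA-RNN entries above.
print(bits_per_character(total_nll_nats=0.75 * 1_000_000, num_characters=1_000_000))
```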