| Model | Bit per Character (BPC) | Paper / Source | Code |
| --- | --- | --- | --- |
| td-LSTM (Zhang et al., 2016) | 1.63 | Architectural Complexity Measures of Recurrent Neural Networks | - |
| Large mLSTM +emb +WN +VD | 1.27 | Multiplicative LSTM for sequence modelling | - |
| 12-layer Character Transformer Model | 1.18 | Character-Level Language Modeling with Deeper Self-Attention | - |
| PAR Transformer 24B | 1.18 | Pay Attention when Required | - |
| 64-layer Character Transformer Model | 1.13 | Character-Level Language Modeling with Deeper Self-Attention | - |
| 12L Transformer + 8K adaptive span | 1.11 | Adaptive Attention Span in Transformers | - |
| All-attention network - 18 layers | 1.11 | Augmenting Self-attention with Persistent Memory | - |
| All-attention network - 36 layers | 1.08 | Augmenting Self-attention with Persistent Memory | - |
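
The metric reported above is bits per character (BPC): the average negative log2-likelihood a model assigns to each character, so lower is better. As a minimal sketch of how a reported cross-entropy loss maps to BPC (the helper name `bits_per_character` is our own illustration, not code from any of the papers listed):

```python
import math

def bits_per_character(total_nll_nats: float, num_characters: int) -> float:
    """Convert a summed negative log-likelihood (in nats) over a character
    sequence into bits per character (BPC); lower is better."""
    # 1 nat = 1 / ln(2) bits, so divide by ln(2) and by the character count.
    return total_nll_nats / (num_characters * math.log(2))

# Example: an average loss of ~0.77 nats per character is ~1.11 BPC,
# in the range of the adaptive-span and all-attention rows above.
print(round(bits_per_character(0.77 * 1_000_000, 1_000_000), 2))  # -> 1.11
```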