| Transformer+BT (ADMIN init) | 46.4 | Very Deep Transformers for Neural Machine Translation | |
| Noisy back-translation | 45.6 | Understanding Back-Translation at Scale | |
| MUSE (Parallel Multi-Scale Attention) | 43.5 | MUSE: Parallel Multi-Scale Attention for Sequence to Sequence Learning | |
| TaLK Convolutions | 43.2 | Time-aware Large Kernel Convolutions | |
| Synthesizer (Random + Vanilla) | 41.85 | Synthesizer: Rethinking Self-Attention in Transformer Models | |
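The BLEU scores above come from the cited papers, which use differing evaluation setups (tokenized BLEU vs. detokenized sacreBLEU), so small gaps may not be directly comparable. A minimal sketch of scoring a system output with the sacreBLEU package, using hypothetical example sentences:

```python
# Hedged sketch: corpus-level BLEU with sacreBLEU (pip install sacrebleu).
# The sentences below are made-up placeholders, not data from any paper.
import sacrebleu

hypotheses = ["the cat sat on the mat"]          # system translations
references = [["the cat is sitting on the mat"]]  # one reference set per system

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```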