| Transformer+BT (ADMIN init) | 46.4 | Very Deep Transformers for Neural Machine Translation |  | 
| Noisy back-translation | 45.6 | Understanding Back-Translation at Scale |  | 
| MUSE (Parallel Multi-scale Attention) | 43.5 | MUSE: Parallel Multi-Scale Attention for Sequence to Sequence Learning |  | 
| TaLK Convolutions | 43.2 | Time-aware Large Kernel Convolutions |  | 
| Synthesizer (Random + Vanilla) | 41.85 | Synthesizer: Rethinking Self-Attention in Transformer Models |  |