| Noisy back-translation | 35.0 | 146G | | Understanding Back-Translation at Scale | |
| Transformer + R-Drop | 30.91 | 49G | | R-Drop: Regularized Dropout for Neural Networks | |
| Data Diversification - Transformer | 30.7 | | | Data Diversification: A Simple Strategy For Neural Machine Translation | |
| Mask Attention Network (big) | 30.4 | | | Mask Attention Networks: Rethinking and Strengthen Transformer | |
| MUSE (Parallel Multi-Scale Attention) | 29.9 | | | MUSE: Parallel Multi-Scale Attention for Sequence to Sequence Learning | |
| TaLK Convolutions | 29.6 | | | Time-aware Large Kernel Convolutions | |