| Cutoff + Relaxed Attention + LM | 37.96 | Relaxed Attention for Transformer Models | |
| Transformer + R-Drop + Cutoff | 37.90 | R-Drop: Regularized Dropout for Neural Networks | |
| Mask Attention Network (small) | 36.3 | Mask Attention Networks: Rethinking and Strengthen Transformer | |
| MUSE (Parallel Multi-Scale Attention) | 36.3 | MUSE: Parallel Multi-Scale Attention for Sequence to Sequence Learning | |