| ST-MoE-32B 269B (fine-tuned) | 95.1 | - | ST-MoE: Designing Stable and Transferable Sparse Expert Models | |
| PaLM 540B (fine-tuned) | 94.0 | 94.6 | PaLM: Scaling Language Modeling with Pathways | |
| KELM (fine-tuned RoBERTa-large-based single model) | 89.1 | 89.6 | KELM: Knowledge Enhanced Pre-Trained Language Representations with Message Passing on Hierarchical Relational Graphs | |
| FLAN 137B (prompt-tuned) | 85.1 | - | Finetuned Language Models Are Zero-Shot Learners | |
| XLNet + MTL + Verifier (ensemble) | 83.090 | 83.737 | - | - |
| GPT-3 Large 760M (0-shot) | 82.1 | - | Language Models are Few-Shot Learners | |
| XLNet + MTL + Verifier (single model) | 81.460 | 82.664 | - | - |
| Switch Transformer 9B | 79.9 | - | Efficient Language Modeling with Sparse all-MLP | - |