| ST-MoE-32B 269B (fine-tuned) | 99.2 | ST-MoE: Designing Stable and Transferable Sparse Expert Models | |
| GPT-3 175B (few-shot, k=32) | 92.0 | Language Models are Few-Shot Learners | |
| RoBERTa-WinoGrande-ft 355M (fine-tuned) | 90.6 | WinoGrande: An Adversarial Winograd Schema Challenge at Scale | |