| FLAN 137B (few-shot, k=10) | 94.7 | Finetuned Language Models Are Zero-Shot Learners | - |
| SparseGPT (175B, 50% sparsity) | 78.87 | SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot | - |
| Memory chains and semantic supervision | 78.7 | - | - |
| SparseGPT (175B, 4:8 sparsity) | 77.02 | SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot | - |
| SparseGPT (175B, 2:4 sparsity) | 76.19 | SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot | - |
| sMLP – deterministic 9.4B (0-shot) | 74.7 | Efficient Language Modeling with Sparse all-MLP | - |
| GPT-3 Large 760M (zero-shot) | 72.4 | Language Models are Few-Shot Learners | - |