| PaLM 2 (few-shot, CoT, SC) | 95.1 | PaLM 2 Technical Report | - |
| Shivaay (4B, few-shot, k=8) | 91.04 | - | - |
| Claude 2 (few-shot, k=5) | 91.0 | Model Card and Evaluations for Claude Models | - |
| Claude 1.3 (few-shot, k=5) | 90.0 | Model Card and Evaluations for Claude Models | - |
| PaLM 540B (Self-Improvement, Self-Consistency) | 89.8 | Large Language Models Can Self-Improve | - |
| PaLM 540B (Self-Consistency) | 88.7 | Large Language Models Can Self-Improve | - |
| PaLM 540B (Self-Improvement, CoT Prompting) | 88.3 | Large Language Models Can Self-Improve | - |
| PaLM 540B (Self-Improvement, Standard Prompting) | 87.2 | Large Language Models Can Self-Improve | - |
| PaLM 540B (Standard Prompting) | 87.1 | Large Language Models Can Self-Improve | - |
| ST-MoE-32B 269B (fine-tuned) | 86.5 | ST-MoE: Designing Stable and Transferable Sparse Expert Models | - |
| Claude Instant 1.1 (few-shot, k=5) | 85.7 | Model Card and Evaluations for Claude Models | - |
| PaLM 540B (CoT Prompting) | 85.2 | Large Language Models Can Self-Improve | - |
| LLaMA 3 8B + MoSLoRA (fine-tuned) | 81.5 | Mixture-of-Subspaces in Low-Rank Adaptation | - |
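Several entries above differ only in decoding strategy, not model: self-consistency samples multiple chain-of-thought reasoning paths and majority-votes their final answers instead of greedily decoding one. The sketch below illustrates that voting step in Python, assuming hypothetical `sample_completion` and `extract_answer` callables as stand-ins for a model API and an answer parser; it does not reproduce any paper's exact evaluation setup.

```python
from collections import Counter
from typing import Callable, List

def self_consistency_answer(
    sample_completion: Callable[[str], str],  # hypothetical model call: prompt -> sampled CoT completion
    extract_answer: Callable[[str], str],     # hypothetical parser: completion -> final answer string
    prompt: str,
    n_samples: int = 16,                      # number of sampled reasoning paths to vote over
) -> str:
    """Majority-vote the final answers across sampled chain-of-thought paths."""
    answers: List[str] = [
        extract_answer(sample_completion(prompt)) for _ in range(n_samples)
    ]
    return Counter(answers).most_common(1)[0][0]

# Toy usage with a stubbed sampler; a real setup would sample an LLM at temperature > 0.
if __name__ == "__main__":
    import random

    fake_paths = [
        "... so the answer is (B)",
        "... so the answer is (B)",
        "... so the answer is (C)",
    ]
    vote = self_consistency_answer(
        sample_completion=lambda p: random.choice(fake_paths),
        extract_answer=lambda c: c.rsplit(" ", 1)[-1],
        prompt="ARC question preceded by few-shot CoT exemplars ...",
        n_samples=16,
    )
    print(vote)  # most frequent final answer, e.g. "(B)"
```

Because the vote is over extracted answers rather than full reasoning strings, divergent chains of thought that reach the same conclusion still count as agreement.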