| PaLM 540B (Self Improvement, Self Consistency) | - | 66.5 | 67.9 | Large Language Models Can Self-Improve | - |
| PaLM 540B (Self Improvement, CoT Prompting) | - | 65.3 | 67.3 | Large Language Models Can Self-Improve | - |
| PaLM 540B (Self Improvement, Standard-Prompting) | - | 64.8 | 66.9 | Large Language Models Can Self-Improve | - |
| PaLM 540B (Self Consistency) | - | 64.5 | 63.4 | Large Language Models Can Self-Improve | - |
| PaLM 540B (CoT Prompting) | - | 58.9 | 60.6 | Large Language Models Can Self-Improve | - |
| PaLM 540B (Standard-Prompting) | - | 55.8 | 55.8 | Large Language Models Can Self-Improve | - |
| ALUM (RoBERTa-LARGE) | 72.3 | 52.1 | 48.4 | Adversarial Training for Large Neural Language Models | - |