| PaLM 540B (Self Improvement, Self Consistency) | 94.4 | Large Language Models Can Self-Improve | - |
| PaLM 540B (Self Improvement, CoT Prompting) | 93.0 | Large Language Models Can Self-Improve | - |
| PaLM 540B (Self Improvement, Standard-Prompting) | 92.0 | Large Language Models Can Self-Improve | - |
| DeBERTa-xxlarge 1.5B + MVP-Tuning | 91.3 | - | - |
| PaLM 540B (Self Consistency) | 90.0 | Large Language Models Can Self-Improve | - |
| AristoRoBERTa + MVP-Tuning | 87.6 | - | - |
| AristoRoBERTa + Graph Soft Counter | 87.4 | GNN is a Counter? Revisiting GNN for Question Answering | - |
| PaLM 540B (CoT Prompting) | 86.4 | Large Language Models Can Self-Improve | - |
| PaLM 540B (Standard-Prompting) | 84.4 | Large Language Models Can Self-Improve | - |