Jiaxin Huang Shixiang Shane Gu Le Hou Yuexin Wu Xuezhi Wang Hongkun Yu Jiawei Han

Abstract
Large Language Models (LLMs) have achieved excellent performance on various tasks. However, fine-tuning an LLM requires extensive supervision. Humans, on the other hand, can improve their reasoning abilities through self-thinking, without external inputs. In this work, we demonstrate that an LLM is also capable of self-improving with only unlabeled datasets. We use a pre-trained LLM to generate "high-confidence" rationale-augmented answers for unlabeled questions using Chain-of-Thought (CoT) prompting and self-consistency, and fine-tune the LLM using those self-generated solutions as target outputs. We show that our approach improves the general reasoning ability of a 540B-parameter LLM (74.4%→82.1% on GSM8K, 78.2%→83.0% on DROP, 90.0%→94.4% on OpenBookQA, and 63.4%→67.9% on ANLI-A3) and achieves state-of-the-art-level performance, without any ground-truth labels. We conduct ablation studies and show that fine-tuning on reasoning is critical for self-improvement.
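The self-training loop the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: `sample_fn` stands in for sampled CoT decoding from the LLM, and the sample count `m` and confidence threshold are hypothetical knobs for the paper's filtering details.

```python
from collections import Counter

def most_consistent_answer(samples):
    """Self-consistency: majority-vote over sampled (rationale, answer) pairs."""
    counts = Counter(ans for _, ans in samples)
    answer, votes = counts.most_common(1)[0]
    confidence = votes / len(samples)  # fraction of samples agreeing with the vote
    return answer, confidence

def build_self_training_data(questions, sample_fn, m=32, threshold=0.5):
    """For each unlabeled question, sample m chain-of-thought solutions,
    majority-vote the final answer, and keep the rationales that agree
    with the vote as fine-tuning targets when confidence is high enough.
    """
    data = []
    for q in questions:
        samples = [sample_fn(q) for _ in range(m)]  # (rationale, answer) pairs
        answer, confidence = most_consistent_answer(samples)
        if confidence >= threshold:  # "high-confidence" filter
            for rationale, ans in samples:
                if ans == answer:
                    data.append({"question": q, "target": rationale})
    return data
```

The fine-tuning step itself (training the LLM on `question → target` pairs) is standard supervised fine-tuning and is omitted here.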
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| GSM8K (arithmetic reasoning) | PaLM 540B (Standard-Prompting) | Accuracy: 17.9, Parameters (B): 540 |
| GSM8K (arithmetic reasoning) | PaLM 540B (CoT Prompting) | Accuracy: 56.5, Parameters (B): 540 |
| GSM8K (arithmetic reasoning) | PaLM 540B (Self Consistency) | Accuracy: 74.4, Parameters (B): 540 |
| GSM8K (arithmetic reasoning) | PaLM 540B (Self Improvement, Standard-Prompting) | Accuracy: 32.2, Parameters (B): 540 |
| GSM8K (arithmetic reasoning) | PaLM 540B (Self Improvement, CoT Prompting) | Accuracy: 73.5, Parameters (B): 540 |
| GSM8K (arithmetic reasoning) | PaLM 540B (Self Improvement, Self Consistency) | Accuracy: 82.1, Parameters (B): 540 |
| ARC-Challenge (common-sense reasoning) | PaLM 540B (Standard-Prompting) | Accuracy: 87.1 |
| ARC-Challenge (common-sense reasoning) | PaLM 540B (CoT Prompting) | Accuracy: 85.2 |
| ARC-Challenge (common-sense reasoning) | PaLM 540B (Self Consistency) | Accuracy: 88.7 |
| ARC-Challenge (common-sense reasoning) | PaLM 540B (Self Improvement, Standard-Prompting) | Accuracy: 87.2 |
| ARC-Challenge (common-sense reasoning) | PaLM 540B (Self Improvement, CoT Prompting) | Accuracy: 88.3 |
| ARC-Challenge (common-sense reasoning) | PaLM 540B (Self Improvement, Self Consistency) | Accuracy: 89.8 |
| ANLI test (natural language inference) | PaLM 540B (Standard-Prompting) | A2: 55.8, A3: 55.8 |
| ANLI test (natural language inference) | PaLM 540B (CoT Prompting) | A2: 58.9, A3: 60.6 |
| ANLI test (natural language inference) | PaLM 540B (Self Consistency) | A2: 64.5, A3: 63.4 |
| ANLI test (natural language inference) | PaLM 540B (Self Improvement, Standard-Prompting) | A2: 64.8, A3: 66.9 |
| ANLI test (natural language inference) | PaLM 540B (Self Improvement, CoT Prompting) | A2: 65.3, A3: 67.3 |
| ANLI test (natural language inference) | PaLM 540B (Self Improvement, Self Consistency) | A2: 66.5, A3: 67.9 |
| DROP (question answering) | PaLM 540B (Standard-Prompting) | Accuracy: 60.0 |
| DROP (question answering) | PaLM 540B (CoT Prompting) | Accuracy: 70.6 |
| DROP (question answering) | PaLM 540B (Self Consistency) | Accuracy: 78.2 |
| DROP (question answering) | PaLM 540B (Self Improvement, Standard-Prompting) | Accuracy: 71.7 |
| DROP (question answering) | PaLM 540B (Self Improvement, CoT Prompting) | Accuracy: 76.2 |
| DROP (question answering) | PaLM 540B (Self Improvement, Self Consistency) | Accuracy: 83.0 |
| OpenBookQA (question answering) | PaLM 540B (Standard-Prompting) | Accuracy: 84.4 |
| OpenBookQA (question answering) | PaLM 540B (CoT Prompting) | Accuracy: 86.4 |
| OpenBookQA (question answering) | PaLM 540B (Self Consistency) | Accuracy: 90.0 |
| OpenBookQA (question answering) | PaLM 540B (Self Improvement, Standard-Prompting) | Accuracy: 92.0 |
| OpenBookQA (question answering) | PaLM 540B (Self Improvement, CoT Prompting) | Accuracy: 93.0 |
| OpenBookQA (question answering) | PaLM 540B (Self Improvement, Self Consistency) | Accuracy: 94.4 |