
Abstract
Large language models (LLMs) have achieved excellent performance on a wide range of tasks. However, fine-tuning an LLM typically requires extensive supervision in the form of labeled data. Humans, in contrast, can improve their reasoning abilities by self-thinking, without external inputs. In this work, we demonstrate that an LLM is also capable of self-improving using only unlabeled datasets. We use a pre-trained LLM to generate "high-confidence" rationale-augmented answers for unlabeled questions via Chain-of-Thought (CoT) prompting and self-consistency, and then fine-tune the LLM using those self-generated solutions as target outputs. We show that this approach improves the general reasoning ability of a 540B-parameter LLM: accuracy rises from 74.4% to 82.1% on GSM8K, from 78.2% to 83.0% on DROP, from 90.0% to 94.4% on OpenBookQA, and from 63.4% to 67.9% on ANLI-A3, all without using any ground-truth labels. We further conduct ablation studies and show that fine-tuning on reasoning is critical for self-improvement. The approach achieves state-of-the-art-level performance without any human annotation.
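The core of the pipeline is the self-consistency filter: sample several CoT reasoning paths per unlabeled question, majority-vote the final answers, and keep only the paths agreeing with a high-confidence majority answer as fine-tuning targets. The sketch below illustrates that voting step; the sampler callback, the `threshold` value, and all function names are illustrative assumptions, not the paper's actual implementation.

```python
from collections import Counter
from typing import Callable, List, Tuple

def self_consistent_targets(
    sample_paths: Callable[[str], List[Tuple[str, str]]],
    question: str,
    threshold: float = 0.6,
) -> Tuple[str, List[str]]:
    """Majority-vote over sampled (reasoning, answer) pairs for one question.

    `sample_paths` stands in for temperature-sampling m CoT outputs from the
    LLM. Returns the majority answer and the reasoning paths that reached it,
    or ("", []) when the vote share falls below `threshold` (the question is
    then discarded rather than used as a noisy training target).
    """
    paths = sample_paths(question)
    counts = Counter(answer for _, answer in paths)
    answer, votes = counts.most_common(1)[0]
    if votes / len(paths) < threshold:
        return "", []  # low confidence: skip this question
    kept = [cot for cot, ans in paths if ans == answer]
    return answer, kept

# Illustrative usage with a stub sampler (a real one would call the LLM):
def stub_sampler(question: str) -> List[Tuple[str, str]]:
    return [("path a", "18"), ("path b", "18"), ("path c", "26"),
            ("path d", "18"), ("path e", "18")]

answer, targets = self_consistent_targets(stub_sampler, "some GSM8K question")
# answer == "18"; the four agreeing paths become fine-tuning examples
```

The kept (question, reasoning, answer) triples then serve as the self-generated supervision for the subsequent fine-tuning stage described above.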
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| arithmetic-reasoning-on-gsm8k | PaLM 540B (Self Consistency) | Accuracy: 74.4; Parameters (Billion): 540 |
| arithmetic-reasoning-on-gsm8k | PaLM 540B (Self Improvement, Standard-Prompting) | Accuracy: 32.2; Parameters (Billion): 540 |
| arithmetic-reasoning-on-gsm8k | PaLM 540B (Self Improvement, Self Consistency) | Accuracy: 82.1; Parameters (Billion): 540 |
| arithmetic-reasoning-on-gsm8k | PaLM 540B (CoT Prompting) | Accuracy: 56.5; Parameters (Billion): 540 |
| arithmetic-reasoning-on-gsm8k | PaLM 540B (Self Improvement, CoT Prompting) | Accuracy: 73.5; Parameters (Billion): 540 |
| arithmetic-reasoning-on-gsm8k | PaLM 540B (Standard-Prompting) | Accuracy: 17.9; Parameters (Billion): 540 |
| common-sense-reasoning-on-arc-challenge | PaLM 540B (Self Improvement, CoT Prompting) | Accuracy: 88.3 |
| common-sense-reasoning-on-arc-challenge | PaLM 540B (CoT Prompting) | Accuracy: 85.2 |
| common-sense-reasoning-on-arc-challenge | PaLM 540B (Standard-Prompting) | Accuracy: 87.1 |
| common-sense-reasoning-on-arc-challenge | PaLM 540B (Self Improvement, Self Consistency) | Accuracy: 89.8 |
| common-sense-reasoning-on-arc-challenge | PaLM 540B (Self Improvement, Standard-Prompting) | Accuracy: 87.2 |
| common-sense-reasoning-on-arc-challenge | PaLM 540B (Self Consistency) | Accuracy: 88.7 |
| natural-language-inference-on-anli-test | PaLM 540B (Self Consistency) | A2: 64.5; A3: 63.4 |
| natural-language-inference-on-anli-test | PaLM 540B (Self Improvement, Self Consistency) | A2: 66.5; A3: 67.9 |
| natural-language-inference-on-anli-test | PaLM 540B (CoT Prompting) | A2: 58.9; A3: 60.6 |
| natural-language-inference-on-anli-test | PaLM 540B (Self Improvement, Standard-Prompting) | A2: 64.8; A3: 66.9 |
| natural-language-inference-on-anli-test | PaLM 540B (Standard-Prompting) | A2: 55.8; A3: 55.8 |
| natural-language-inference-on-anli-test | PaLM 540B (Self Improvement, CoT Prompting) | A2: 65.3; A3: 67.3 |
| question-answering-on-drop | PaLM 540B (Self Consistency) | Accuracy: 78.2 |
| question-answering-on-drop | PaLM 540B (Self Improvement, Self Consistency) | Accuracy: 83.0 |
| question-answering-on-drop | PaLM 540B (Self Improvement, Standard-Prompting) | Accuracy: 71.7 |
| question-answering-on-drop | PaLM 540B (Standard-Prompting) | Accuracy: 60 |
| question-answering-on-drop | PaLM 540B (CoT Prompting) | Accuracy: 70.6 |
| question-answering-on-drop | PaLM 540B (Self Improvement, CoT Prompting) | Accuracy: 76.2 |
| question-answering-on-openbookqa | PaLM 540B (Standard-Prompting) | Accuracy: 84.4 |
| question-answering-on-openbookqa | PaLM 540B (CoT Prompting) | Accuracy: 86.4 |
| question-answering-on-openbookqa | PaLM 540B (Self Improvement, Self Consistency) | Accuracy: 94.4 |
| question-answering-on-openbookqa | PaLM 540B (Self Improvement, CoT Prompting) | Accuracy: 93 |
| question-answering-on-openbookqa | PaLM 540B (Self Improvement, Standard-Prompting) | Accuracy: 92 |
| question-answering-on-openbookqa | PaLM 540B (Self Consistency) | Accuracy: 90 |