
Large Language Models Can Self-Improve

Abstract

Large language models (LLMs) have achieved excellent performance across a wide range of tasks. However, fine-tuning an LLM typically requires large amounts of labeled data as a supervision signal. Humans, by contrast, can improve their reasoning through self-reflection, without external input. In this work, we demonstrate that an LLM can also self-improve using only unlabeled datasets. Using a pre-trained LLM, we generate "high-confidence" rationale-augmented answers for unlabeled questions via Chain-of-Thought prompting and self-consistency, and then fine-tune the model using those self-generated solutions as target outputs. We show that this approach substantially improves the general reasoning ability of a 540-billion-parameter LLM: accuracy rises from 74.4% to 82.1% on GSM8K, from 78.2% to 83.0% on DROP, from 90.0% to 94.4% on OpenBookQA, and from 63.4% to 67.9% on ANLI-A3, without using any ground-truth labels. We further conduct ablation studies showing that fine-tuning on reasoning is critical to self-improvement. Our approach achieves state-of-the-art performance without any human annotation.

Benchmarks

All results below use PaLM 540B (540 B parameters).

Arithmetic Reasoning on GSM8K

| Method | Accuracy |
|---|---|
| Standard-Prompting | 17.9 |
| CoT Prompting | 56.5 |
| Self Consistency | 74.4 |
| Self Improvement, Standard-Prompting | 32.2 |
| Self Improvement, CoT Prompting | 73.5 |
| Self Improvement, Self Consistency | 82.1 |

Common Sense Reasoning on ARC-Challenge

| Method | Accuracy |
|---|---|
| Standard-Prompting | 87.1 |
| CoT Prompting | 85.2 |
| Self Consistency | 88.7 |
| Self Improvement, Standard-Prompting | 87.2 |
| Self Improvement, CoT Prompting | 88.3 |
| Self Improvement, Self Consistency | 89.8 |

Natural Language Inference on ANLI (test)

| Method | A2 | A3 |
|---|---|---|
| Standard-Prompting | 55.8 | 55.8 |
| CoT Prompting | 58.9 | 60.6 |
| Self Consistency | 64.5 | 63.4 |
| Self Improvement, Standard-Prompting | 64.8 | 66.9 |
| Self Improvement, CoT Prompting | 65.3 | 67.3 |
| Self Improvement, Self Consistency | 66.5 | 67.9 |

Question Answering on DROP

| Method | Accuracy |
|---|---|
| Standard-Prompting | 60.0 |
| CoT Prompting | 70.6 |
| Self Consistency | 78.2 |
| Self Improvement, Standard-Prompting | 71.7 |
| Self Improvement, CoT Prompting | 76.2 |
| Self Improvement, Self Consistency | 83.0 |

Question Answering on OpenBookQA

| Method | Accuracy |
|---|---|
| Standard-Prompting | 84.4 |
| CoT Prompting | 86.4 |
| Self Consistency | 90.0 |
| Self Improvement, Standard-Prompting | 92.0 |
| Self Improvement, CoT Prompting | 93.0 |
| Self Improvement, Self Consistency | 94.4 |
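The standard-prompting and CoT-prompting baselines compared above differ only in whether the few-shot exemplars include a worked rationale before the final answer. A minimal sketch of a few-shot CoT prompt builder (function and variable names are hypothetical, not from the paper):

```python
def build_cot_prompt(exemplars, question):
    """Assemble a few-shot chain-of-thought prompt. Each exemplar is a
    (question, worked_rationale) pair whose rationale ends with the final
    answer, so the model learns to reason step by step before answering."""
    parts = [f"Q: {q}\nA: {rationale}" for q, rationale in exemplars]
    # the trailing "A:" leaves the last answer for the model to complete
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)
```

For standard prompting, the same builder would be fed exemplars whose answer strings contain no intermediate reasoning, only the final answer.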
