
Scaling Instruction-Finetuned Language Models

Abstract

Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. This paper explores three aspects of instruction finetuning in particular: (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. We find that instruction finetuning combining these aspects dramatically improves performance across a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation). For instance, Flan-PaLM 540B, instruction-finetuned on 1.8K tasks, outperforms PaLM 540B by a large margin across many evaluation metrics (+9.4% on average). Flan-PaLM 540B achieves state-of-the-art performance on five-shot MMLU with 75.2% accuracy. We also publicly release Flan-T5 checkpoints, which achieve strong few-shot performance even compared to much larger models such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models.
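Because the released Flan-T5 checkpoints are instruction-finetuned, they can be used off the shelf for zero-shot inference by phrasing a task as a natural-language instruction. Below is a minimal sketch, assuming the Hugging Face `transformers` library and the publicly hosted `google/flan-t5-large` checkpoint (the Hub mirror of Flan-T5-Large 780M); the prompt wording is illustrative, not taken from the paper.

```python
# Minimal zero-shot inference sketch for a released Flan-T5 checkpoint,
# assuming the Hugging Face `transformers` library is installed.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")

# Because the model was finetuned on tasks phrased as instructions,
# an unseen task can be posed directly as an instruction.
prompt = "Answer the following question. What is the capital of France?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```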

Code Repositories

Benchmarks

| Benchmark | Method | Metric | Value |
| --- | --- | --- | --- |
| coreference-resolution-on-winograd-schema | Flan-T5 XXL (zero-shot) | Accuracy | 89.82 |
| cross-lingual-question-answering-on-tydiqa | Flan-PaLM 540B (direct-prompting) | EM | 67.8 |
| cross-lingual-question-answering-on-tydiqa | Flan-U-PaLM 540B (direct-prompting) | EM | 68.3 |
| multi-task-language-understanding-on-bbh-alg | Flan-PaLM 540B (3-shot, fine-tuned, CoT) | Average (%) | 61.3 |
| multi-task-language-understanding-on-bbh-alg | PaLM 540B (CoT) | Average (%) | 57.6 |
| multi-task-language-understanding-on-bbh-alg | Flan-PaLM 540B (3-shot, fine-tuned, CoT + SC) | Average (%) | 66.5 |
| multi-task-language-understanding-on-bbh-alg | PaLM 540B | Average (%) | 38.3 |
| multi-task-language-understanding-on-bbh-alg | Flan-PaLM 540B (3-shot, fine-tuned) | Average (%) | 48.2 |
| multi-task-language-understanding-on-bbh-alg | PaLM 540B (CoT + self-consistency) | Average (%) | 62.2 |
| multi-task-language-understanding-on-bbh-nlp | PaLM 540B (CoT) | Average (%) | 71.2 |
| multi-task-language-understanding-on-bbh-nlp | PaLM 540B | Average (%) | 62.7 |
| multi-task-language-understanding-on-bbh-nlp | Flan-PaLM 540B (5-shot, fine-tuned) | Average (%) | 70.0 |
| multi-task-language-understanding-on-bbh-nlp | Flan-PaLM 540B (3-shot, fine-tuned, CoT + SC) | Average (%) | 78.4 |
| multi-task-language-understanding-on-bbh-nlp | PaLM 540B (CoT + self-consistency) | Average (%) | 78.2 |
| multi-task-language-understanding-on-bbh-nlp | Flan-PaLM 540B (3-shot, fine-tuned, CoT) | Average (%) | 72.4 |
| multi-task-language-understanding-on-mgsm | Flan-U-PaLM 540B (CoT) | Average (%) | 60.4 |
| multi-task-language-understanding-on-mgsm | Flan-PaLM 540B (8-shot, fine-tuned, CoT + SC) | Average (%) | 72.0 |
| multi-task-language-understanding-on-mgsm | code-davinci-002 | Average (%) | 35 |
| multi-task-language-understanding-on-mgsm | Flan-PaLM 540B (8-shot, fine-tuned, CoT) | Average (%) | 57.0 |
| multi-task-language-understanding-on-mgsm | GPT-3 Davinci 175B | Average (%) | 5.7 |
| multi-task-language-understanding-on-mgsm | text-davinci-003 | Average (%) | 36 |
| multi-task-language-understanding-on-mgsm | Flan-PaLM 540B (8-shot, fine-tuned) | Average (%) | 21.2 |
| multi-task-language-understanding-on-mgsm | text-davinci-002 | Average (%) | 23.7 |
| multi-task-language-understanding-on-mmlu | Flan-T5-Base 250M (CoT) | Average (%) | 33.7 |
| multi-task-language-understanding-on-mmlu | llama 2 (65b) | Average (%) | 73.5 |
| multi-task-language-understanding-on-mmlu | Flan-T5-Small 80M | Average (%) | 28.7 |
| multi-task-language-understanding-on-mmlu | GPT-3 Davinci 175B (CoT) | Average (%) | 59.5 |
| multi-task-language-understanding-on-mmlu | Flan-T5-Large 780M | Average (%) | 45.1 |
| multi-task-language-understanding-on-mmlu | Flan-T5-XL 3B (CoT) | Average (%) | 45.5 |
| multi-task-language-understanding-on-mmlu | Flan-T5-Base 250M | Average (%) | 35.9 |
| multi-task-language-understanding-on-mmlu | Flan-PaLM (5-shot, fine-tuned) | Average (%) | 72.2 |
| multi-task-language-understanding-on-mmlu | Flan-T5-Large 780M (CoT) | Average (%) | 40.5 |
| multi-task-language-understanding-on-mmlu | GPT-3 Davinci 175B (5-shot) | Average (%) | 39.7 |
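Several table entries distinguish direct prompting from chain-of-thought (CoT) prompting, with and without self-consistency (SC). The sketch below illustrates the difference at the prompt level and how SC majority-votes over sampled CoT answers; the exemplar text and the `generate_fn` stub are hypothetical illustrations, not taken from the paper.

```python
from collections import Counter

# Direct prompting asks for the answer immediately.
direct_prompt = "Q: A pack has 3 pens. How many pens are in 4 packs?\nA:"

# CoT prompting includes worked reasoning in the exemplar, so the model
# emits intermediate steps before the final answer.
cot_prompt = (
    "Q: A pack has 3 pens. How many pens are in 4 packs?\n"
    "A: Each pack has 3 pens, and there are 4 packs, so 3 * 4 = 12. "
    "The answer is 12."
)

def self_consistency(generate_fn, prompt, n_samples=8):
    """Self-consistency: sample several CoT completions and majority-vote
    on the final answers. `generate_fn` is a hypothetical stub returning
    one sampled completion string ending in 'The answer is X.'"""
    answers = []
    for _ in range(n_samples):
        completion = generate_fn(prompt)
        answers.append(completion.rsplit("The answer is", 1)[-1].strip(" ."))
    return Counter(answers).most_common(1)[0][0]
```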
