
Abstract
Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. This paper explores three aspects of instruction finetuning in particular: (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. We find that instruction finetuning with these aspects combined dramatically improves performance across a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation). For instance, Flan-PaLM 540B, instruction-finetuned on 1.8K tasks, outperforms PaLM 540B by a large margin (+9.4% on average) across evaluations. Flan-PaLM 540B achieves 75.2% on five-shot MMLU, a state-of-the-art result at the time. We also publicly release Flan-T5 checkpoints, which achieve strong few-shot performance even compared to much larger models such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models.
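The released Flan-T5 checkpoints can be loaded through the Hugging Face Hub (e.g. google/flan-t5-base up to google/flan-t5-xxl). Below is a minimal sketch of zero-shot, instruction-style inference with the transformers library; the model name, prompt, and generation settings are illustrative choices, not the paper's evaluation harness:

```python
# Minimal zero-shot inference with a released Flan-T5 checkpoint.
# Assumes `pip install transformers torch`; prompt and generation
# settings here are illustrative, not the paper's exact setup.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/flan-t5-base"  # larger variants: -large, -xl, -xxl
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Instruction-style prompt: Flan models are finetuned to follow
# natural-language task descriptions without in-context examples.
prompt = "Answer the following question. What is the capital of France?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```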
Code Repositories
- declare-lab/flan-alpaca (PyTorch, mentioned in GitHub)
- joelniklaus/lawinstruct (mentioned in GitHub)
- formulamonks/llm-benchmarker-suite (PyTorch, mentioned in GitHub)
- google-research/flan (TensorFlow, mentioned in GitHub)
- theoremone/llm-benchmarker-suite (PyTorch, mentioned in GitHub)
- zchuz/timebench (mentioned in GitHub)
- kapllan/zeroshot_lexglue (mentioned in GitHub)
- coastalcph/zeroshot_lexglue (mentioned in GitHub)
Benchmarks
| Benchmark | Method | Metric |
|---|---|---|
| coreference-resolution-on-winograd-schema | Flan-T5 XXL (zero-shot) | Accuracy: 89.82 |
| cross-lingual-question-answering-on-tydiqa | Flan-PaLM 540B (direct-prompting) | EM: 67.8 |
| cross-lingual-question-answering-on-tydiqa | Flan-U-PaLM 540B (direct-prompting) | EM: 68.3 |
| multi-task-language-understanding-on-bbh-alg | Flan-PaLM 540B (3-shot, fine-tuned, CoT) | Average (%): 61.3 |
| multi-task-language-understanding-on-bbh-alg | PaLM 540B (CoT) | Average (%): 57.6 |
| multi-task-language-understanding-on-bbh-alg | Flan-PaLM 540B (3-shot, fine-tuned, CoT + SC) | Average (%): 66.5 |
| multi-task-language-understanding-on-bbh-alg | PaLM 540B | Average (%): 38.3 |
| multi-task-language-understanding-on-bbh-alg | Flan-PaLM 540B (3-shot, fine-tuned) | Average (%): 48.2 |
| multi-task-language-understanding-on-bbh-alg | PaLM 540B (CoT + self-consistency) | Average (%): 62.2 |
| multi-task-language-understanding-on-bbh-nlp | PaLM 540B (CoT) | Average (%): 71.2 |
| multi-task-language-understanding-on-bbh-nlp | PaLM 540B | Average (%): 62.7 |
| multi-task-language-understanding-on-bbh-nlp | Flan-PaLM 540B (5-shot, fine-tuned) | Average (%): 70.0 |
| multi-task-language-understanding-on-bbh-nlp | Flan-PaLM 540B (3-shot, fine-tuned, CoT + SC) | Average (%): 78.4 |
| multi-task-language-understanding-on-bbh-nlp | PaLM 540B (CoT + self-consistency) | Average (%): 78.2 |
| multi-task-language-understanding-on-bbh-nlp | Flan-PaLM 540B (3-shot, fine-tuned, CoT) | Average (%): 72.4 |
| multi-task-language-understanding-on-mgsm | Flan-U-PaLM 540B (CoT) | Average (%): 60.4 |
| multi-task-language-understanding-on-mgsm | Flan-PaLM 540B (8-shot, fine-tuned, CoT + SC) | Average (%): 72.0 |
| multi-task-language-understanding-on-mgsm | code-davinci-002 | Average (%): 35 |
| multi-task-language-understanding-on-mgsm | Flan-PaLM 540B (8-shot, fine-tuned, CoT) | Average (%): 57.0 |
| multi-task-language-understanding-on-mgsm | GPT-3 Davinci 175B | Average (%): 5.7 |
| multi-task-language-understanding-on-mgsm | text-davinci-003 | Average (%): 36 |
| multi-task-language-understanding-on-mgsm | Flan-PaLM 540B (8-shot, fine-tuned) | Average (%): 21.2 |
| multi-task-language-understanding-on-mgsm | text-davinci-002 | Average (%): 23.7 |
| multi-task-language-understanding-on-mmlu | Flan-T5-Base 250M (CoT) | Average (%): 33.7 |
| multi-task-language-understanding-on-mmlu | LLaMA 2 (65B) | Average (%): 73.5 |
| multi-task-language-understanding-on-mmlu | Flan-T5-Small 80M | Average (%): 28.7 |
| multi-task-language-understanding-on-mmlu | GPT-3 Davinci 175B (CoT) | Average (%): 59.5 |
| multi-task-language-understanding-on-mmlu | Flan-T5-Large 780M | Average (%): 45.1 |
| multi-task-language-understanding-on-mmlu | Flan-T5-XL 3B (CoT) | Average (%): 45.5 |
| multi-task-language-understanding-on-mmlu | Flan-T5-Base 250M | Average (%): 35.9 |
| multi-task-language-understanding-on-mmlu | Flan-PaLM (5-shot, fine-tuned) | Average (%): 72.2 |
| multi-task-language-understanding-on-mmlu | Flan-T5-Large 780M (CoT) | Average (%): 40.5 |
| multi-task-language-understanding-on-mmlu | GPT-3 Davinci 175B (5-shot) | Average (%): 39.7 |
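Several rows above report "CoT + SC", i.e. chain-of-thought prompting with self-consistency: the model samples multiple reasoning paths and the final answer is chosen by majority vote over the extracted answers. The sketch below illustrates that decoding scheme with a Flan-T5 checkpoint; the model size, sampling parameters, trigger phrase, and answer-extraction heuristic are illustrative assumptions, not the paper's evaluation configuration:

```python
# Self-consistency decoding sketch: sample several chain-of-thought
# completions and majority-vote the extracted answers. Sampling
# settings and the extraction heuristic are illustrative assumptions.
import re
from collections import Counter

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/flan-t5-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

question = "If there are 3 cars and each car has 4 wheels, how many wheels are there?"
# Zero-shot CoT trigger phrase of the kind used in the paper.
prompt = f"Q: {question}\nA: Let's think step by step."

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,          # stochastic decoding so samples differ
    temperature=0.7,
    max_new_tokens=128,
    num_return_sequences=8,  # number of sampled reasoning paths
)

def extract_answer(text: str) -> str:
    """Heuristic: take the last number in the completion as the answer."""
    numbers = re.findall(r"-?\d+\.?\d*", text)
    return numbers[-1] if numbers else text.strip()

answers = [
    extract_answer(tokenizer.decode(o, skip_special_tokens=True))
    for o in outputs
]
# The majority answer across sampled paths is the self-consistent prediction.
print(Counter(answers).most_common(1)[0][0])
```

The design intuition is that diverse reasoning paths tend to err in different ways but converge on the correct answer, so voting over many samples is more reliable than a single greedy decode, which matches the CoT + SC gains over plain CoT in the table above.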