
Finetuned Language Models Are Zero-Shot Learners

Abstract

This paper presents a simple and effective method for improving the zero-shot learning ability of language models. We show that instruction tuning, i.e., finetuning a language model on a collection of tasks described via natural-language instruction templates, substantially improves zero-shot performance on unseen tasks. We take a pretrained language model with 137 billion parameters and instruction-tune it on more than 60 natural language processing tasks, each expressed via natural-language instruction templates. We call the resulting model FLAN and evaluate it on unseen task types. FLAN substantially outperforms its unmodified counterpart and surpasses the zero-shot 175-billion-parameter GPT-3 on 20 of the 25 tasks we evaluate. FLAN even outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze. Ablation studies further reveal that the number of finetuning datasets, model scale, and the design of the natural-language instructions are key to the success of instruction tuning.
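
The mechanical core of instruction tuning is data formatting: each labelled example is rendered into several natural-language instruction templates, and the model is then finetuned on the resulting (instruction, target) text pairs. The sketch below illustrates this formatting step for a natural language inference example; the template wording, label mapping, and helper names are illustrative assumptions, not the exact templates used in the paper (which hand-writes multiple templates per dataset).

```python
# Minimal sketch of instruction-template formatting for instruction tuning.
# Template wording and function names are illustrative only.

# Several phrasings of the same NLI task, so the model learns to follow
# instructions rather than a single fixed prompt format.
NLI_TEMPLATES = [
    "Premise: {premise}\nHypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis? OPTIONS: yes, maybe, no",
    "{premise}\nBased on the paragraph above, can we conclude that "
    '"{hypothesis}"? OPTIONS: yes, maybe, no',
    "Read the premise and decide whether the hypothesis follows.\n"
    "Premise: {premise}\nHypothesis: {hypothesis}\nOPTIONS: yes, maybe, no",
]

# Map class ids (entailment / neutral / contradiction) to answer words.
LABEL_WORDS = {0: "yes", 1: "maybe", 2: "no"}


def to_instruction_pairs(example: dict) -> list[tuple[str, str]]:
    """Render one labelled NLI example into (instruction, target) text pairs."""
    target = LABEL_WORDS[example["label"]]
    return [
        (t.format(premise=example["premise"], hypothesis=example["hypothesis"]), target)
        for t in NLI_TEMPLATES
    ]


if __name__ == "__main__":
    example = {
        "premise": "A soccer game with multiple males playing.",
        "hypothesis": "Some men are playing a sport.",
        "label": 0,
    }
    for instruction, target in to_instruction_pairs(example):
        print(instruction, "->", target)
    # The mixture of such pairs, drawn from the 60+ training datasets, is what
    # the 137B model is finetuned on; evaluation then covers task clusters
    # that were held out entirely from this mixture.
```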

Code Repositories

Benchmarks

| Benchmark | Model | Metric | Value |
|---|---|---|---|
| common-sense-reasoning-on-arc-challenge | FLAN 137B (zero-shot) | Accuracy | 63.1 |
| common-sense-reasoning-on-arc-challenge | FLAN 137B (few-shot, k=13) | Accuracy | 63.8 |
| common-sense-reasoning-on-arc-easy | FLAN 137B (few-shot, k=14) | Accuracy | 80.7 |
| common-sense-reasoning-on-arc-easy | FLAN 137B (0-shot) | Accuracy | 79.6 |
| common-sense-reasoning-on-record | FLAN 137B (zero-shot) | EM | 72.5 |
| common-sense-reasoning-on-record | FLAN 137B (prompt-tuned) | EM | 85.1 |
| common-sense-reasoning-on-winogrande | FLAN 137B (few-shot, k=16) | Accuracy | 72.8 |
| common-sense-reasoning-on-winogrande | FLAN 137B (0-shot) | Accuracy | 71.2 |
| coreference-resolution-on-winograd-schema | FLAN 137B (prompt-tuned) | Accuracy | 86.5 |
| coreference-resolution-on-winograd-schema | FLAN 137B (zero-shot) | Accuracy | 80.8 |
| machine-translation-on-wmt2014-english-french | FLAN 137B (few-shot, k=9) | BLEU score | 33.8 |
| machine-translation-on-wmt2014-english-french | FLAN 137B (zero-shot) | BLEU score | 33.9 |
| machine-translation-on-wmt2014-french-english | FLAN 137B (few-shot, k=9) | BLEU score | 37.9 |
| machine-translation-on-wmt2014-french-english | FLAN 137B (zero-shot) | BLEU score | 35.9 |
| machine-translation-on-wmt2016-english-1 | FLAN 137B (few-shot, k=9) | BLEU score | 20.5 |
| machine-translation-on-wmt2016-english-1 | FLAN 137B (zero-shot) | BLEU score | 18.9 |
| machine-translation-on-wmt2016-english-german | FLAN 137B (few-shot, k=11) | BLEU score | 26.1 |
| machine-translation-on-wmt2016-english-german | FLAN 137B (zero-shot) | BLEU score | 27.0 |
| machine-translation-on-wmt2016-german-english | FLAN 137B (zero-shot) | BLEU score | 38.9 |
| machine-translation-on-wmt2016-german-english | FLAN 137B (few-shot, k=11) | BLEU score | 40.7 |
| machine-translation-on-wmt2016-romanian | FLAN 137B (few-shot, k=9) | BLEU score | 38.1 |
| machine-translation-on-wmt2016-romanian | FLAN 137B (zero-shot) | BLEU score | 37.3 |
| natural-language-inference-on-rte | FLAN 137B (8-shot) | Accuracy | 84.5% |
| natural-language-inference-on-rte | FLAN 137B (0-shot) | Accuracy | 84.1% |
| natural-language-inference-on-rte | FLAN 137B (prompt-tuned) | Accuracy | 91.7% |
| natural-language-inference-on-wnli | FLAN 137B (few-shot, k=4) | Accuracy | 70.4 |
| natural-language-inference-on-wnli | FLAN 137B (zero-shot) | Accuracy | 74.6 |
| question-answering-on-boolq | FLAN 137B (4-shot) | Accuracy | 84.6 |
| question-answering-on-boolq | FLAN 137B (0-shot) | Accuracy | 82.9 |
| question-answering-on-boolq | FLAN 137B (prompt-tuned) | Accuracy | 86.3 |
| question-answering-on-copa | FLAN 137B (prompt-tuned) | Accuracy | 94 |
| question-answering-on-copa | FLAN 137B (zero-shot) | Accuracy | 91 |
| question-answering-on-copa | FLAN 137B (few-shot, k=16) | Accuracy | 87 |
| question-answering-on-multirc | FLAN 137B (1-shot) | F1 | 72.1 |
| question-answering-on-multirc | FLAN 137B (prompt-tuned) | F1 | 83.4 |
| question-answering-on-multirc | FLAN 137B (zero-shot) | F1 | 77.5 |
| question-answering-on-naturalqa | FLAN 137B (zero-shot) | EM | 20.7 |
| question-answering-on-obqa | FLAN 137B (few-shot, k=16) | Accuracy | 78.2 |
| question-answering-on-obqa | FLAN 137B (zero-shot) | Accuracy | 78.4 |
| question-answering-on-piqa | FLAN 137B (few-shot, k=10) | Accuracy | 81.7 |
| question-answering-on-piqa | FLAN 137B (0-shot) | Accuracy | 80.5 |
| question-answering-on-storycloze | FLAN 137B (few-shot, k=10) | Accuracy | 94.7 |
| question-answering-on-storycloze | FLAN 137B (zero-shot) | Accuracy | 93.4 |
| question-answering-on-triviaqa | FLAN 137B (zero-shot) | EM | 56.7 |
| sentiment-analysis-on-imdb | FLAN 137B (zero-shot) | Accuracy | 94.3 |
| sentiment-analysis-on-imdb | FLAN 137B (few-shot, k=2) | Accuracy | 95 |
