Finetuned Language Models Are Zero-Shot Learners

Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, Quoc V. Le

Abstract
This paper explores a simple method for improving the zero-shot learning abilities of language models. We show that instruction tuning -- finetuning language models on a collection of tasks described via instructions -- substantially improves zero-shot performance on unseen tasks. We take a 137B-parameter pretrained language model and instruction-tune it on over 60 NLP tasks verbalized via natural language instruction templates. We evaluate this instruction-tuned model, which we call FLAN, on unseen task types. FLAN substantially improves the performance of its unmodified counterpart and surpasses zero-shot 175B GPT-3 on 20 of 25 tasks that we evaluate. FLAN even outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze. Ablation studies reveal that the number of finetuning datasets, model scale, and natural language instructions are key to the success of instruction tuning.
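The core recipe the abstract describes -- rendering each training example through one of several natural language instruction templates before finetuning -- can be sketched as below. The templates and the NLI-style example here are illustrative assumptions, not the paper's actual templates.

```python
# Sketch of instruction-template verbalization: each dataset gets several
# natural-language templates, and each training example is rendered through
# a randomly chosen one. Templates below are illustrative, not FLAN's.
import random

# Hypothetical templates for an NLI-style task.
NLI_TEMPLATES = [
    "Premise: {premise}\nHypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis?",
    "{premise}\nBased on the paragraph above, "
    'can we conclude that "{hypothesis}"?',
    "Read the premise and decide whether the hypothesis follows.\n"
    "Premise: {premise}\nHypothesis: {hypothesis}",
]

def verbalize(example: dict, templates: list, rng: random.Random) -> str:
    """Render one task example through a randomly chosen instruction template."""
    template = rng.choice(templates)
    return template.format(**example)

example = {
    "premise": "A dog is running in the park.",
    "hypothesis": "An animal is outside.",
}
rng = random.Random(0)  # fixed seed so the choice is reproducible
prompt = verbalize(example, NLI_TEMPLATES, rng)
print(prompt)
```

Using several templates per dataset, rather than one fixed phrasing, is what lets the finetuned model generalize to instruction wordings it has not seen.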
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| common-sense-reasoning-on-arc-challenge | FLAN 137B (zero-shot) | Accuracy: 63.1 |
| common-sense-reasoning-on-arc-challenge | FLAN 137B (few-shot, k=13) | Accuracy: 63.8 |
| common-sense-reasoning-on-arc-easy | FLAN 137B (few-shot, k=14) | Accuracy: 80.7 |
| common-sense-reasoning-on-arc-easy | FLAN 137B (zero-shot) | Accuracy: 79.6 |
| common-sense-reasoning-on-record | FLAN 137B (zero-shot) | EM: 72.5 |
| common-sense-reasoning-on-record | FLAN 137B (prompt-tuned) | EM: 85.1 |
| common-sense-reasoning-on-winogrande | FLAN 137B (few-shot, k=16) | Accuracy: 72.8 |
| common-sense-reasoning-on-winogrande | FLAN 137B (zero-shot) | Accuracy: 71.2 |
| coreference-resolution-on-winograd-schema | FLAN 137B (prompt-tuned) | Accuracy: 86.5 |
| coreference-resolution-on-winograd-schema | FLAN 137B (zero-shot) | Accuracy: 80.8 |
| machine-translation-on-wmt2014-english-french | FLAN 137B (few-shot, k=9) | BLEU: 33.8 |
| machine-translation-on-wmt2014-english-french | FLAN 137B (zero-shot) | BLEU: 33.9 |
| machine-translation-on-wmt2014-french-english | FLAN 137B (few-shot, k=9) | BLEU: 37.9 |
| machine-translation-on-wmt2014-french-english | FLAN 137B (zero-shot) | BLEU: 35.9 |
| machine-translation-on-wmt2016-english-1 | FLAN 137B (few-shot, k=9) | BLEU: 20.5 |
| machine-translation-on-wmt2016-english-1 | FLAN 137B (zero-shot) | BLEU: 18.9 |
| machine-translation-on-wmt2016-english-german | FLAN 137B (few-shot, k=11) | BLEU: 26.1 |
| machine-translation-on-wmt2016-english-german | FLAN 137B (zero-shot) | BLEU: 27.0 |
| machine-translation-on-wmt2016-german-english | FLAN 137B (zero-shot) | BLEU: 38.9 |
| machine-translation-on-wmt2016-german-english | FLAN 137B (few-shot, k=11) | BLEU: 40.7 |
| machine-translation-on-wmt2016-romanian | FLAN 137B (few-shot, k=9) | BLEU: 38.1 |
| machine-translation-on-wmt2016-romanian | FLAN 137B (zero-shot) | BLEU: 37.3 |
| natural-language-inference-on-rte | FLAN 137B (few-shot, k=8) | Accuracy: 84.5 |
| natural-language-inference-on-rte | FLAN 137B (zero-shot) | Accuracy: 84.1 |
| natural-language-inference-on-rte | FLAN 137B (prompt-tuned) | Accuracy: 91.7 |
| natural-language-inference-on-wnli | FLAN 137B (few-shot, k=4) | Accuracy: 70.4 |
| natural-language-inference-on-wnli | FLAN 137B (zero-shot) | Accuracy: 74.6 |
| question-answering-on-boolq | FLAN 137B (few-shot, k=4) | Accuracy: 84.6 |
| question-answering-on-boolq | FLAN 137B (zero-shot) | Accuracy: 82.9 |
| question-answering-on-boolq | FLAN 137B (prompt-tuned) | Accuracy: 86.3 |
| question-answering-on-copa | FLAN 137B (prompt-tuned) | Accuracy: 94 |
| question-answering-on-copa | FLAN 137B (zero-shot) | Accuracy: 91 |
| question-answering-on-copa | FLAN 137B (few-shot, k=16) | Accuracy: 87 |
| question-answering-on-multirc | FLAN 137B (few-shot, k=1) | F1: 72.1 |
| question-answering-on-multirc | FLAN 137B (prompt-tuned) | F1: 83.4 |
| question-answering-on-multirc | FLAN 137B (zero-shot) | F1: 77.5 |
| question-answering-on-naturalqa | FLAN 137B (zero-shot) | EM: 20.7 |
| question-answering-on-obqa | FLAN 137B (few-shot, k=16) | Accuracy: 78.2 |
| question-answering-on-obqa | FLAN 137B (zero-shot) | Accuracy: 78.4 |
| question-answering-on-piqa | FLAN 137B (few-shot, k=10) | Accuracy: 81.7 |
| question-answering-on-piqa | FLAN 137B (zero-shot) | Accuracy: 80.5 |
| question-answering-on-storycloze | FLAN 137B (few-shot, k=10) | Accuracy: 94.7 |
| question-answering-on-storycloze | FLAN 137B (zero-shot) | Accuracy: 93.4 |
| question-answering-on-triviaqa | FLAN 137B (zero-shot) | EM: 56.7 |
| sentiment-analysis-on-imdb | FLAN 137B (zero-shot) | Accuracy: 94.3 |
| sentiment-analysis-on-imdb | FLAN 137B (few-shot, k=2) | Accuracy: 95 |
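A pattern worth noting in the table above is that FLAN's zero-shot setting sometimes matches or beats its own few-shot setting. The sketch below checks this on a small hard-coded subset of the rows; the helper name and the subset are illustrative, not part of the paper.

```python
# Group per-benchmark scores from a few rows of the table above and report
# where zero-shot matches or beats few-shot. Subset hard-coded for brevity.
ROWS = [
    ("common-sense-reasoning-on-arc-challenge", "zero-shot", 63.1),
    ("common-sense-reasoning-on-arc-challenge", "few-shot", 63.8),
    ("machine-translation-on-wmt2014-english-french", "few-shot", 33.8),
    ("machine-translation-on-wmt2014-english-french", "zero-shot", 33.9),
    ("question-answering-on-copa", "zero-shot", 91.0),
    ("question-answering-on-copa", "few-shot", 87.0),
]

def zero_shot_wins(rows):
    """Return benchmarks where the zero-shot score >= the few-shot score."""
    scores = {}
    for bench, setting, value in rows:
        scores.setdefault(bench, {})[setting] = value
    return [
        bench
        for bench, s in scores.items()
        if "zero-shot" in s and "few-shot" in s
        and s["zero-shot"] >= s["few-shot"]
    ]

print(zero_shot_wins(ROWS))
# On this subset, zero-shot matches or beats few-shot on WMT14 En-Fr and COPA.
```

This mirrors the abstract's claim that instruction tuning makes the model usable without in-context exemplars on several task types.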