Finetuned Language Models Are Zero-Shot Learners

Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, Quoc V. Le

Abstract
This paper explores a simple method for improving the zero-shot learning abilities of language models. We show that instruction tuning -- finetuning language models on a collection of tasks described via instructions -- substantially improves zero-shot performance on unseen tasks. We take a 137B-parameter pretrained language model and instruction-tune it on over 60 NLP tasks verbalized via natural language instruction templates. We evaluate this instruction-tuned model, which we call FLAN, on unseen task types. FLAN substantially improves the performance of its unmodified counterpart and surpasses zero-shot 175B GPT-3 on 20 of 25 tasks that we evaluate. FLAN even outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze. Ablation studies reveal that the number of finetuning datasets, model scale, and natural language instructions are key to the success of instruction tuning.
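The core recipe the abstract describes -- rendering each training example through one of several natural language instruction templates before finetuning -- can be sketched as below. The templates and the NLI-style example here are illustrative assumptions, not the paper's actual templates.

```python
# Sketch of instruction-template verbalization: each dataset gets several
# natural-language templates, and each training example is rendered through
# a randomly chosen one. Templates below are illustrative, not FLAN's.
import random

# Hypothetical templates for an NLI-style task.
NLI_TEMPLATES = [
    "Premise: {premise}\nHypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis?",
    "{premise}\nBased on the paragraph above, "
    'can we conclude that "{hypothesis}"?',
    "Read the premise and decide whether the hypothesis follows.\n"
    "Premise: {premise}\nHypothesis: {hypothesis}",
]

def verbalize(example: dict, templates: list, rng: random.Random) -> str:
    """Render one task example through a randomly chosen instruction template."""
    template = rng.choice(templates)
    return template.format(**example)

example = {
    "premise": "A dog is running in the park.",
    "hypothesis": "An animal is outside.",
}
rng = random.Random(0)  # fixed seed so the choice is reproducible
prompt = verbalize(example, NLI_TEMPLATES, rng)
print(prompt)
```

Using several templates per dataset, rather than one fixed phrasing, is what lets the finetuned model generalize to instruction wordings it has not seen.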
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| common-sense-reasoning-on-arc-challenge | FLAN 137B (zero-shot) | Accuracy: 63.1 |
| common-sense-reasoning-on-arc-challenge | FLAN 137B (few-shot, k=13) | Accuracy: 63.8 |
| common-sense-reasoning-on-arc-easy | FLAN 137B (few-shot, k=14) | Accuracy: 80.7 |
| common-sense-reasoning-on-arc-easy | FLAN 137B (zero-shot) | Accuracy: 79.6 |
| common-sense-reasoning-on-record | FLAN 137B (zero-shot) | EM: 72.5 |
| common-sense-reasoning-on-record | FLAN 137B (prompt-tuned) | EM: 85.1 |
| common-sense-reasoning-on-winogrande | FLAN 137B (few-shot, k=16) | Accuracy: 72.8 |
| common-sense-reasoning-on-winogrande | FLAN 137B (zero-shot) | Accuracy: 71.2 |
| coreference-resolution-on-winograd-schema | FLAN 137B (prompt-tuned) | Accuracy: 86.5 |
| coreference-resolution-on-winograd-schema | FLAN 137B (zero-shot) | Accuracy: 80.8 |
| machine-translation-on-wmt2014-english-french | FLAN 137B (few-shot, k=9) | BLEU: 33.8 |
| machine-translation-on-wmt2014-english-french | FLAN 137B (zero-shot) | BLEU: 33.9 |
| machine-translation-on-wmt2014-french-english | FLAN 137B (few-shot, k=9) | BLEU: 37.9 |
| machine-translation-on-wmt2014-french-english | FLAN 137B (zero-shot) | BLEU: 35.9 |
| machine-translation-on-wmt2016-english-1 | FLAN 137B (few-shot, k=9) | BLEU: 20.5 |
| machine-translation-on-wmt2016-english-1 | FLAN 137B (zero-shot) | BLEU: 18.9 |
| machine-translation-on-wmt2016-english-german | FLAN 137B (few-shot, k=11) | BLEU: 26.1 |
| machine-translation-on-wmt2016-english-german | FLAN 137B (zero-shot) | BLEU: 27.0 |
| machine-translation-on-wmt2016-german-english | FLAN 137B (zero-shot) | BLEU: 38.9 |
| machine-translation-on-wmt2016-german-english | FLAN 137B (few-shot, k=11) | BLEU: 40.7 |
| machine-translation-on-wmt2016-romanian | FLAN 137B (few-shot, k=9) | BLEU: 38.1 |
| machine-translation-on-wmt2016-romanian | FLAN 137B (zero-shot) | BLEU: 37.3 |
| natural-language-inference-on-rte | FLAN 137B (few-shot, k=8) | Accuracy: 84.5 |
| natural-language-inference-on-rte | FLAN 137B (zero-shot) | Accuracy: 84.1 |
| natural-language-inference-on-rte | FLAN 137B (prompt-tuned) | Accuracy: 91.7 |
| natural-language-inference-on-wnli | FLAN 137B (few-shot, k=4) | Accuracy: 70.4 |
| natural-language-inference-on-wnli | FLAN 137B (zero-shot) | Accuracy: 74.6 |
| question-answering-on-boolq | FLAN 137B (few-shot, k=4) | Accuracy: 84.6 |
| question-answering-on-boolq | FLAN 137B (zero-shot) | Accuracy: 82.9 |
| question-answering-on-boolq | FLAN 137B (prompt-tuned) | Accuracy: 86.3 |
| question-answering-on-copa | FLAN 137B (prompt-tuned) | Accuracy: 94 |
| question-answering-on-copa | FLAN 137B (zero-shot) | Accuracy: 91 |
| question-answering-on-copa | FLAN 137B (few-shot, k=16) | Accuracy: 87 |
| question-answering-on-multirc | FLAN 137B (few-shot, k=1) | F1: 72.1 |
| question-answering-on-multirc | FLAN 137B (prompt-tuned) | F1: 83.4 |
| question-answering-on-multirc | FLAN 137B (zero-shot) | F1: 77.5 |
| question-answering-on-naturalqa | FLAN 137B (zero-shot) | EM: 20.7 |
| question-answering-on-obqa | FLAN 137B (few-shot, k=16) | Accuracy: 78.2 |
| question-answering-on-obqa | FLAN 137B (zero-shot) | Accuracy: 78.4 |
| question-answering-on-piqa | FLAN 137B (few-shot, k=10) | Accuracy: 81.7 |
| question-answering-on-piqa | FLAN 137B (zero-shot) | Accuracy: 80.5 |
| question-answering-on-storycloze | FLAN 137B (few-shot, k=10) | Accuracy: 94.7 |
| question-answering-on-storycloze | FLAN 137B (zero-shot) | Accuracy: 93.4 |
| question-answering-on-triviaqa | FLAN 137B (zero-shot) | EM: 56.7 |
| sentiment-analysis-on-imdb | FLAN 137B (zero-shot) | Accuracy: 94.3 |
| sentiment-analysis-on-imdb | FLAN 137B (few-shot, k=2) | Accuracy: 95 |
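A pattern worth noting in the table above is that FLAN's zero-shot setting sometimes matches or beats its own few-shot setting. The sketch below checks this on a small hard-coded subset of the rows; the helper name and the subset are illustrative, not part of the paper.

```python
# Group per-benchmark scores from a few rows of the table above and report
# where zero-shot matches or beats few-shot. Subset hard-coded for brevity.
ROWS = [
    ("common-sense-reasoning-on-arc-challenge", "zero-shot", 63.1),
    ("common-sense-reasoning-on-arc-challenge", "few-shot", 63.8),
    ("machine-translation-on-wmt2014-english-french", "few-shot", 33.8),
    ("machine-translation-on-wmt2014-english-french", "zero-shot", 33.9),
    ("question-answering-on-copa", "zero-shot", 91.0),
    ("question-answering-on-copa", "few-shot", 87.0),
]

def zero_shot_wins(rows):
    """Return benchmarks where the zero-shot score >= the few-shot score."""
    scores = {}
    for bench, setting, value in rows:
        scores.setdefault(bench, {})[setting] = value
    return [
        bench
        for bench, s in scores.items()
        if "zero-shot" in s and "few-shot" in s
        and s["zero-shot"] >= s["few-shot"]
    ]

print(zero_shot_wins(ROWS))
# On this subset, zero-shot matches or beats few-shot on WMT14 En-Fr and COPA.
```

This mirrors the abstract's claim that instruction tuning makes the model usable without in-context exemplars on several task types.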