Scaling Instruction-Finetuned Language Models
Hyung Won Chung; Le Hou; Shayne Longpre; Barret Zoph; Yi Tay; William Fedus; Yunxuan Li; Xuezhi Wang; Mostafa Dehghani; Siddhartha Brahma; Albert Webson; Shixiang Shane Gu; Zhuyun Dai; Mirac Suzgun; Xinyun Chen; Aakanksha Chowdhery; Alex Castro-Ros; Marie Pellat; Kevin Robinson; Dasha Valter; Sharan Narang; Gaurav Mishra; Adams Yu; Vincent Zhao; Yanping Huang; Andrew Dai; Hongkun Yu; Slav Petrov; Ed H. Chi; Jeff Dean; Jacob Devlin; Adam Roberts; Denny Zhou; Quoc V. Le; Jason Wei

Abstract
Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. In this paper we explore instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. We find that instruction finetuning with the above aspects dramatically improves performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation). For instance, Flan-PaLM 540B instruction-finetuned on 1.8K tasks outperforms PaLM 540B by a large margin (+9.4% on average). Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU. We also publicly release Flan-T5 checkpoints, which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models.
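Since the abstract notes that the Flan-T5 checkpoints are publicly released, here is a minimal sketch of running zero-shot, instruction-phrased inference with one of them. It assumes the Hugging Face `transformers` library and the public `google/flan-t5-xl` model id, neither of which is named in this section; the paper itself covers checkpoints from Flan-T5-Small (80M) to Flan-T5-XXL (11B).

```python
# Minimal sketch: zero-shot inference with a released Flan-T5 checkpoint.
# Assumes `transformers` (and a PyTorch backend) is installed; swap the
# model id for -small/-base/-large/-xxl variants as needed.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "google/flan-t5-xl"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Instruction-phrased prompt, in the spirit of the finetuning mixture.
prompt = "Answer the following question. What is the capital of France?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because the model was instruction-finetuned, no few-shot exemplars are required in the prompt; a bare natural-language instruction is the intended input format.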
Benchmarks
| Benchmark | Methodology | Metric |
|---|---|---|
| Coreference Resolution on Winograd Schema | Flan-T5-XXL (zero-shot) | Accuracy: 89.82 |
| Cross-Lingual Question Answering on TyDiQA | Flan-PaLM 540B (direct prompting) | EM: 67.8 |
| Cross-Lingual Question Answering on TyDiQA | Flan-U-PaLM 540B (direct prompting) | EM: 68.3 |
| Multi-task Language Understanding on BBH (algorithmic) | Flan-PaLM 540B (3-shot, finetuned, CoT) | Average (%): 61.3 |
| Multi-task Language Understanding on BBH (algorithmic) | PaLM 540B (CoT) | Average (%): 57.6 |
| Multi-task Language Understanding on BBH (algorithmic) | Flan-PaLM 540B (3-shot, finetuned, CoT + SC) | Average (%): 66.5 |
| Multi-task Language Understanding on BBH (algorithmic) | PaLM 540B | Average (%): 38.3 |
| Multi-task Language Understanding on BBH (algorithmic) | Flan-PaLM 540B (3-shot, finetuned) | Average (%): 48.2 |
| Multi-task Language Understanding on BBH (algorithmic) | PaLM 540B (CoT + SC) | Average (%): 62.2 |
| Multi-task Language Understanding on BBH (NLP) | PaLM 540B (CoT) | Average (%): 71.2 |
| Multi-task Language Understanding on BBH (NLP) | PaLM 540B | Average (%): 62.7 |
| Multi-task Language Understanding on BBH (NLP) | Flan-PaLM 540B (5-shot, finetuned) | Average (%): 70.0 |
| Multi-task Language Understanding on BBH (NLP) | Flan-PaLM 540B (3-shot, finetuned, CoT + SC) | Average (%): 78.4 |
| Multi-task Language Understanding on BBH (NLP) | PaLM 540B (CoT + SC) | Average (%): 78.2 |
| Multi-task Language Understanding on BBH (NLP) | Flan-PaLM 540B (3-shot, finetuned, CoT) | Average (%): 72.4 |
| Multi-task Language Understanding on MGSM | Flan-U-PaLM 540B (CoT) | Average (%): 60.4 |
| Multi-task Language Understanding on MGSM | Flan-PaLM 540B (8-shot, finetuned, CoT + SC) | Average (%): 72.0 |
| Multi-task Language Understanding on MGSM | code-davinci-002 | Average (%): 35 |
| Multi-task Language Understanding on MGSM | Flan-PaLM 540B (8-shot, finetuned, CoT) | Average (%): 57.0 |
| Multi-task Language Understanding on MGSM | GPT-3 Davinci 175B | Average (%): 5.7 |
| Multi-task Language Understanding on MGSM | text-davinci-003 | Average (%): 36 |
| Multi-task Language Understanding on MGSM | Flan-PaLM 540B (8-shot, finetuned) | Average (%): 21.2 |
| Multi-task Language Understanding on MGSM | text-davinci-002 | Average (%): 23.7 |
| Multi-task Language Understanding on MMLU | Flan-T5-Base 250M (CoT) | Average (%): 33.7 |
| Multi-task Language Understanding on MMLU | Llama 2 (65B) | Average (%): 73.5 |
| Multi-task Language Understanding on MMLU | Flan-T5-Small 80M | Average (%): 28.7 |
| Multi-task Language Understanding on MMLU | GPT-3 Davinci 175B (CoT) | Average (%): 59.5 |
| Multi-task Language Understanding on MMLU | Flan-T5-Large 780M | Average (%): 45.1 |
| Multi-task Language Understanding on MMLU | Flan-T5-XL 3B (CoT) | Average (%): 45.5 |
| Multi-task Language Understanding on MMLU | Flan-T5-Base 250M | Average (%): 35.9 |
| Multi-task Language Understanding on MMLU | Flan-PaLM 540B (5-shot, finetuned) | Average (%): 72.2 |
| Multi-task Language Understanding on MMLU | Flan-T5-Large 780M (CoT) | Average (%): 40.5 |
| Multi-task Language Understanding on MMLU | GPT-3 Davinci 175B (5-shot) | Average (%): 39.7 |
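The "CoT + SC" rows above combine chain-of-thought prompting with self-consistency decoding: sample several reasoning paths at nonzero temperature, extract each path's final answer, and return the majority vote. Below is a minimal sketch of that procedure; `generate_cot` is a hypothetical stand-in for any sampling-based generation call, and the answer-extraction regex assumes CoT exemplars that end with "The answer is X." (a common convention, not an API from this paper).

```python
# Minimal sketch of self-consistency over chain-of-thought samples.
import re
from collections import Counter

def extract_answer(completion: str) -> str:
    # Assumes completions end with a phrase like "The answer is 42."
    match = re.search(r"[Tt]he answer is\s*(.+?)\.", completion)
    return match.group(1).strip() if match else completion.strip()

def self_consistency(generate_cot, prompt: str, num_samples: int = 16) -> str:
    # generate_cot is hypothetical: (prompt, temperature) -> sampled completion.
    answers = [
        extract_answer(generate_cot(prompt, temperature=0.7))
        for _ in range(num_samples)
    ]
    # Majority vote over the sampled final answers.
    return Counter(answers).most_common(1)[0][0]
```

The design intuition: a hard problem admits many valid reasoning paths but (usually) one correct answer, so agreement across independently sampled paths is evidence of correctness. This is why the CoT + SC rows consistently improve over single-path CoT in the table.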