Scaling Instruction-Finetuned Language Models

Hyung Won Chung; Le Hou; Shayne Longpre; Barret Zoph; Yi Tay; William Fedus; Yunxuan Li; Xuezhi Wang; Mostafa Dehghani; Siddhartha Brahma; Albert Webson; Shixiang Shane Gu; Zhuyun Dai; Mirac Suzgun; Xinyun Chen; Aakanksha Chowdhery; Alex Castro-Ros; Marie Pellat; Kevin Robinson; Dasha Valter; Sharan Narang; Gaurav Mishra; Adams Yu; Vincent Zhao; Yanping Huang; Andrew Dai; Hongkun Yu; Slav Petrov; Ed H. Chi; Jeff Dean; Jacob Devlin; Adam Roberts; Denny Zhou; Quoc V. Le; Jason Wei

Abstract

Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. In this paper, we explore instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. We find that instruction finetuning with the above aspects dramatically improves performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation). For instance, Flan-PaLM 540B instruction-finetuned on 1.8K tasks outperforms PaLM 540B by a large margin (+9.4% on average). Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU. We also publicly release Flan-T5 checkpoints, which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models.
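The released Flan-T5 checkpoints are distributed through the Hugging Face Hub. As a minimal inference sketch (assuming the `transformers` library and the `google/flan-t5-base` checkpoint name; larger sizes follow the same pattern), zero-shot instruction following looks like this:

```python
# Minimal zero-shot inference sketch with a released Flan-T5 checkpoint.
# Assumes the Hugging Face `transformers` library and the
# `google/flan-t5-base` checkpoint name on the Hub.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

# Flan models take natural-language instructions directly,
# with no task-specific head or finetuning step at inference time.
prompt = "Answer the following question. What is the capital of France?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```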

Code Repositories

declare-lab/flan-alpaca (PyTorch)
joelniklaus/lawinstruct
formulamonks/llm-benchmarker-suite (PyTorch)
google-research/flan (TensorFlow)
theoremone/llm-benchmarker-suite (PyTorch)
zchuz/timebench
kapllan/zeroshot_lexglue
coastalcph/zeroshot_lexglue
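The google-research/flan repository hosts the task collection used for instruction finetuning. Conceptually, each training example pairs an instruction-phrased input with a target, and chain-of-thought examples put the rationale in the target. A hedged sketch of the record shape (field names are illustrative, not the repository's actual schema):

```python
# Illustrative shape of instruction-finetuning examples.
# Field names here are hypothetical, not the schema used in
# google-research/flan.
direct_example = {
    "inputs": "Answer the following question. What is 13 + 24?",
    "targets": "37",
}

cot_example = {
    "inputs": "Answer the following question, reasoning step by step. "
              "What is 13 + 24?",
    # For chain-of-thought finetuning, the target contains the rationale
    # followed by the final answer, so the model learns to emit both.
    "targets": "13 + 24 = 37. The answer is 37.",
}
```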

Benchmarks

| Benchmark | Methodology | Metrics |
|---|---|---|
| Coreference Resolution on Winograd Schema | Flan-T5 XXL (zero-shot) | Accuracy: 89.82 |
| Cross-Lingual Question Answering on TyDiQA | Flan-PaLM 540B (direct prompting) | EM: 67.8 |
| Cross-Lingual Question Answering on TyDiQA | Flan-U-PaLM 540B (direct prompting) | EM: 68.3 |
| Multi-task Language Understanding on BBH-alg | Flan-PaLM 540B (3-shot, finetuned, CoT) | Average (%): 61.3 |
| Multi-task Language Understanding on BBH-alg | PaLM 540B (CoT) | Average (%): 57.6 |
| Multi-task Language Understanding on BBH-alg | Flan-PaLM 540B (3-shot, finetuned, CoT + SC) | Average (%): 66.5 |
| Multi-task Language Understanding on BBH-alg | PaLM 540B | Average (%): 38.3 |
| Multi-task Language Understanding on BBH-alg | Flan-PaLM 540B (3-shot, finetuned) | Average (%): 48.2 |
| Multi-task Language Understanding on BBH-alg | PaLM 540B (CoT + self-consistency) | Average (%): 62.2 |
| Multi-task Language Understanding on BBH-nlp | PaLM 540B (CoT) | Average (%): 71.2 |
| Multi-task Language Understanding on BBH-nlp | PaLM 540B | Average (%): 62.7 |
| Multi-task Language Understanding on BBH-nlp | Flan-PaLM 540B (5-shot, finetuned) | Average (%): 70.0 |
| Multi-task Language Understanding on BBH-nlp | Flan-PaLM 540B (3-shot, finetuned, CoT + SC) | Average (%): 78.4 |
| Multi-task Language Understanding on BBH-nlp | PaLM 540B (CoT + self-consistency) | Average (%): 78.2 |
| Multi-task Language Understanding on BBH-nlp | Flan-PaLM 540B (3-shot, finetuned, CoT) | Average (%): 72.4 |
| Multi-task Language Understanding on MGSM | Flan-U-PaLM 540B (CoT) | Average (%): 60.4 |
| Multi-task Language Understanding on MGSM | Flan-PaLM 540B (8-shot, finetuned, CoT + SC) | Average (%): 72.0 |
| Multi-task Language Understanding on MGSM | code-davinci-002 | Average (%): 35 |
| Multi-task Language Understanding on MGSM | Flan-PaLM 540B (8-shot, finetuned, CoT) | Average (%): 57.0 |
| Multi-task Language Understanding on MGSM | GPT-3 Davinci 175B | Average (%): 5.7 |
| Multi-task Language Understanding on MGSM | text-davinci-003 | Average (%): 36 |
| Multi-task Language Understanding on MGSM | Flan-PaLM 540B (8-shot, finetuned) | Average (%): 21.2 |
| Multi-task Language Understanding on MGSM | text-davinci-002 | Average (%): 23.7 |
| Multi-task Language Understanding on MMLU | Flan-T5-Base 250M (CoT) | Average (%): 33.7 |
| Multi-task Language Understanding on MMLU | LLaMA 2 (65B) | Average (%): 73.5 |
| Multi-task Language Understanding on MMLU | Flan-T5-Small 80M | Average (%): 28.7 |
| Multi-task Language Understanding on MMLU | GPT-3 Davinci 175B (CoT) | Average (%): 59.5 |
| Multi-task Language Understanding on MMLU | Flan-T5-Large 780M | Average (%): 45.1 |
| Multi-task Language Understanding on MMLU | Flan-T5-XL 3B (CoT) | Average (%): 45.5 |
| Multi-task Language Understanding on MMLU | Flan-T5-Base 250M | Average (%): 35.9 |
| Multi-task Language Understanding on MMLU | Flan-PaLM (5-shot, finetuned) | Average (%): 72.2 |
| Multi-task Language Understanding on MMLU | Flan-T5-Large 780M (CoT) | Average (%): 40.5 |
| Multi-task Language Understanding on MMLU | GPT-3 Davinci 175B (5-shot) | Average (%): 39.7 |
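Several rows above report "CoT + SC": chain-of-thought prompting combined with self-consistency, where multiple reasoning chains are sampled and the final answer is chosen by majority vote. A minimal sketch of that aggregation step (the `sample_chain` callable is a hypothetical placeholder for one stochastic CoT generation plus answer extraction):

```python
from collections import Counter

def self_consistency(sample_chain, n_samples=40):
    """Majority-vote over final answers from independently sampled CoT chains.

    `sample_chain` is a hypothetical callable that runs one
    temperature-sampled chain-of-thought generation and returns the
    extracted final answer as a string.
    """
    answers = [sample_chain() for _ in range(n_samples)]
    # The most common final answer across the sampled chains wins;
    # intermediate reasoning is discarded after answer extraction.
    answer, _count = Counter(answers).most_common(1)[0]
    return answer
```

This is why the CoT + SC rows consistently beat plain CoT in the table: sampling many chains and voting marginalizes out individual reasoning errors, at the cost of extra inference compute.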
