Jeffrey Wu, Rewon Child, Ilya Sutskever, David Luan, Alec Radford, Dario Amodei
Abstract
Natural language processing tasks, such as question answering, machine translation, reading comprehension, and summarization, are typically approached with supervised learning on task-specific datasets. We demonstrate that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText. When conditioned on a document plus questions, the answers generated by the language model reach 55 F1 on the CoQA dataset - matching or exceeding the performance of 3 out of 4 baseline systems without using the 127,000+ training examples. The capacity of the language model is essential to the success of zero-shot task transfer and increasing it improves performance in a log-linear fashion across tasks. Our largest model, GPT-2, is a 1.5B parameter Transformer that achieves state of the art results on 7 out of 8 tested language modeling datasets in a zero-shot setting but still underfits WebText. Samples from the model reflect these improvements and contain coherent paragraphs of text. These findings suggest a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
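As a concrete illustration of the zero-shot setup described in the abstract, the sketch below conditions a pretrained GPT-2 checkpoint on a passage plus a question and greedily decodes an answer. It uses the Hugging Face `transformers` release of GPT-2 ("gpt2-xl"); the passage, prompt wording, and decoding settings are illustrative assumptions, not the paper's exact evaluation protocol.

```python
# Minimal sketch: zero-shot QA by conditioning GPT-2 on "document + question".
# Assumes the Hugging Face `transformers` checkpoints; prompt format is illustrative.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl")
model.eval()

passage = (
    "The Transformer architecture was introduced in 2017 and relies entirely "
    "on attention mechanisms, dispensing with recurrence and convolutions."
)
question = "When was the Transformer architecture introduced?"

# Conditioning context: document, then question, then an answer cue.
prompt = f"{passage}\nQ: {question}\nA:"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    output_ids = model.generate(
        input_ids,
        max_new_tokens=20,
        do_sample=False,                      # greedy decoding
        pad_token_id=tokenizer.eos_token_id,  # silence the missing-pad warning
    )

# Keep only the tokens generated after the prompt.
answer = tokenizer.decode(output_ids[0, input_ids.shape[1]:], skip_special_tokens=True)
print(answer.strip())
```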
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| coreference-resolution-on-winograd-schema | GPT-2-XL 1.5B | Accuracy: 70.7 |
| dialogue-state-tracking-on-simmc2-0 | GPT-2 | Act F1: 94.5, Slot F1: 81.7 |
| document-summarization-on-cnn-daily-mail | GPT-2 | ROUGE-1: 29.34, ROUGE-2: 8.27, ROUGE-L: 26.58 |
| language-modelling-on-enwiki8 | GPT-2 (48 layers, h=1600) | Bit per Character (BPC): 0.93, Number of params: 1542M |
| language-modelling-on-lambada | GPT-2 1.5B (Zero Shot) | Accuracy: 63.24, Perplexity: 8.63 |
| language-modelling-on-one-billion-word | GPT-2 | Number of params: 1.54B, PPL: 42.16 |
| language-modelling-on-penn-treebank-word | GPT-2 | Params: 1542M, Test perplexity: 35.76 |
| language-modelling-on-text8 | GPT-2 | Bit per Character (BPC): 0.98, Number of params: 1542M |
| language-modelling-on-wikitext-103 | GPT-2 Small | Number of params: 124M, Test perplexity: 37.50 |
| language-modelling-on-wikitext-103 | GPT-2 Medium | Number of params: 355M, Test perplexity: 26.37 |
| language-modelling-on-wikitext-103 | GPT-2 Large | Number of params: 774M, Test perplexity: 22.05 |
| language-modelling-on-wikitext-103 | GPT-2 Full | Number of params: 1542M, Test perplexity: 17.48 |
| language-modelling-on-wikitext-2 | GPT-2 (small) | Number of params: 117M, Test perplexity: 29.41 |
| language-modelling-on-wikitext-2 | GPT-2 (medium) | Number of params: 345M, Test perplexity: 22.76 |
| language-modelling-on-wikitext-2 | GPT-2 (large) | Number of params: 762M, Test perplexity: 19.93 |
| language-modelling-on-wikitext-2 | GPT-2 | Number of params: 1542M, Test perplexity: 18.34 |
| question-answering-on-fever | Zero-shot | EM: 50 |
| question-answering-on-webquestions | Zero-shot | EM: 43 |
| response-generation-on-simmc2-0 | GPT-2 | BLEU: 19.2 |
| sentiment-analysis-on-imdb | GPT-2 Finetuned | Accuracy: 92.36 |
| text-generation-on-openwebtext | GPT2-124M | eval_loss: 3.12 |
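The language-modelling rows above report zero-shot perplexity or bits per character. As a rough sketch of how such a number is obtained, the snippet below scores a short text with the Hugging Face "gpt2" checkpoint and converts the average negative log-likelihood into perplexity; the published results additionally rely on dataset-specific preprocessing and invertible detokenizers that are not reproduced here, so this is illustrative only.

```python
# Minimal sketch: perplexity of GPT-2 on a toy string (not the benchmark protocol).
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

text = "Language models begin to learn many tasks without explicit supervision."
enc = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model return the mean token cross-entropy as `loss`.
    out = model(enc.input_ids, labels=enc.input_ids)

nll = out.loss.item()                 # average negative log-likelihood per token (nats)
print(f"perplexity: {math.exp(nll):.2f}")
```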