Jeffrey Wu, Rewon Child, Ilya Sutskever, David Luan, Alec Radford, Dario Amodei
Abstract
Natural language processing tasks, such as question answering, machine translation, reading comprehension, and summarization, are typically approached with supervised learning on task-specific datasets. We demonstrate that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText. When conditioned on a document plus questions, the answers generated by the language model reach 55 F1 on the CoQA dataset - matching or exceeding the performance of 3 out of 4 baseline systems without using the 127,000+ training examples. The capacity of the language model is essential to the success of zero-shot task transfer and increasing it improves performance in a log-linear fashion across tasks. Our largest model, GPT-2, is a 1.5B parameter Transformer that achieves state of the art results on 7 out of 8 tested language modeling datasets in a zero-shot setting but still underfits WebText. Samples from the model reflect these improvements and contain coherent paragraphs of text. These findings suggest a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
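As a concrete illustration of the zero-shot setup described in the abstract, the sketch below conditions a pretrained GPT-2 checkpoint on a passage plus a question and greedily decodes an answer. It uses the Hugging Face `transformers` release of GPT-2 ("gpt2-xl"); the passage, prompt wording, and decoding settings are illustrative assumptions, not the paper's exact evaluation protocol.

```python
# Minimal sketch: zero-shot QA by conditioning GPT-2 on "document + question".
# Assumes the Hugging Face `transformers` checkpoints; prompt format is illustrative.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl")
model.eval()

passage = (
    "The Transformer architecture was introduced in 2017 and relies entirely "
    "on attention mechanisms, dispensing with recurrence and convolutions."
)
question = "When was the Transformer architecture introduced?"

# Conditioning context: document, then question, then an answer cue.
prompt = f"{passage}\nQ: {question}\nA:"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    output_ids = model.generate(
        input_ids,
        max_new_tokens=20,
        do_sample=False,                      # greedy decoding
        pad_token_id=tokenizer.eos_token_id,  # silence the missing-pad warning
    )

# Keep only the tokens generated after the prompt.
answer = tokenizer.decode(output_ids[0, input_ids.shape[1]:], skip_special_tokens=True)
print(answer.strip())
```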
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| coreference-resolution-on-winograd-schema | GPT-2-XL 1.5B | Accuracy: 70.7 |
| dialogue-state-tracking-on-simmc2-0 | GPT-2 | Act F1: 94.5, Slot F1: 81.7 |
| document-summarization-on-cnn-daily-mail | GPT-2 | ROUGE-1: 29.34, ROUGE-2: 8.27, ROUGE-L: 26.58 |
| language-modelling-on-enwiki8 | GPT-2 (48 layers, h=1600) | Bit per Character (BPC): 0.93, Number of params: 1542M |
| language-modelling-on-lambada | GPT-2 1.5B (Zero Shot) | Accuracy: 63.24, Perplexity: 8.63 |
| language-modelling-on-one-billion-word | GPT-2 | Number of params: 1.54B, PPL: 42.16 |
| language-modelling-on-penn-treebank-word | GPT-2 | Params: 1542M, Test perplexity: 35.76 |
| language-modelling-on-text8 | GPT-2 | Bit per Character (BPC): 0.98, Number of params: 1542M |
| language-modelling-on-wikitext-103 | GPT-2 Small | Number of params: 124M, Test perplexity: 37.50 |
| language-modelling-on-wikitext-103 | GPT-2 Medium | Number of params: 355M, Test perplexity: 26.37 |
| language-modelling-on-wikitext-103 | GPT-2 Large | Number of params: 774M, Test perplexity: 22.05 |
| language-modelling-on-wikitext-103 | GPT-2 Full | Number of params: 1542M, Test perplexity: 17.48 |
| language-modelling-on-wikitext-2 | GPT-2 (small) | Number of params: 117M, Test perplexity: 29.41 |
| language-modelling-on-wikitext-2 | GPT-2 (medium) | Number of params: 345M, Test perplexity: 22.76 |
| language-modelling-on-wikitext-2 | GPT-2 (large) | Number of params: 762M, Test perplexity: 19.93 |
| language-modelling-on-wikitext-2 | GPT-2 | Number of params: 1542M, Test perplexity: 18.34 |
| question-answering-on-fever | Zero-shot | EM: 50 |
| question-answering-on-webquestions | Zero-shot | EM: 43 |
| response-generation-on-simmc2-0 | GPT-2 | BLEU: 19.2 |
| sentiment-analysis-on-imdb | GPT-2 Finetuned | Accuracy: 92.36 |
| text-generation-on-openwebtext | GPT2-124M | eval_loss: 3.12 |
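The language-modelling rows above report zero-shot perplexity or bits per character. As a rough sketch of how such a number is obtained, the snippet below scores a short text with the Hugging Face "gpt2" checkpoint and converts the average negative log-likelihood into perplexity; the published results additionally rely on dataset-specific preprocessing and invertible detokenizers that are not reproduced here, so this is illustrative only.

```python
# Minimal sketch: perplexity of GPT-2 on a toy string (not the benchmark protocol).
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

text = "Language models begin to learn many tasks without explicit supervision."
enc = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model return the mean token cross-entropy as `loss`.
    out = model(enc.input_ids, labels=enc.input_ids)

nll = out.loss.item()                 # average negative log-likelihood per token (nats)
print(f"perplexity: {math.exp(nll):.2f}")
```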