Abstract
Natural language processing tasks such as question answering, machine translation, reading comprehension, and summarization are typically approached with supervised learning on task-specific datasets. We demonstrate that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText. When conditioned on a document plus questions, the answers generated by the language model reach an F1 of 55 on the CoQA dataset, matching or exceeding the performance of 3 out of 4 baseline systems without using the 127,000+ training examples. The capacity of the language model is essential to the success of zero-shot task transfer, and increasing it improves performance in a log-linear fashion across tasks. Our largest model, GPT-2, is a 1.5B-parameter Transformer that achieves state-of-the-art results on 7 of the 8 tested language modeling datasets in a zero-shot setting, but still underfits WebText. Samples from the model reflect these improvements and contain coherent paragraphs of text. These findings suggest a promising path toward building language processing systems that learn to perform tasks from their naturally occurring demonstrations.
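For readers who want to try the zero-shot conditioning described above, the sketch below loads the publicly released GPT-2 weights through the Hugging Face `transformers` library (not the paper's original codebase). The document-plus-question prompt ending in `A:` and the greedy decoding follow the paper's description of the CoQA setup; the concrete example text, the `gpt2` (124M) checkpoint, and the generation length are illustrative assumptions, not the exact evaluation protocol.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Smallest public checkpoint; the paper's CoQA result uses the full 1.5B model.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# Illustrative document and question (not from CoQA); the "A:" suffix asks the
# model to continue with an answer, as in the paper's zero-shot QA setup.
document = "The Eiffel Tower was completed in 1889 and stands in Paris, France."
question = "Q: When was the Eiffel Tower completed?"
prompt = f"{document}\n{question}\nA:"

inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=20,          # short free-form answer; length is an assumption
    do_sample=False,            # greedy decoding, as described for CoQA
    pad_token_id=tokenizer.eos_token_id,
)
# Keep only the newly generated tokens after the prompt.
answer = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)
print(answer.strip())
```

In practice one would truncate the continuation at the first newline, since the model keeps generating text beyond the answer span.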
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| coreference-resolution-on-winograd-schema | GPT-2-XL 1.5B | Accuracy: 70.7 |
| dialogue-state-tracking-on-simmc2-0 | GPT-2 | Act F1: 94.5, Slot F1: 81.7 |
| document-summarization-on-cnn-daily-mail | GPT-2 | ROUGE-1: 29.34, ROUGE-2: 8.27, ROUGE-L: 26.58 |
| language-modelling-on-enwiki8 | GPT-2 (48 layers, h=1600) | Bit per Character (BPC): 0.93, Number of params: 1542M |
| language-modelling-on-lambada | GPT-2 1.5B (Zero Shot) | Accuracy: 63.24, Perplexity: 8.63 |
| language-modelling-on-one-billion-word | GPT-2 | Number of params: 1.54B, PPL: 42.16 |
| language-modelling-on-penn-treebank-word | GPT-2 | Number of params: 1542M, Test perplexity: 35.76 |
| language-modelling-on-text8 | GPT-2 | Bit per Character (BPC): 0.98, Number of params: 1542M |
| language-modelling-on-wikitext-103 | GPT-2 Large | Number of params: 774M, Test perplexity: 22.05 |
| language-modelling-on-wikitext-103 | GPT-2 Small | Number of params: 124M, Test perplexity: 37.50 |
| language-modelling-on-wikitext-103 | GPT-2 Full | Number of params: 1542M, Test perplexity: 17.48 |
| language-modelling-on-wikitext-103 | GPT-2 Medium | Number of params: 355M, Test perplexity: 26.37 |
| language-modelling-on-wikitext-2 | GPT-2 (medium) | Number of params: 345M, Test perplexity: 22.76 |
| language-modelling-on-wikitext-2 | GPT-2 (large) | Number of params: 762M, Test perplexity: 19.93 |
| language-modelling-on-wikitext-2 | GPT-2 | Number of params: 1542M, Test perplexity: 18.34 |
| language-modelling-on-wikitext-2 | GPT-2 (small) | Number of params: 117M, Test perplexity: 29.41 |
| question-answering-on-fever | Zero-shot | EM: 50 |
| question-answering-on-webquestions | Zero-shot | EM: 43 |
| response-generation-on-simmc2-0 | GPT-2 | BLEU: 19.2 |
| sentiment-analysis-on-imdb | GPT-2 Finetuned | Accuracy: 92.36 |
| text-generation-on-openwebtext | GPT2-124M | eval_loss: 3.12 |
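The table mixes word-level perplexity, bits per character (BPC), and a raw evaluation loss. As a rough guide to how these relate, the sketch below converts a per-token cross-entropy into perplexity and BPC, assuming the loss is reported in nats (the usual PyTorch convention) and that the evaluation set's token and character counts are known; the exact normalization used by each leaderboard may differ.

```python
import math

def perplexity(loss_nats: float) -> float:
    """Perplexity is exp of the average per-token cross-entropy (in nats)."""
    return math.exp(loss_nats)

def bits_per_character(loss_nats: float, num_tokens: int, num_chars: int) -> float:
    """Change log base from nats to bits and renormalize from tokens to characters.
    The token and character counts are dataset-specific and assumed known here."""
    return (loss_nats / math.log(2)) * (num_tokens / num_chars)

# e.g. a per-token loss of 3.12 nats corresponds to a token-level perplexity of ~22.65;
# note this is over BPE tokens and is not directly comparable to word-level perplexity.
print(round(perplexity(3.12), 2))
```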