Language Models are Few-Shot Learners
Tom B. Brown; Benjamin Mann; Nick Ryder; Melanie Subbiah; Jared Kaplan; Prafulla Dhariwal; Arvind Neelakantan; Pranav Shyam; Girish Sastry; Amanda Askell; Sandhini Agarwal; Ariel Herbert-Voss; Gretchen Krueger; Tom Henighan; Rewon Child; Aditya Ramesh; Daniel M. Ziegler; Jeffrey Wu; Clemens Winter; Christopher Hesse; Mark Chen; Eric Sigler; Mateusz Litwin; Scott Gray; Benjamin Chess; Jack Clark; Christopher Berner; Sam McCandlish; Alec Radford; Ilya Sutskever; Dario Amodei

Abstract
Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.
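The few-shot setting described above is in-context learning: K solved demonstrations and one unsolved query are concatenated into a single text prompt, and the model simply continues the text, with no gradient updates. The following is a minimal sketch of how such a prompt might be assembled; `build_few_shot_prompt`, the demonstrations, and the task description are illustrative placeholders, not the paper's actual evaluation harness.

```python
# Minimal sketch of few-shot ("in-context") prompting as described in the
# abstract: K demonstrations plus one unsolved query are packed into a single
# text prompt, and the model is asked only to continue the text. No weights
# are updated. Everything here is illustrative, not the paper's eval code.

def build_few_shot_prompt(demonstrations, query, task_description=""):
    """Concatenate a task description, K solved examples, and the query."""
    parts = [task_description] if task_description else []
    for question, answer in demonstrations:
        parts.append(f"Q: {question}\nA: {answer}")
    # The final query is left unanswered; the model generates what follows "A:".
    parts.append(f"Q: {query}\nA:")
    return "\n\n".join(parts)

# Example: 3-digit addition, one of the on-the-fly tasks mentioned above.
demos = [
    ("What is 248 plus 371?", "619"),
    ("What is 512 plus 134?", "646"),
    ("What is 609 plus 250?", "859"),
]
prompt = build_few_shot_prompt(
    demos,
    "What is 327 plus 445?",
    task_description="Answer the arithmetic question.",
)
print(prompt)
# An autoregressive LM would then generate the continuation after the final "A:".
```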
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| answerability-prediction-on-peerqa | GPT-3.5-Turbo-0613-16k | Macro F1: 0.3304 |
| common-sense-reasoning-on-arc-challenge | GPT-3 175B (zero-shot) | Accuracy: 51.4 |
| common-sense-reasoning-on-arc-challenge | GPT-3 175B (one-shot) | Accuracy: 53.2 |
| common-sense-reasoning-on-arc-easy | GPT-3 175B (one-shot) | Accuracy: 71.2 |
| common-sense-reasoning-on-arc-easy | GPT-3 175B (zero-shot) | Accuracy: 68.8 |
| common-sense-reasoning-on-record | GPT-3 Large 760M (zero-shot) | EM: 82.1 |
| common-sense-reasoning-on-winogrande | GPT-3 Large 760M (zero-shot) | Accuracy: 57.4 |
| common-sense-reasoning-on-winogrande | GPT-3 175B (zero-shot) | Accuracy: 70.2 |
| coreference-resolution-on-winograd-schema | GPT-3 175B (few-shot) | Accuracy: 80.1 |
| few-shot-learning-on-medconceptsqa | gpt-3.5-turbo | Accuracy: 41.476 |
| language-modelling-on-lambada | GPT-3 175B (few-shot) | Accuracy: 86.4, Perplexity: 1.92 |
| language-modelling-on-lambada | GPT-3 13B (zero-shot) | Accuracy: 72.5, Perplexity: 3.56 |
| language-modelling-on-lambada | GPT-3 2.7B (zero-shot) | Accuracy: 67.1, Perplexity: 4.60 |
| language-modelling-on-lambada | GPT-3 6.7B (zero-shot) | Accuracy: 70.3, Perplexity: 4.00 |
| language-modelling-on-lambada | GPT-3 175B (zero-shot) | Accuracy: 76.2, Perplexity: 3.00 |
| language-modelling-on-penn-treebank-word | GPT-3 175B (zero-shot) | Test perplexity: 20.5 |
| multi-task-language-understanding-on-mmlu | GPT-3 175B (5-shot) | Average (%): 43.9 |
| natural-language-inference-on-anli-test | GPT-3 175B | A1: 36.8, A2: 34, A3: 40.2 |
| natural-language-inference-on-commitmentbank | GPT-3 175B (few-shot) | Accuracy: 75.6 |
| natural-language-inference-on-commitmentbank | GPT-3 175B (few-shot, k=32) | F1: 52 |
| natural-language-inference-on-rte | GPT-3 175B (few-shot, k=32) | Accuracy: 69 |
| question-answering-on-boolq | GPT-3 175B (few-shot, k=32) | Accuracy: 76.4 |
| question-answering-on-boolq | GPT-3 175B (zero-shot) | Accuracy: 60.5 |
| question-answering-on-copa | GPT-3 175B (few-shot, k=32) | Accuracy: 92 |
| question-answering-on-copa | GPT-3 Large 760M (zero-shot) | Accuracy: 73.0 |
| question-answering-on-copa | GPT-3 13B (few-shot, k=32) | Accuracy: 86 |
| question-answering-on-copa | GPT-3 175B (zero-shot) | Accuracy: 91 |
| question-answering-on-copa | GPT-3 175B (one-shot) | Accuracy: 87 |
| question-answering-on-coqa | GPT-3 175B (few-shot, k=32) | Overall: 85 |
| question-answering-on-drop-test | GPT-3 175B (few-shot, k=32) | F1: 36.5 |
| question-answering-on-multirc | GPT-3 175B (few-shot) | F1: 75.4 |
| question-answering-on-natural-questions | GPT-3 175B (few-shot, k=64) | EM: 29.9 |
| question-answering-on-obqa | GPT-3 175B (zero-shot) | Accuracy: 57.6 |
| question-answering-on-openbookqa | GPT-3 175B (few-shot, k=32) | Accuracy: 65.4 |
| question-answering-on-peerqa | GPT-3.5-Turbo-0613-16k | AlignScore: 0.1378, Prometheus-2 Answer Correctness: 3.0408, ROUGE-L: 0.2414 |
| question-answering-on-piqa | GPT-3 175B (zero-shot) | Accuracy: 81.0 |
| question-answering-on-piqa | GPT-3 Large 760M (zero-shot) | Accuracy: 72.9 |
| question-answering-on-quac | GPT-3 175B (few-shot, k=32) | F1: 44.3 |
| question-answering-on-race | GPT-3 175B (few-shot, k=32) | RACE-m: 58.1 |
| question-answering-on-race | GPT-3 175B (few-shot) | RACE-h: 46.8 |
| question-answering-on-story-cloze | GPT-3 175B (few-shot) | Accuracy: 87.7 |
| question-answering-on-storycloze | GPT-3 Large 760M (zero-shot) | Accuracy: 72.4 |
| question-answering-on-triviaqa | GPT-3 175B (few-shot) | EM: 71.2 |
| question-answering-on-webquestions | GPT-3 175B (few-shot) | EM: 41.5 |
| question-answering-on-webquestions | GPT-3 175B (zero-shot) | EM: 14.4 |
| question-answering-on-webquestions | GPT-3 175B (one-shot) | EM: 25.3 |
| question-answering-on-webquestions | Few-shot | EM: 44.7 |
| reading-comprehension-on-race | GPT-3 175B (zero-shot) | Accuracy (High): 45.5 |
| reading-comprehension-on-race | GPT-3 175B (zero-shot) | Accuracy (Middle): 58.4 |
| unsupervised-machine-translation-on-wmt2014-1 | GPT-3 175B (few-shot) | BLEU: 39.2 |
| unsupervised-machine-translation-on-wmt2014-2 | GPT-3 175B (few-shot) | BLEU: 32.6 |
| unsupervised-machine-translation-on-wmt2016 | GPT-3 175B (few-shot) | BLEU: 29.7 |
| unsupervised-machine-translation-on-wmt2016-1 | GPT-3 175B (few-shot) | BLEU: 40.6 |
| unsupervised-machine-translation-on-wmt2016-2 | GPT-3 175B (few-shot) | BLEU: 21.0 |
| unsupervised-machine-translation-on-wmt2016-3 | GPT-3 175B (few-shot) | BLEU: 39.5 |
| word-sense-disambiguation-on-words-in-context | GPT-3 175B (few-shot, k=32) | Accuracy: 49.4 |
| zero-shot-learning-on-medconceptsqa | gpt-3.5-turbo | Accuracy: 37.058 |
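For reference, the EM and F1 figures in the QA rows above are typically SQuAD-style: exact match after light answer normalization, and token-overlap F1. The perplexity figures are the exponential of the mean per-token negative log-likelihood. Below is a minimal sketch of these metrics, assuming whitespace tokenization and standard article/punctuation stripping; it is illustrative, not the official scoring script of any particular leaderboard, whose normalization details vary.

```python
# Sketch of SQuAD-style EM and token-level F1 as used by many QA benchmarks
# above (TriviaQA, Natural Questions, WebQuestions, DROP, ...), plus
# perplexity as reported for LAMBADA and Penn Treebank. Illustrative only;
# each leaderboard's official script may normalize differently.
import math
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, reference):
    """EM: 1.0 iff the normalized strings are identical."""
    return float(normalize(prediction) == normalize(reference))

def f1_score(prediction, reference):
    """Token-overlap F1 between normalized prediction and reference."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def perplexity(token_log_probs):
    """Perplexity: exp of the mean negative log-likelihood per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

print(exact_match("the Eiffel Tower", "Eiffel Tower"))            # 1.0
print(round(f1_score("Eiffel Tower in Paris", "Eiffel Tower"), 3))  # 0.667
```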