Language Models are Few-Shot Learners
Tom B. Brown; Benjamin Mann; Nick Ryder; Melanie Subbiah; Jared Kaplan; Prafulla Dhariwal; Arvind Neelakantan; Pranav Shyam; Girish Sastry; Amanda Askell; Sandhini Agarwal; Ariel Herbert-Voss; Gretchen Krueger; Tom Henighan; Rewon Child; Aditya Ramesh; Daniel M. Ziegler; Jeffrey Wu; Clemens Winter; Christopher Hesse; Mark Chen; Eric Sigler; Mateusz Litwin; Scott Gray; Benjamin Chess; Jack Clark; Christopher Berner; Sam McCandlish; Alec Radford; Ilya Sutskever; Dario Amodei

Abstract
Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.
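The few-shot setting described above is in-context learning: K solved demonstrations and one unsolved query are concatenated into a single text prompt, and the model simply continues the text, with no gradient updates. The following is a minimal sketch of how such a prompt might be assembled; `build_few_shot_prompt`, the demonstrations, and the task description are illustrative placeholders, not the paper's actual evaluation harness.

```python
# Minimal sketch of few-shot ("in-context") prompting as described in the
# abstract: K demonstrations plus one unsolved query are packed into a single
# text prompt, and the model is asked only to continue the text. No weights
# are updated. Everything here is illustrative, not the paper's eval code.

def build_few_shot_prompt(demonstrations, query, task_description=""):
    """Concatenate a task description, K solved examples, and the query."""
    parts = [task_description] if task_description else []
    for question, answer in demonstrations:
        parts.append(f"Q: {question}\nA: {answer}")
    # The final query is left unanswered; the model generates what follows "A:".
    parts.append(f"Q: {query}\nA:")
    return "\n\n".join(parts)

# Example: 3-digit addition, one of the on-the-fly tasks mentioned above.
demos = [
    ("What is 248 plus 371?", "619"),
    ("What is 512 plus 134?", "646"),
    ("What is 609 plus 250?", "859"),
]
prompt = build_few_shot_prompt(
    demos,
    "What is 327 plus 445?",
    task_description="Answer the arithmetic question.",
)
print(prompt)
# An autoregressive LM would then generate the continuation after the final "A:".
```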
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| answerability-prediction-on-peerqa | GPT-3.5-Turbo-0613-16k | Macro F1: 0.3304 |
| common-sense-reasoning-on-arc-challenge | GPT-3 175B (zero-shot) | Accuracy: 51.4 |
| common-sense-reasoning-on-arc-challenge | GPT-3 175B (one-shot) | Accuracy: 53.2 |
| common-sense-reasoning-on-arc-easy | GPT-3 175B (one-shot) | Accuracy: 71.2 |
| common-sense-reasoning-on-arc-easy | GPT-3 175B (zero-shot) | Accuracy: 68.8 |
| common-sense-reasoning-on-record | GPT-3 Large 760M (zero-shot) | EM: 82.1 |
| common-sense-reasoning-on-winogrande | GPT-3 Large 760M (zero-shot) | Accuracy: 57.4 |
| common-sense-reasoning-on-winogrande | GPT-3 175B (zero-shot) | Accuracy: 70.2 |
| coreference-resolution-on-winograd-schema | GPT-3 175B (few-shot) | Accuracy: 80.1 |
| few-shot-learning-on-medconceptsqa | gpt-3.5-turbo | Accuracy: 41.476 |
| language-modelling-on-lambada | GPT-3 175B (few-shot) | Accuracy: 86.4, Perplexity: 1.92 |
| language-modelling-on-lambada | GPT-3 13B (zero-shot) | Accuracy: 72.5, Perplexity: 3.56 |
| language-modelling-on-lambada | GPT-3 2.7B (zero-shot) | Accuracy: 67.1, Perplexity: 4.60 |
| language-modelling-on-lambada | GPT-3 6.7B (zero-shot) | Accuracy: 70.3, Perplexity: 4.00 |
| language-modelling-on-lambada | GPT-3 175B (zero-shot) | Accuracy: 76.2, Perplexity: 3.00 |
| language-modelling-on-penn-treebank-word | GPT-3 175B (zero-shot) | Test perplexity: 20.5 |
| multi-task-language-understanding-on-mmlu | GPT-3 175B (5-shot) | Average (%): 43.9 |
| natural-language-inference-on-anli-test | GPT-3 175B | A1: 36.8, A2: 34, A3: 40.2 |
| natural-language-inference-on-commitmentbank | GPT-3 175B (few-shot) | Accuracy: 75.6 |
| natural-language-inference-on-commitmentbank | GPT-3 175B (few-shot, k=32) | F1: 52 |
| natural-language-inference-on-rte | GPT-3 175B (few-shot, k=32) | Accuracy: 69 |
| question-answering-on-boolq | GPT-3 175B (few-shot, k=32) | Accuracy: 76.4 |
| question-answering-on-boolq | GPT-3 175B (zero-shot) | Accuracy: 60.5 |
| question-answering-on-copa | GPT-3 175B (few-shot, k=32) | Accuracy: 92 |
| question-answering-on-copa | GPT-3 Large 760M (zero-shot) | Accuracy: 73.0 |
| question-answering-on-copa | GPT-3 13B (few-shot, k=32) | Accuracy: 86 |
| question-answering-on-copa | GPT-3 175B (zero-shot) | Accuracy: 91 |
| question-answering-on-copa | GPT-3 175B (one-shot) | Accuracy: 87 |
| question-answering-on-coqa | GPT-3 175B (few-shot, k=32) | Overall: 85 |
| question-answering-on-drop-test | GPT-3 175B (few-shot, k=32) | F1: 36.5 |
| question-answering-on-multirc | GPT-3 175B (few-shot) | F1: 75.4 |
| question-answering-on-natural-questions | GPT-3 175B (few-shot, k=64) | EM: 29.9 |
| question-answering-on-obqa | GPT-3 175B (zero-shot) | Accuracy: 57.6 |
| question-answering-on-openbookqa | GPT-3 175B (few-shot, k=32) | Accuracy: 65.4 |
| question-answering-on-peerqa | GPT-3.5-Turbo-0613-16k | AlignScore: 0.1378, Prometheus-2 Answer Correctness: 3.0408, ROUGE-L: 0.2414 |
| question-answering-on-piqa | GPT-3 175B (zero-shot) | Accuracy: 81.0 |
| question-answering-on-piqa | GPT-3 Large 760M (zero-shot) | Accuracy: 72.9 |
| question-answering-on-quac | GPT-3 175B (few-shot, k=32) | F1: 44.3 |
| question-answering-on-race | GPT-3 175B (few-shot, k=32) | RACE-m: 58.1 |
| question-answering-on-race | GPT-3 175B (few-shot) | RACE-h: 46.8 |
| question-answering-on-story-cloze | GPT-3 175B (few-shot) | Accuracy: 87.7 |
| question-answering-on-storycloze | GPT-3 Large 760M (zero-shot) | Accuracy: 72.4 |
| question-answering-on-triviaqa | GPT-3 175B (few-shot) | EM: 71.2 |
| question-answering-on-webquestions | GPT-3 175B (few-shot) | EM: 41.5 |
| question-answering-on-webquestions | GPT-3 175B (zero-shot) | EM: 14.4 |
| question-answering-on-webquestions | GPT-3 175B (one-shot) | EM: 25.3 |
| question-answering-on-webquestions | Few-shot | EM: 44.7 |
| reading-comprehension-on-race | GPT-3 175B (zero-shot) | Accuracy (High): 45.5 |
| reading-comprehension-on-race | GPT-3 175B (zero-shot) | Accuracy (Middle): 58.4 |
| unsupervised-machine-translation-on-wmt2014-1 | GPT-3 175B (few-shot) | BLEU: 39.2 |
| unsupervised-machine-translation-on-wmt2014-2 | GPT-3 175B (few-shot) | BLEU: 32.6 |
| unsupervised-machine-translation-on-wmt2016 | GPT-3 175B (few-shot) | BLEU: 29.7 |
| unsupervised-machine-translation-on-wmt2016-1 | GPT-3 175B (few-shot) | BLEU: 40.6 |
| unsupervised-machine-translation-on-wmt2016-2 | GPT-3 175B (few-shot) | BLEU: 21.0 |
| unsupervised-machine-translation-on-wmt2016-3 | GPT-3 175B (few-shot) | BLEU: 39.5 |
| word-sense-disambiguation-on-words-in-context | GPT-3 175B (few-shot, k=32) | Accuracy: 49.4 |
| zero-shot-learning-on-medconceptsqa | gpt-3.5-turbo | Accuracy: 37.058 |
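For reference, the EM and F1 figures in the QA rows above are typically SQuAD-style: exact match after light answer normalization, and token-overlap F1. The perplexity figures are the exponential of the mean per-token negative log-likelihood. Below is a minimal sketch of these metrics, assuming whitespace tokenization and standard article/punctuation stripping; it is illustrative, not the official scoring script of any particular leaderboard, whose normalization details vary.

```python
# Sketch of SQuAD-style EM and token-level F1 as used by many QA benchmarks
# above (TriviaQA, Natural Questions, WebQuestions, DROP, ...), plus
# perplexity as reported for LAMBADA and Penn Treebank. Illustrative only;
# each leaderboard's official script may normalize differently.
import math
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, reference):
    """EM: 1.0 iff the normalized strings are identical."""
    return float(normalize(prediction) == normalize(reference))

def f1_score(prediction, reference):
    """Token-overlap F1 between normalized prediction and reference."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def perplexity(token_log_probs):
    """Perplexity: exp of the mean negative log-likelihood per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

print(exact_match("the Eiffel Tower", "Eiffel Tower"))            # 1.0
print(round(f1_score("Eiffel Tower in Paris", "Eiffel Tower"), 3))  # 0.667
```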