
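Abstract
Recent work has demonstrated substantial gains on many natural language processing (NLP) tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions, something that current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles that human evaluators have difficulty distinguishing from articles written by humans. We discuss the broader societal impacts of this finding and of GPT-3 in general.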
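
The abstract says tasks and demonstrations are specified purely as text, with no gradient updates. A minimal sketch of what such a prompt could look like is given below. The function name `build_few_shot_prompt`, the `Q:`/`A:` layout, and the example arithmetic questions are illustrative assumptions, not the prompt format used in the paper.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    task_description: str,
    demonstrations: List[Tuple[str, str]],
    query: str,
) -> str:
    """Assemble a plain-text prompt: instruction, K solved examples, then the open query.

    The "Q:"/"A:" layout is a hypothetical convention chosen for illustration.
    """
    lines = [task_description, ""]
    for source, target in demonstrations:
        lines.append(f"Q: {source}")
        lines.append(f"A: {target}")
        lines.append("")
    # The final answer slot is left empty for the language model to complete.
    lines.append(f"Q: {query}")
    lines.append("A:")
    return "\n".join(lines)


# Two demonstrations (K = 2) for an addition task, followed by a new query.
demos = [("What is 123 plus 456?", "579"), ("What is 250 plus 125?", "375")]
prompt = build_few_shot_prompt("Add the two numbers.", demos, "What is 314 plus 271?")
print(prompt)
```

Passing an empty demonstration list yields a zero-shot prompt, and a single pair yields a one-shot prompt, so the same helper covers the three settings that appear in the benchmark results.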
Code repositories
| Repository | Framework | Notes |
|---|---|---|
| Samyu0304/thought-propagation | - | Mentioned in GitHub |
| ai21labs/lm-evaluation | tf | Mentioned in GitHub |
| juletx/lm-evaluation-harness | pytorch | Mentioned in GitHub |
| um-arm-lab/efficient-eng-2-ltl | pytorch | Mentioned in GitHub |
| abhaskumarsinha/Corpus2GPT | pytorch | - |
| haiyang-w/git | pytorch | Mentioned in GitHub |
| neuralmagic/lm-evaluation-harness | pytorch | Mentioned in GitHub |
| ltruncel/Microsoft_Azure_50daysofudacity | tf | Mentioned in GitHub |
| shreyashankar/gpt3-sandbox | - | Mentioned in GitHub |
| EightRice/atn_GPT-3 | tf | Mentioned in GitHub |
| EleutherAI/gpt-neo | tf | Mentioned in GitHub |
| fywalter/label-bias | pytorch | Mentioned in GitHub |
| hazyresearch/ama_prompting | - | Mentioned in GitHub |
| hojjat-mokhtarabadi/promptsource | - | Mentioned in GitHub |
| openai/gpt-3 | - | Official; mentioned in GitHub |
| RUCAIBox/LLMBox | - | Mentioned in GitHub |
| allenai/macaw | pytorch | Mentioned in GitHub |
| crazydigger/Callibration-of-GPT | pytorch | Mentioned in GitHub |
| smile-data/smile | pytorch | Mentioned in GitHub |
| karpathy/build-nanogpt | pytorch | Mentioned in GitHub |
| volcengine/vegiantmodel | pytorch | Mentioned in GitHub |
| asahi417/relbert | - | Mentioned in GitHub |
| openbiolink/promptsource | - | Mentioned in GitHub |
| facebookresearch/anli | pytorch | Mentioned in GitHub |
| ramanakshay/nanogpt | pytorch | Mentioned in GitHub |
| vilm-ai/viet-llm-eval | jax | Mentioned in GitHub |
| lambert-x/prolab | pytorch | Mentioned in GitHub |
| NVIDIA/NeMo-Curator | - | Mentioned in GitHub |
| scrayish/ML_NLP | pytorch | Mentioned in GitHub |
| EleutherAI/lm_evaluation_harness | jax | Mentioned in GitHub |
| smarton-empower/smarton-ai | - | Mentioned in GitHub |
| ncoop57/gpt-code-clippy | jax | Mentioned in GitHub |
| insait-institute/lm-evaluation-harness-bg | jax | Mentioned in GitHub |
| kyegomez/GPT3 | pytorch | - |
| VachanVY/gpt.jax | jax | Mentioned in GitHub |
| nlx-group/overlapy | - | Mentioned in GitHub |
| mbzuai-paris/lm-evaluation-harness-atlas-chat | pytorch | Mentioned in GitHub |
| ggml-org/llama.cpp | pytorch | Mentioned in GitHub |
| ggerganov/llama.cpp | pytorch | Mentioned in GitHub |
| bigscience-workshop/promptsource | - | Mentioned in GitHub |
| sambanova/lm-evaluation-harness | jax | Mentioned in GitHub |
| codedotal/gpt-code-clippy | jax | Mentioned in GitHub |
| grantslatton/llama.cpp | - | Mentioned in GitHub |
| postech-ami/smile-dataset | pytorch | Mentioned in GitHub |
| Sypherd/lm-evaluation-harness | pytorch | Mentioned in GitHub |
| x-lance/neusym-rag | - | Mentioned in GitHub |
| hilberthit/gpt-3 | - | Mentioned in GitHub |
| tonyzhaozh/few-shot-learning | pytorch | Mentioned in GitHub |
| gmum/dl-mo-2021 | - | Mentioned in GitHub |
| zphang/lm_evaluation_harness | - | Mentioned in GitHub |
| contextlab/abstract2paper | - | Mentioned in GitHub |
| turkunlp/megatron-deepspeed | pytorch | Mentioned in GitHub |
| Mind23-2/MindCode-138 | mindspore | - |
| roberttwomey/machine-imagination-isea | - | Mentioned in GitHub |
| karpathy/llm.c | pytorch | Mentioned in GitHub |
| ethanjperez/true_few_shot | pytorch | Mentioned in GitHub |
| longhao-chen/aicas2024 | pytorch | Mentioned in GitHub |
| EleutherAI/lm-evaluation-harness | jax | Mentioned in GitHub |
| opengptx/lm-evaluation-harness | pytorch | Mentioned in GitHub |
| bigscience-workshop/Megatron-DeepSpeed | pytorch | Mentioned in GitHub |
| asahi417/lmppl | - | Mentioned in GitHub |
Benchmarks
| Benchmark | Method | Metric |
|---|---|---|
| answerability-prediction-on-peerqa | GPT-3.5-Turbo-0613-16k | Macro F1: 0.3304 |
| common-sense-reasoning-on-arc-challenge | GPT-3 175B (0-shot) | Accuracy: 51.4 |
| common-sense-reasoning-on-arc-challenge | GPT-3 175B (1 shot) | Accuracy: 53.2 |
| common-sense-reasoning-on-arc-easy | GPT-3 175B (1 shot) | Accuracy: 71.2 |
| common-sense-reasoning-on-arc-easy | GPT-3 175B (0-shot) | Accuracy: 68.8 |
| common-sense-reasoning-on-record | GPT-3 Large 760M (0-shot) | EM: 82.1 |
| common-sense-reasoning-on-winogrande | GPT-3 Large 760M (0-shot) | Accuracy: 57.4 |
| common-sense-reasoning-on-winogrande | GPT-3 175B (0-shot) | Accuracy: 70.2 |
| coreference-resolution-on-winograd-schema | GPT-3 175B (few-shot) | Accuracy: 80.1 |
| few-shot-learning-on-medconceptsqa | gpt-3.5-turbo | Accuracy: 41.476 |
| language-modelling-on-lambada | GPT-3 175B (Few-Shot) | Accuracy: 86.4; Perplexity: 1.92 |
| language-modelling-on-lambada | GPT-3 13B (Zero-Shot) | Accuracy: 72.5; Perplexity: 3.56 |
| language-modelling-on-lambada | GPT-3 2.7B (Zero-Shot) | Accuracy: 67.1; Perplexity: 4.60 |
| language-modelling-on-lambada | GPT-3 6.7B (Zero-Shot) | Accuracy: 70.3; Perplexity: 4.00 |
| language-modelling-on-lambada | GPT-3 175B (Zero-Shot) | Accuracy: 76.2; Perplexity: 3.00 |
| language-modelling-on-penn-treebank-word | GPT-3 (Zero-Shot) | Params: 175000M; Test perplexity: 20.5 |
| multi-task-language-understanding-on-mmlu | GPT-3 175B (5-shot) | Average (%): 43.9 |
| natural-language-inference-on-anli-test | GPT-3 | A1: 36.8; A2: 34; A3: 40.2 |
| natural-language-inference-on-commitmentbank | GPT-3 175B (Few-Shot) | Accuracy: 75.6 |
| natural-language-inference-on-commitmentbank | GPT-3 175B (few-shot, k=32) | F1: 52 |
| natural-language-inference-on-rte | GPT-3 175B (few-shot, k=32) | Accuracy: 69 |
| question-answering-on-boolq | GPT-3 175B (few-shot, k=32) | Accuracy: 76.4 |
| question-answering-on-boolq | GPT-3 175B (0-shot) | Accuracy: 60.5 |
| question-answering-on-copa | GPT-3 175B (few-shot, k=32) | Accuracy: 92 |
| question-answering-on-copa | GPT-3 Large 760M (0-shot) | Accuracy: 73.0 |
| question-answering-on-copa | GPT-3 13B (few-shot, k=32) | Accuracy: 86 |
| question-answering-on-copa | GPT-3 175B (0-shot) | Accuracy: 91 |
| question-answering-on-copa | GPT-3 175B (1-shot) | Accuracy: 87 |
| question-answering-on-coqa | GPT-3 175B (few-shot, k=32) | Overall: 85 |
| question-answering-on-drop-test | GPT-3 175B (few-shot, k=32) | F1: 36.5 |
| question-answering-on-multirc | GPT-3 175B (Few-Shot) | F1: 75.4 |
| question-answering-on-natural-questions | GPT-3 175B (Few-Shot, k=64) | EM: 29.9 |
| question-answering-on-obqa | GPT-3 175B (zero-shot) | Accuracy: 57.6 |
| question-answering-on-openbookqa | GPT-3 175B (few-shot, k=32) | Accuracy: 65.4 |
| question-answering-on-peerqa | GPT-3.5-Turbo-0613-16k | AlignScore: 0.1378; Prometheus-2 Answer Correctness: 3.0408; Rouge-L: 0.2414 |
| question-answering-on-piqa | GPT-3 175B (0-shot) | Accuracy: 81.0 |
| question-answering-on-piqa | GPT-3 Large 760M (0-shot) | Accuracy: 72.9 |
| question-answering-on-quac | GPT-3 175B (few-shot, k=32) | F1: 44.3 |
| question-answering-on-race | GPT-3 175B (few-shot, k=32) | RACE-m: 58.1 |
| question-answering-on-race | GPT-3 175B (Few-Shot) | RACE-h: 46.8 |
| question-answering-on-story-cloze | GPT-3 175B (Few-Shot) | Accuracy: 87.7 |
| question-answering-on-storycloze | GPT-3 Large 760M (zero-shot) | Accuracy: 72.4 |
| question-answering-on-triviaqa | GPT-3 175B (Few-Shot) | EM: 71.2 |
| question-answering-on-webquestions | GPT-3-175B (Few-Shot) | EM: 41.5 |
| question-answering-on-webquestions | GPT-3-175B (Zero-Shot) | EM: 14.4 |
| question-answering-on-webquestions | GPT-3-175B (One-Shot) | EM: 25.3 |
| question-answering-on-webquestions | Few-shot | EM: 44.7 |
| reading-comprehension-on-race | GPT-3 175B (zero-shot) | Accuracy (High): 45.5 |
| reading-comprehension-on-race | GPT-3 175B (0-shot) | Accuracy (Middle): 58.4 |
| unsupervised-machine-translation-on-wmt2014-1 | GPT-3 175B (Few-Shot) | BLEU: 39.2 |
| unsupervised-machine-translation-on-wmt2014-2 | GPT-3 175B (Few-Shot) | BLEU: 32.6 |
| unsupervised-machine-translation-on-wmt2016 | GPT-3 175B (Few-Shot) | BLEU: 29.7 |
| unsupervised-machine-translation-on-wmt2016-1 | GPT-3 175B (Few-Shot) | BLEU: 40.6 |
| unsupervised-machine-translation-on-wmt2016-2 | GPT-3 175B (Few-Shot) | BLEU: 21 |
| unsupervised-machine-translation-on-wmt2016-3 | GPT-3 175B (Few-Shot) | BLEU: 39.5 |
| word-sense-disambiguation-on-words-in-context | GPT-3 175B (few-shot, k=32) | Accuracy: 49.4 |
| zero-shot-learning-on-medconceptsqa | gpt-3.5-turbo | Accuracy: 37.058 |
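
Two metrics that recur in the table, exact match (EM) and perplexity, can be sketched as follows. This is an illustrative sketch under stated assumptions: the normalization (lowercasing, removing punctuation and English articles) follows a common SQuAD-style convention and may differ from what each benchmark actually uses, and the helper names `normalize_answer`, `exact_match`, and `perplexity` are hypothetical. It does not reproduce any number in the table.

```python
import math
import re
import string
from typing import List


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumed SQuAD-style normalization; individual benchmarks may differ.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, references: List[str]) -> float:
    """Return 1.0 if the normalized prediction equals any normalized reference."""
    pred = normalize_answer(prediction)
    return float(any(pred == normalize_answer(ref) for ref in references))


def perplexity(token_log_probs: List[float]) -> float:
    """Exponential of the negative mean natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# A prediction that differs from the reference only in case, article, and punctuation.
em_score = exact_match("The Eiffel Tower.", ["eiffel tower"])

# Four tokens, each assigned probability 0.5 by a hypothetical model.
ppl = perplexity([math.log(0.5)] * 4)
```

A dataset-level EM score is then the mean of the per-example scores, usually reported as a percentage. Lower perplexity means the model assigned higher probability to the reference text.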