Language Models are Few-Shot Learners

Abstract

Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions — something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss this finding and its broader societal impacts, as well as the overall impact of GPT-3.
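The abstract's key mechanism — specifying a task "purely via text interaction with the model," with no gradient updates — can be illustrated with a short sketch. This is not code from the paper; the function name and prompt format are illustrative assumptions, using the 3-digit arithmetic task the abstract mentions:

```python
# Minimal sketch of few-shot "in-context learning": the task is conveyed
# entirely as text — a few worked demonstrations followed by a new query —
# and the model simply continues the text. No weights are updated.

def build_few_shot_prompt(instruction, examples, query):
    """Concatenate an instruction, K demonstrations, and the new query."""
    lines = [instruction, ""]
    for question, answer in examples:
        lines.append(f"Q: {question}")
        lines.append(f"A: {answer}")
    lines.append(f"Q: {query}")
    lines.append("A:")  # the model is expected to continue from here
    return "\n".join(lines)

# Example: 3-digit addition, one of the tasks named in the abstract.
prompt = build_few_shot_prompt(
    "Add the two numbers.",
    [("123 + 456", "579"), ("210 + 305", "515")],
    "111 + 222",
)
print(prompt)
```

In the zero-shot setting the `examples` list would be empty (instruction only); one-shot passes a single demonstration; the paper's few-shot results typically use as many demonstrations as fit in the context window (e.g. k=32 in several benchmarks below).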

Code Repositories

ai21labs/lm-evaluation
tf
Mentioned in GitHub
juletx/lm-evaluation-harness
pytorch
Mentioned in GitHub
um-arm-lab/efficient-eng-2-ltl
pytorch
Mentioned in GitHub
haiyang-w/git
pytorch
Mentioned in GitHub
neuralmagic/lm-evaluation-harness
pytorch
Mentioned in GitHub
EightRice/atn_GPT-3
tf
Mentioned in GitHub
EleutherAI/gpt-neo
tf
Mentioned in GitHub
fywalter/label-bias
pytorch
Mentioned in GitHub
openai/gpt-3
Official
Mentioned in GitHub
RUCAIBox/LLMBox
Mentioned in GitHub
allenai/macaw
pytorch
Mentioned in GitHub
crazydigger/Callibration-of-GPT
pytorch
Mentioned in GitHub
smile-data/smile
pytorch
Mentioned in GitHub
karpathy/build-nanogpt
pytorch
Mentioned in GitHub
volcengine/vegiantmodel
pytorch
Mentioned in GitHub
asahi417/relbert
Mentioned in GitHub
openbiolink/promptsource
Mentioned in GitHub
facebookresearch/anli
pytorch
Mentioned in GitHub
ramanakshay/nanogpt
pytorch
Mentioned in GitHub
vilm-ai/viet-llm-eval
jax
Mentioned in GitHub
lambert-x/prolab
pytorch
Mentioned in GitHub
NVIDIA/NeMo-Curator
Mentioned in GitHub
scrayish/ML_NLP
pytorch
Mentioned in GitHub
ncoop57/gpt-code-clippy
jax
Mentioned in GitHub
VachanVY/gpt.jax
jax
Mentioned in GitHub
nlx-group/overlapy
Mentioned in GitHub
ggml-org/llama.cpp
pytorch
Mentioned in GitHub
ggerganov/llama.cpp
pytorch
Mentioned in GitHub
codedotal/gpt-code-clippy
jax
Mentioned in GitHub
grantslatton/llama.cpp
Mentioned in GitHub
postech-ami/smile-dataset
pytorch
Mentioned in GitHub
Sypherd/lm-evaluation-harness
pytorch
Mentioned in GitHub
x-lance/neusym-rag
Mentioned in GitHub
hilberthit/gpt-3
Mentioned in GitHub
tonyzhaozh/few-shot-learning
pytorch
Mentioned in GitHub
gmum/dl-mo-2021
Mentioned in GitHub
turkunlp/megatron-deepspeed
pytorch
Mentioned in GitHub
karpathy/llm.c
pytorch
Mentioned in GitHub
ethanjperez/true_few_shot
pytorch
Mentioned in GitHub
longhao-chen/aicas2024
pytorch
Mentioned in GitHub
opengptx/lm-evaluation-harness
pytorch
Mentioned in GitHub
asahi417/lmppl
Mentioned in GitHub

Benchmarks

Benchmark | Method | Metrics
answerability-prediction-on-peerqa | GPT-3.5-Turbo-0613-16k | Macro F1: 0.3304
common-sense-reasoning-on-arc-challenge | GPT-3 175B (0-shot) | Accuracy: 51.4
common-sense-reasoning-on-arc-challenge | GPT-3 175B (1-shot) | Accuracy: 53.2
common-sense-reasoning-on-arc-easy | GPT-3 175B (1-shot) | Accuracy: 71.2
common-sense-reasoning-on-arc-easy | GPT-3 175B (0-shot) | Accuracy: 68.8
common-sense-reasoning-on-record | GPT-3 Large 760M (0-shot) | EM: 82.1
common-sense-reasoning-on-winogrande | GPT-3 Large 760M (0-shot) | Accuracy: 57.4
common-sense-reasoning-on-winogrande | GPT-3 175B (0-shot) | Accuracy: 70.2
coreference-resolution-on-winograd-schema | GPT-3 175B (few-shot) | Accuracy: 80.1
few-shot-learning-on-medconceptsqa | gpt-3.5-turbo | Accuracy: 41.476
language-modelling-on-lambada | GPT-3 175B (Few-Shot) | Accuracy: 86.4, Perplexity: 1.92
language-modelling-on-lambada | GPT-3 13B (Zero-Shot) | Accuracy: 72.5, Perplexity: 3.56
language-modelling-on-lambada | GPT-3 2.7B (Zero-Shot) | Accuracy: 67.1, Perplexity: 4.60
language-modelling-on-lambada | GPT-3 6.7B (Zero-Shot) | Accuracy: 70.3, Perplexity: 4.00
language-modelling-on-lambada | GPT-3 175B (Zero-Shot) | Accuracy: 76.2, Perplexity: 3.00
language-modelling-on-penn-treebank-word | GPT-3 (Zero-Shot) | Params: 175000M, Test perplexity: 20.5
multi-task-language-understanding-on-mmlu | GPT-3 175B (5-shot) | Average (%): 43.9
natural-language-inference-on-anli-test | GPT-3 | A1: 36.8, A2: 34, A3: 40.2
natural-language-inference-on-commitmentbank | GPT-3 175B (Few-Shot) | Accuracy: 75.6
natural-language-inference-on-commitmentbank | GPT-3 175B (few-shot, k=32) | F1: 52
natural-language-inference-on-rte | GPT-3 175B (few-shot, k=32) | Accuracy: 69%
question-answering-on-boolq | GPT-3 175B (few-shot, k=32) | Accuracy: 76.4
question-answering-on-boolq | GPT-3 175B (0-shot) | Accuracy: 60.5
question-answering-on-copa | GPT-3 175B (few-shot, k=32) | Accuracy: 92
question-answering-on-copa | GPT-3 Large 760M (0-shot) | Accuracy: 73.0
question-answering-on-copa | GPT-3 13B (few-shot, k=32) | Accuracy: 86
question-answering-on-copa | GPT-3 175B (0-shot) | Accuracy: 91
question-answering-on-copa | GPT-3 175B (1-shot) | Accuracy: 87
question-answering-on-coqa | GPT-3 175B (few-shot, k=32) | Overall: 85
question-answering-on-drop-test | GPT-3 175B (few-shot, k=32) | F1: 36.5
question-answering-on-multirc | GPT-3 175B (Few-Shot) | F1: 75.4
question-answering-on-natural-questions | GPT-3 175B (Few-Shot, k=64) | EM: 29.9
question-answering-on-obqa | GPT-3 175B (zero-shot) | Accuracy: 57.6
question-answering-on-openbookqa | GPT-3 175B (few-shot, k=32) | Accuracy: 65.4
question-answering-on-peerqa | GPT-3.5-Turbo-0613-16k | AlignScore: 0.1378, Prometheus-2 Answer Correctness: 3.0408, Rouge-L: 0.2414
question-answering-on-piqa | GPT-3 175B (0-shot) | Accuracy: 81.0
question-answering-on-piqa | GPT-3 Large 760M (0-shot) | Accuracy: 72.9
question-answering-on-quac | GPT-3 175B (few-shot, k=32) | F1: 44.3
question-answering-on-race | GPT-3 175B (few-shot, k=32) | RACE-m: 58.1
question-answering-on-race | GPT-3 175B (Few-Shot) | RACE-h: 46.8
question-answering-on-story-cloze | GPT-3 175B (Few-Shot) | Accuracy: 87.7
question-answering-on-storycloze | GPT-3 Large 760M (zero-shot) | Accuracy: 72.4
question-answering-on-triviaqa | GPT-3 175B (Few-Shot) | EM: 71.2
question-answering-on-webquestions | GPT-3-175B (Few-Shot) | EM: 41.5
question-answering-on-webquestions | GPT-3-175B (Zero-Shot) | EM: 14.4
question-answering-on-webquestions | GPT-3-175B (One-Shot) | EM: 25.3
question-answering-on-webquestions | Few-shot | EM: 44.7
reading-comprehension-on-race | GPT-3 175B (zero-shot) | Accuracy (High): 45.5
reading-comprehension-on-race | GPT-3 175B (0-shot) | Accuracy (Middle): 58.4
unsupervised-machine-translation-on-wmt2014-1 | GPT-3 175B (Few-Shot) | BLEU: 39.2
unsupervised-machine-translation-on-wmt2014-2 | GPT-3 175B (Few-Shot) | BLEU: 32.6
unsupervised-machine-translation-on-wmt2016 | GPT-3 175B (Few-Shot) | BLEU: 29.7
unsupervised-machine-translation-on-wmt2016-1 | GPT-3 175B (Few-Shot) | BLEU: 40.6
unsupervised-machine-translation-on-wmt2016-2 | GPT-3 175B (Few-Shot) | BLEU: 21
unsupervised-machine-translation-on-wmt2016-3 | GPT-3 175B (Few-Shot) | BLEU: 39.5
word-sense-disambiguation-on-words-in-context | GPT-3 175B (few-shot, k=32) | Accuracy: 49.4
zero-shot-learning-on-medconceptsqa | gpt-3.5-turbo | Accuracy: 37.058
