
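Abstract
Language models take an important step toward intelligent communication systems by drawing on large repositories of written human knowledge to better predict and understand the world. In this paper, we analyze the performance of Transformer-based language models across a wide range of scales, from models with tens of millions of parameters up to Gopher, a model with 280 billion parameters. These models are evaluated on 152 diverse tasks and achieve state-of-the-art performance on the majority of them. Gains from scale are largest in areas such as reading comprehension, fact-checking, and toxic language identification, and comparatively small for logical and mathematical reasoning. We provide a comprehensive analysis of the training dataset and the models' behaviour, and examine how model scale relates to bias and toxicity. Finally, we discuss the application of language models to AI safety and how to mitigate downstream risks.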
Benchmarks
| Benchmark | Method | Metric |
|---|---|---|
| abstract-algebra-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 25.0 |
| analogical-similarity-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 17.2 |
| analytic-entailment-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 53.0 |
| anatomy-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 56.3 |
| astronomy-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 65.8 |
| business-ethics-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 70.0 |
| clinical-knowledge-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 67.2 |
| college-biology-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 70.8 |
| college-chemistry-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 45.0 |
| college-computer-science-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 49 |
| college-mathematics-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 37.0 |
| college-medicine-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 60.1 |
| college-physics-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 34.3 |
| common-sense-reasoning-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 45.5 |
| common-sense-reasoning-on-big-bench-causal | Gopher-280B (few-shot, k=5) | Accuracy: 50.8 |
| common-sense-reasoning-on-big-bench-date | Gopher-280B (few-shot, k=5) | Accuracy: 44.1 |
| common-sense-reasoning-on-big-bench-known | Gopher-280B (few-shot, k=5) | Accuracy: 63.6 |
| common-sense-reasoning-on-big-bench-logical | Gopher-280B (few-shot, k=5) | Accuracy: 36.4 |
| common-sense-reasoning-on-big-bench-sports | Gopher-280B (few-shot, k=5) | Accuracy: 54.9 |
| common-sense-reasoning-on-big-bench-winowhy | Gopher-280B (few-shot, k=5) | Accuracy: 56.7 |
| common-sense-reasoning-on-winogrande | Gopher 280B (0-shot) | Accuracy: 70.1 |
| computer-security-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 65.0 |
| conceptual-physics-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 49.4 |
| crash-blossom-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 63.6 |
| crass-ai-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 56.8 |
| dark-humor-detection-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 83.1 |
| discourse-marker-prediction-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 11.7 |
| econometrics-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 43 |
| electrical-engineering-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 60 |
| elementary-mathematics-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 33.6 |
| empirical-judgments-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 52.5 |
| english-proverbs-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 57.6 |
| entailed-polarity-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 89.5 |
| epistemic-reasoning-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 56.4 |
| evaluating-information-essentiality-on-big | Gopher-280B (few-shot, k=5) | Accuracy: 16.7 |
| fantasy-reasoning-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 64.1 |
| fever-2-way-on-big-bench | Gopher-280B (few-shot, k=10) | Accuracy: 77.5 |
| fever-3-way-on-big-bench | Gopher-280B (few-shot, k=15) | Accuracy: 77.5 |
| figure-of-speech-detection-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 52.7 |
| formal-logic-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 35.7 |
| general-knowledge-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 93.9 |
| global-facts-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 38.0 |
| gre-reading-comprehension-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 27.3 |
| high-school-biology-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 71.3 |
| high-school-chemistry-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 47.8 |
| high-school-computer-science-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 54.0 |
| high-school-european-history-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 72.1 |
| high-school-geography-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 76.8 |
| high-school-government-and-politics-on-big | Gopher-280B (few-shot, k=5) | Accuracy: 83.9 |
| high-school-macroeconomics-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 65.1 |
| high-school-mathematics-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 23.7 |
| high-school-microeconomics-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 66.4 |
| high-school-physics-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 33.8 |
| high-school-psychology-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 81.8 |
| high-school-statistics-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 50 |
| high-school-us-history-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 78.9 |
| high-school-world-history-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 75.1 |
| human-aging-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 66.4 |
| human-organs-senses-multiple-choice-on-big | Gopher-280B (few-shot, k=5) | Accuracy: 84.8 |
| human-sexuality-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 67.2 |
| identify-odd-metapor-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 38.6 |
| implicatures-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 62.0 |
| implicit-relations-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 36.4 |
| intent-recognition-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 88.7 |
| international-law-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 77.7 |
| irony-identification-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 69.7 |
| jurisprudence-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 71.3 |
| lambada-on-big-bench | Gopher-280B (zero-shot) | Accuracy: 74.5 |
| language-modelling-on-arxiv | Gopher | BPB: 0.662 |
| language-modelling-on-bookcorpus2 | Gopher | BPB: 0.741 |
| language-modelling-on-books3 | Gopher | BPB: 0.712 |
| language-modelling-on-curation-corpus | Gopher | BPB: 0.475 |
| language-modelling-on-dm-mathematics | Gopher | BPB: 1.14 |
| language-modelling-on-freelaw | Gopher | BPB: 0.513 |
| language-modelling-on-github | Gopher | BPB: 0.377 |
| language-modelling-on-gutenberg-pg-19 | Gopher | BPB: 0.656 |
| language-modelling-on-hackernews | Gopher | BPB: 0.890 |
| language-modelling-on-nih-exporter | Gopher | BPB: 0.590 |
| language-modelling-on-opensubtitles | Gopher | BPB: 0.899 |
| language-modelling-on-openwebtext2 | Gopher | BPB: 0.677 |
| language-modelling-on-philpapers | Gopher | BPB: 0.695 |
| language-modelling-on-pile-cc | Gopher | BPB: 0.691 |
| language-modelling-on-pubmed-abstracts | Gopher | BPB: 0.577 |
| language-modelling-on-pubmed-central | Gopher | BPB: 0.525 |
| language-modelling-on-stackexchange | Gopher | BPB: 0.641 |
| language-modelling-on-ubuntu-irc | Gopher | BPB: 1.09 |
| language-modelling-on-uspto-backgrounds | Gopher | BPB: 0.546 |
| logical-args-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 59.1 |
| logical-fallacies-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 72.4 |
| logical-reasoning-on-big-bench-formal | Gopher-280B (few-shot, k=5) | Accuracy: 50.7 |
| logical-reasoning-on-big-bench-logic-grid | Gopher-280B (few-shot, k=5) | Accuracy: 35.1 |
| logical-reasoning-on-big-bench-logical | Gopher-280B (few-shot, k=5) | Accuracy: 58.9 |
| logical-reasoning-on-big-bench-penguins-in-a | Gopher-280B (few-shot, k=5) | Accuracy: 40.6 |
| logical-reasoning-on-big-bench-reasoning | Gopher-280B (few-shot, k=5) | Accuracy: 49.2 |
| logical-reasoning-on-big-bench-strategyqa | Gopher-280B (few-shot, k=5) | Accuracy: 61.0 |
| logical-reasoning-on-big-bench-temporal | Gopher-280B (few-shot, k=5) | Accuracy: 19.0 |
| machine-learning-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 41.1 |
| management-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 77.7 |
| marketing-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 83.3 |
| mathematical-induction-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 57.6 |
| medical-genetics-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 69.0 |
| memorization-on-big-bench-hindu-knowledge | Gopher-280B (few-shot, k=5) | Accuracy: 80 |
| metaphor-boolean-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 59.3 |
| miscellaneous-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 75.7 |
| misconceptions-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 61.7 |
| moral-disputes-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 66.8 |
| moral-permissibility-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 55.1 |
| moral-scenarios-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 40.2 |
| movie-dialog-same-or-different-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 50.7 |
| multi-task-language-understanding-on-mmlu | Gopher 7.1B (5-shot) | Average (%): 29.5 |
| multiple-choice-question-answering-mcqa-on-27 | Gopher-280B (few-shot, k=5) | Accuracy: 51.7 |
| multiple-choice-question-answering-mcqa-on-28 | Gopher-280B (few-shot, k=5) | Accuracy: 50.5 |
| multiple-choice-question-answering-mcqa-on-29 | Gopher-280B (few-shot, k=5) | Accuracy: 51.1 |
| multiple-choice-question-answering-mcqa-on-30 | Gopher-280B (few-shot, k=5) | Accuracy: 38.6 |
| multiple-choice-question-answering-mcqa-on-31 | Gopher-280B (few-shot, k=5) | Accuracy: 59.1 |
| natural-questions-on-big-bench | Gopher-280B (few-shot, k=64) | Accuracy: 28.2 |
| nonsense-words-grammar-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 61.4 |
| nutrition-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 69.9 |
| odd-one-out-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 32.5 |
| philosophy-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 68.8 |
| phrase-relatedness-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 81.8 |
| physical-intuition-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 59.7 |
| physics-mc-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 50.9 |
| prehistory-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 67.6 |
| presuppositions-as-nli-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 34.0 |
| professional-accounting-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 44.3 |
| professional-law-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 44.5 |
| professional-medicine-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 64.0 |
| professional-psychology-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 68.1 |
| public-relations-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 71.8 |
| question-answering-on-boolq | Gopher (zero-shot) | Accuracy: 79.3 |
| question-answering-on-natural-questions | Gopher (few-shot, k=64) | EM: 28.2 |
| question-answering-on-piqa | Gopher 280B (0-shot) | Accuracy: 81.8 |
| question-answering-on-social-iqa | Gopher (zero-shot) | Accuracy: 50.6 |
| question-answering-on-truthfulqa | Gopher 280B (zero-shot, QA prompts) | MC1: 0.27 |
| question-answering-on-truthfulqa | Gopher 7.1B (zero-shot, QA prompts) | MC1: 0.25 |
| question-answering-on-truthfulqa | Gopher 7.1B (zero-shot, Our Prompt + Choices) | MC1: 0.23 |
| question-answering-on-truthfulqa | Gopher 1.4B (zero-shot, QA prompts) | MC1: 0.23 |
| question-answering-on-truthfulqa | Gopher 280B (zero-shot, Our Prompt + Choices) | MC1: 0.295 |
| question-answering-on-truthfulqa | Gopher 1.4B (zero-shot, Our Prompt + Choices) | MC1: 0.217 |
| question-selection-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 41.4 |
| race-h-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 71.6 |
| race-m-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 75.1 |
| riddle-sense-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 68.2 |
| sarcasm-detection-on-big-bench-snarks | Gopher-280B (few-shot, k=5) | Accuracy: 48.3 |
| security-studies-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 64.9 |
| sentence-ambiguity-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 69.1 |
| similarities-abstraction-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 81.8 |
| sociology-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 84.1 |
| timedial-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 50.9 |
| triviaqa-on-big-bench | Gopher-280B (few-shot, k=64) | Accuracy: 57.1 |
| understanding-fables-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 39.6 |
| us-foreign-policy-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 81.0 |
| virology-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 47.0 |
| word-sense-disambiguation-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 56.4 |
| world-religions-on-big-bench | Gopher-280B (few-shot, k=5) | Accuracy: 84.2 |