
Abstract
Large language models have been shown to achieve remarkable performance on few-shot learning, dramatically reducing the number of task-specific training examples needed to adapt a model to a particular application. To further study the impact of scale on few-shot learning, we trained a 540-billion-parameter, densely activated Transformer language model, which we call the Pathways Language Model (PaLM). We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system that enables highly efficient training across multiple TPU Pods. We demonstrate the benefits of continued scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On many of these tasks, PaLM 540B achieves breakthrough performance, outperforming fine-tuned state-of-the-art models on a suite of multi-step reasoning tasks and exceeding average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks show discontinuous improvements from model scale, meaning performance rises sharply as we scale to our largest model. PaLM also has strong capabilities in multilingual tasks and source-code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis of bias and toxicity, and study the extent of training-data memorization across model scales. Finally, we discuss the ethical considerations related to large language models and potential mitigation strategies.
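The k-shot evaluation setting used throughout the benchmark results below (e.g. "few-shot, k=5") can be sketched as follows: the model is shown k labeled demonstrations followed by the test input, and must complete the answer. This is a minimal illustration only; the helper function and the toy examples are hypothetical and not drawn from the paper's actual evaluation harness.

```python
def build_few_shot_prompt(examples, query, k=5):
    """Concatenate up to k input/output demonstrations, then the query.

    `examples` is a list of (input, output) pairs; the returned prompt
    ends with an open "A:" for the model to complete.
    """
    blocks = [f"Q: {inp}\nA: {out}" for inp, out in examples[:k]]
    blocks.append(f"Q: {query}\nA:")
    return "\n\n".join(blocks)


# Toy usage (hypothetical task, k=2):
examples = [
    ("What is the capital of France?", "Paris"),
    ("What is the capital of Japan?", "Tokyo"),
]
prompt = build_few_shot_prompt(examples, "What is the capital of Italy?", k=2)
print(prompt)
```

A zero-shot evaluation is the same construction with k=0 (no demonstrations), and one-shot uses k=1.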
Code Repositories
- chrisociepa/allamo (pytorch, mentioned on GitHub)
- foundation-model-stack/fms-fsdp (pytorch, mentioned on GitHub)
- google/paxml (jax, mentioned on GitHub)
- lucidrains/PaLM-pytorch (pytorch)
- lucidrains/CoCa-pytorch (pytorch, mentioned on GitHub)
Benchmarks
| Benchmark | Method | Metric |
|---|---|---|
| auto-debugging-on-big-bench-lite | PaLM 62B (few-shot, k=5) | Exact string match: 38.2 |
| auto-debugging-on-big-bench-lite | PaLM 8B (few-shot, k=5) | Exact string match: 14.7 |
| auto-debugging-on-big-bench-lite | PaLM 540B (few-shot, k=5) | Exact string match: 38.2 |
| code-generation-on-mbpp | PaLM Coder 540B | Accuracy: 47 |
| code-generation-on-mbpp | PaLM 540B | Accuracy: 36.8 |
| common-sense-reasoning-on-big-bench-known | PaLM-540B (few-shot, k=5) | Accuracy: 73.9 |
| common-sense-reasoning-on-big-bench-winowhy | PaLM-62B (few-shot, k=5) | Accuracy: 61.0 |
| common-sense-reasoning-on-big-bench-winowhy | PaLM-540B (few-shot, k=5) | Accuracy: 65.9 |
| common-sense-reasoning-on-record | PaLM 540B (finetuned) | EM: 94.0, F1: 94.6 |
| common-sense-reasoning-on-winogrande | PaLM 62B (0-shot) | Accuracy: 77.0 |
| common-sense-reasoning-on-winogrande | PaLM 540B (0-shot) | Accuracy: 81.1 |
| common-sense-reasoning-on-winogrande | PaLM-cont 62B (0-shot) | Accuracy: 77.0 |
| coreference-resolution-on-winograd-schema | PaLM 540B (1-shot) | Accuracy: 86.3 |
| coreference-resolution-on-winograd-schema | PaLM 540B (0-shot) | Accuracy: 89.1 |
| coreference-resolution-on-winograd-schema | PaLM 540B (fine-tuned) | Accuracy: 100 |
| coreference-resolution-on-winograd-schema | PaLM 540B (5-shot) | Accuracy: 89.5 |
| cross-lingual-question-answering-on-tydiqa | PaLM-540B (CoT) | EM: 52.9 |
| extreme-summarization-on-gem-xsum | PaLM (finetuning)-540B | Parameters: 540 B, ROUGE-2: 21.2 |
| extreme-summarization-on-gem-xsum | T5-XXL | ROUGE-2: 21.0 |
| extreme-summarization-on-gem-xsum | PaLM (finetuning)-62B | Parameters: 62 B, ROUGE-2: 18.5 |
| language-modelling-on-lambada | PaLM-540B (Zero-Shot) | Accuracy: 77.9 |
| language-modelling-on-lambada | PaLM-540B (Few-Shot) | Accuracy: 89.7 |
| language-modelling-on-lambada | PaLM-540B (One-Shot) | Accuracy: 81.8 |
| logical-reasoning-on-big-bench-strategyqa | PaLM-62B (few-shot, k=5) | Accuracy: 65.4 |
| logical-reasoning-on-big-bench-strategyqa | PaLM-540B (few-shot, k=5) | Accuracy: 73.9 |
| memorization-on-big-bench-hindu-knowledge | PaLM-540B (few-shot, k=5) | Accuracy: 95.4 |
| memorization-on-big-bench-hindu-knowledge | PaLM-62B (few-shot, k=5) | Accuracy: 77.7 |
| multi-task-language-understanding-on-mgsm | PaLM 540B | Average (%): 55.0 |
| multiple-choice-question-answering-mcqa-on-31 | PaLM-62B (few-shot, k=5) | Accuracy: 59.4 |
| multiple-choice-question-answering-mcqa-on-31 | PaLM-540B (few-shot, k=5) | Accuracy: 71.9 |
| natural-language-inference-on-commitmentbank | PaLM 540B (finetuned) | Accuracy: 100, F1: 100 |
| natural-language-inference-on-rte | PaLM 540B (1-shot) | Accuracy: 78.7% |
| natural-language-inference-on-rte | PaLM 540B (0-shot) | Accuracy: 72.9% |
| natural-language-inference-on-rte | PaLM 540B (5-shot) | Accuracy: 79.6% |
| natural-language-inference-on-rte | PaLM 540B (fine-tuned) | Accuracy: 95.7% |
| question-answering-on-boolq | PaLM 540B (fine-tuned) | Accuracy: 92.2 |
| question-answering-on-copa | PaLM 540B (finetuned) | Accuracy: 100 |
| question-answering-on-multirc | PaLM 540B (finetuned) | EM: 69.2, F1: 90.1 |
| question-answering-on-natural-questions | PaLM-540B (Zero-Shot) | EM: 21.2 |
| question-answering-on-natural-questions | PaLM-540B (One-Shot) | EM: 29.3 |
| question-answering-on-natural-questions | PaLM-540B (Few-Shot, k=64) | EM: 39.6 |
| question-answering-on-obqa | PaLM 540B (zero-shot) | Accuracy: 53.4 |
| question-answering-on-obqa | PaLM 62B (zero-shot) | Accuracy: 50.4 |
| question-answering-on-triviaqa | PaLM-540B (Zero-Shot) | EM: 76.9 |
| question-answering-on-triviaqa | PaLM-540B (One-Shot) | EM: 81.4 |
| question-answering-on-triviaqa | PaLM-540B (Few-Shot) | EM: 81.4 |
| question-answering-on-webquestions | PaLM-540B (Zero-Shot) | EM: 10.6 |
| question-answering-on-webquestions | PaLM-540B (One-Shot) | EM: 22.6 |
| question-answering-on-webquestions | PaLM-540B (Few-Shot) | EM: 43.5 |
| reading-comprehension-on-race | PaLM 8B (zero-shot) | Accuracy (High): 42.3, Accuracy (Middle): 57.9 |
| reading-comprehension-on-race | PaLM 540B (zero-shot) | Accuracy (High): 49.1, Accuracy (Middle): 68.1 |
| reading-comprehension-on-race | PaLM 62B (zero-shot) | Accuracy (High): 47.5, Accuracy (Middle): 64.3 |
| word-sense-disambiguation-on-words-in-context | PaLM 540B (finetuned) | Accuracy: 78.8 |