PaLM: Scaling Language Modeling with Pathways

Aakanksha Chowdhery; Sharan Narang; Jacob Devlin; Maarten Bosma; Gaurav Mishra; Adam Roberts; Paul Barham; Hyung Won Chung; Charles Sutton; Sebastian Gehrmann; Parker Schuh; Kensen Shi; Sasha Tsvyashchenko; Joshua Maynez; Abhishek Rao; Parker Barnes; Yi Tay; Noam Shazeer; Vinodkumar Prabhakaran; Emily Reif; Nan Du; Ben Hutchinson; Reiner Pope; James Bradbury; Jacob Austin; Michael Isard; Guy Gur-Ari; Pengcheng Yin; Toju Duke; Anselm Levskaya; Sanjay Ghemawat; Sunipa Dev; Henryk Michalewski; Xavier Garcia; Vedant Misra; Kevin Robinson; Liam Fedus; Denny Zhou; Daphne Ippolito; David Luan; Hyeontaek Lim; Barret Zoph; Alexander Spiridonov; Ryan Sepassi; David Dohan; Shivani Agrawal; Mark Omernick; Andrew M. Dai; Thanumalayan Sankaranarayana Pillai; Marie Pellat; Aitor Lewkowycz; Erica Moreira; Rewon Child; Oleksandr Polozov; Katherine Lee; Zongwei Zhou; Xuezhi Wang; Brennan Saeta; Mark Diaz; Orhan Firat; Michele Catasta; Jason Wei; Kathy Meier-Hellstern; Douglas Eck; Jeff Dean; Slav Petrov; Noah Fiedel


Abstract

Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated Transformer language model, which we call the Pathways Language Model (PaLM). We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state of the art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis of bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies.
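The few-shot setup the abstract describes can be sketched as prompt construction: a handful of solved examples are concatenated ahead of the unsolved query, and the model is asked to complete the final answer. This is a minimal illustrative sketch; the questions and formatting below are hypothetical, and PaLM itself is not a publicly callable API here.

```python
def build_few_shot_prompt(examples, query):
    """Concatenate k solved (question, answer) examples, then the unsolved query.

    The model is expected to continue the text after the final "A:".
    """
    blocks = [f"Q: {question}\nA: {answer}" for question, answer in examples]
    blocks.append(f"Q: {query}\nA:")
    return "\n\n".join(blocks)

# Hypothetical 2-shot prompt (k=2):
examples = [
    ("What is the capital of France?", "Paris"),
    ("What is the capital of Japan?", "Tokyo"),
]
prompt = build_few_shot_prompt(examples, "What is the capital of Italy?")
print(prompt)
```

No gradient updates are involved: adapting to a new task only requires changing the examples in the prompt, which is why few-shot evaluation needs so few task-specific training examples.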

Code Repositories

chrisociepa/allamo (PyTorch)
foundation-model-stack/fms-fsdp (PyTorch)
google/paxml (JAX)
lucidrains/CoCa-pytorch (PyTorch)

Benchmarks

Benchmark | Methodology | Metrics
auto-debugging-on-big-bench-lite | PaLM 62B (few-shot, k=5) | Exact string match: 38.2
auto-debugging-on-big-bench-lite | PaLM 8B (few-shot, k=5) | Exact string match: 14.7
auto-debugging-on-big-bench-lite | PaLM 540B (few-shot, k=5) | Exact string match: 38.2
code-generation-on-mbpp | PaLM Coder 540B | Accuracy: 47
code-generation-on-mbpp | PaLM 540B | Accuracy: 36.8
common-sense-reasoning-on-big-bench-known | PaLM-540B (few-shot, k=5) | Accuracy: 73.9
common-sense-reasoning-on-big-bench-winowhy | PaLM-62B (few-shot, k=5) | Accuracy: 61.0
common-sense-reasoning-on-big-bench-winowhy | PaLM-540B (few-shot, k=5) | Accuracy: 65.9
common-sense-reasoning-on-record | PaLM 540B (finetuned) | EM: 94.0, F1: 94.6
common-sense-reasoning-on-winogrande | PaLM 62B (0-shot) | Accuracy: 77.0
common-sense-reasoning-on-winogrande | PaLM 540B (0-shot) | Accuracy: 81.1
common-sense-reasoning-on-winogrande | PaLM-cont 62B (0-shot) | Accuracy: 77.0
coreference-resolution-on-winograd-schema | PaLM 540B (1-shot) | Accuracy: 86.3
coreference-resolution-on-winograd-schema | PaLM 540B (0-shot) | Accuracy: 89.1
coreference-resolution-on-winograd-schema | PaLM 540B (fine-tuned) | Accuracy: 100
coreference-resolution-on-winograd-schema | PaLM 540B (5-shot) | Accuracy: 89.5
cross-lingual-question-answering-on-tydiqa | PaLM-540B (CoT) | EM: 52.9
extreme-summarization-on-gem-xsum | PaLM (finetuning)-540B | Parameters: 540 B, ROUGE-2: 21.2
extreme-summarization-on-gem-xsum | T5-XXL | ROUGE-2: 21.0
extreme-summarization-on-gem-xsum | PaLM (finetuning)-62B | Parameters: 62 B, ROUGE-2: 18.5
language-modelling-on-lambada | PaLM-540B (Zero-Shot) | Accuracy: 77.9
language-modelling-on-lambada | PaLM-540B (Few-Shot) | Accuracy: 89.7
language-modelling-on-lambada | PaLM-540B (One-Shot) | Accuracy: 81.8
logical-reasoning-on-big-bench-strategyqa | PaLM-62B (few-shot, k=5) | Accuracy: 65.4
logical-reasoning-on-big-bench-strategyqa | PaLM-540B (few-shot, k=5) | Accuracy: 73.9
memorization-on-big-bench-hindu-knowledge | PaLM-540B (few-shot, k=5) | Accuracy: 95.4
memorization-on-big-bench-hindu-knowledge | PaLM-62B (few-shot, k=5) | Accuracy: 77.7
multi-task-language-understanding-on-mgsm | PaLM 540B | Average (%): 55.0
multiple-choice-question-answering-mcqa-on-31 | PaLM-62B (few-shot, k=5) | Accuracy: 59.4
multiple-choice-question-answering-mcqa-on-31 | PaLM-540B (few-shot, k=5) | Accuracy: 71.9
natural-language-inference-on-commitmentbank | PaLM 540B (finetuned) | Accuracy: 100, F1: 100
natural-language-inference-on-rte | PaLM 540B (1-shot) | Accuracy: 78.7%
natural-language-inference-on-rte | PaLM 540B (0-shot) | Accuracy: 72.9%
natural-language-inference-on-rte | PaLM 540B (5-shot) | Accuracy: 79.6%
natural-language-inference-on-rte | PaLM 540B (fine-tuned) | Accuracy: 95.7%
question-answering-on-boolq | PaLM 540B (fine-tuned) | Accuracy: 92.2
question-answering-on-copa | PaLM 540B (finetuned) | Accuracy: 100
question-answering-on-multirc | PaLM 540B (finetuned) | EM: 69.2, F1: 90.1
question-answering-on-natural-questions | PaLM-540B (Zero-Shot) | EM: 21.2
question-answering-on-natural-questions | PaLM-540B (One-Shot) | EM: 29.3
question-answering-on-natural-questions | PaLM-540B (Few-Shot, k=64) | EM: 39.6
question-answering-on-obqa | PaLM 540B (zero-shot) | Accuracy: 53.4
question-answering-on-obqa | PaLM 62B (zero-shot) | Accuracy: 50.4
question-answering-on-triviaqa | PaLM-540B (Zero-Shot) | EM: 76.9
question-answering-on-triviaqa | PaLM-540B (One-Shot) | EM: 81.4
question-answering-on-triviaqa | PaLM-540B (Few-Shot) | EM: 81.4
question-answering-on-webquestions | PaLM-540B (Zero-Shot) | EM: 10.6
question-answering-on-webquestions | PaLM-540B (One-Shot) | EM: 22.6
question-answering-on-webquestions | PaLM-540B (Few-Shot) | EM: 43.5
reading-comprehension-on-race | PaLM 8B (zero-shot) | Accuracy (High): 42.3, Accuracy (Middle): 57.9
reading-comprehension-on-race | PaLM 540B (zero-shot) | Accuracy (High): 49.1, Accuracy (Middle): 68.1
reading-comprehension-on-race | PaLM 62B (zero-shot) | Accuracy (High): 47.5, Accuracy (Middle): 64.3
word-sense-disambiguation-on-words-in-context | PaLM 540B (finetuned) | Accuracy: 78.8
