UL2: Unifying Language Learning Paradigms
Yi Tay; Mostafa Dehghani; Vinh Q. Tran; Xavier Garcia; Jason Wei; Xuezhi Wang; Hyung Won Chung; Siamak Shakeri; Dara Bahri; Tal Schuster; Huaixiu Steven Zheng; Denny Zhou; Neil Houlsby; Donald Metzler

Abstract
Existing pre-trained models are generally geared towards a particular class of problems. To date, there still seems to be no consensus on what the right architecture and pre-training setup should be. This paper presents a unified framework for pre-training models that are universally effective across datasets and setups. We begin by disentangling architectural archetypes from pre-training objectives -- two concepts that are commonly conflated. Next, we present a generalized and unified perspective for self-supervision in NLP, showing how different pre-training objectives can be cast as one another and how interpolating between objectives can be effective. We then propose Mixture-of-Denoisers (MoD), a pre-training objective that combines diverse pre-training paradigms. We further introduce a notion of mode switching, wherein downstream fine-tuning is associated with specific pre-training schemes. We conduct extensive ablative experiments comparing multiple pre-training objectives and find that our method pushes the Pareto frontier, outperforming T5- and GPT-like models across multiple diverse setups. By scaling our model up to 20B parameters, we achieve SOTA performance on 50 well-established supervised fine-tuning-based NLP tasks. Our model also achieves strong results at in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization. On 0-shot MMLU, UL2 20B outperforms T0 and T5 models. UL2 20B also works well with chain-of-thought prompting and reasoning, making it an appealing choice for research into reasoning at a small-to-medium scale of 20B parameters. Finally, we apply FLAN instruction tuning to the UL2 20B model, achieving MMLU and Big-Bench scores competitive with FLAN-PaLM 62B. We release Flax-based T5X checkpoints for UL2 20B and Flan-UL2 20B.
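The Mixture-of-Denoisers objective described above can be illustrated with a minimal sketch. UL2 mixes several denoising paradigms, each signaled by a mode token: regular short-span corruption (T5-like), extreme corruption with long spans or high corruption rates, and sequential prefix language modeling (GPT-like). The specific mode-token strings, span lengths, corruption rates, and uniform sampling below are illustrative assumptions, not the paper's released configuration.

```python
import random

# Hypothetical MoD configuration: (mode token, mean span length, corruption rate).
# The R/X/S taxonomy follows the paper; the numbers are illustrative assumptions.
DENOISERS = [
    ("[R]", 3, 0.15),    # R-denoising: short spans, low corruption (T5-like)
    ("[X]", 32, 0.50),   # X-denoising: long spans / heavy corruption
    ("[S]", None, None), # S-denoising: prefix language modeling (GPT-like)
]

def corrupt(tokens, mode, mean_span, rate, rng):
    """Build one (inputs, targets) denoising example for the given mode."""
    if mode == "[S]":
        # Prefix LM: split at a random point and predict the suffix.
        split = rng.randint(1, len(tokens) - 1)
        return [mode] + tokens[:split], tokens[split:]
    # Span corruption: replace sampled spans with sentinel markers in the
    # inputs; the targets list each sentinel followed by the dropped span.
    inputs, targets = [mode], []
    i, sentinel = 0, 0
    while i < len(tokens):
        if rng.random() < rate / mean_span:
            span = max(1, round(rng.gauss(mean_span, 1)))
            marker = f"<extra_id_{sentinel}>"
            inputs.append(marker)
            targets.append(marker)
            targets.extend(tokens[i:i + span])
            i += span
            sentinel += 1
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets

def mod_example(tokens, rng=None):
    """Sample one denoiser and corrupt one sequence with it."""
    rng = rng or random.Random()
    mode, mean_span, rate = rng.choice(DENOISERS)
    return corrupt(tokens, mode, mean_span, rate, rng)
```

The mode token prepended to the inputs is what enables the mode switching mentioned in the abstract: at fine-tuning or inference time, the same token can be supplied to condition the model on the most suitable pre-training paradigm.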
Benchmarks
| Benchmark | Model (setting) | Metrics |
|---|---|---|
| arithmetic-reasoning-on-gsm8k | UL2 20B (chain-of-thought) | Accuracy: 4.4; Parameters (B): 20 |
| arithmetic-reasoning-on-gsm8k | UL2 20B (0-shot) | Accuracy: 4.1; Parameters (B): 20 |
| common-sense-reasoning-on-arc-challenge | UL2 20B (chain-of-thought + self-consistency) | Accuracy: 49.5 |
| common-sense-reasoning-on-arc-challenge | UL2 20B (zero-shot) | Accuracy: 29.8 |
| common-sense-reasoning-on-arc-challenge | UL2 20B (chain-of-thought) | Accuracy: 42.9 |
| common-sense-reasoning-on-arc-easy | UL2 20B (0-shot) | Accuracy: 32.2 |
| common-sense-reasoning-on-arc-easy | UL2 20B (chain-of-thought + self-consistency) | Accuracy: 69.8 |
| common-sense-reasoning-on-arc-easy | UL2 20B (chain-of-thought) | Accuracy: 38.4 |
| common-sense-reasoning-on-commonsenseqa | UL2 20B (chain-of-thought) | Accuracy: 51.4 |
| common-sense-reasoning-on-commonsenseqa | UL2 20B (zero-shot) | Accuracy: 34.2 |
| common-sense-reasoning-on-commonsenseqa | UL2 20B (chain-of-thought + self-consistency) | Accuracy: 55.7 |
| coreference-resolution-on-winograd-schema | UL2 20B (fine-tuned) | Accuracy: 98.1 |
| coreference-resolution-on-winograd-schema | UL2 20B (0-shot) | Accuracy: 79.9 |
| long-range-modeling-on-scrolls | UL2 20B | CNLI: 88.7 |
| long-range-modeling-on-scrolls | UL2 | Avg.: 37.87; GovRep: 53.6 / 26.1 / 28.8; Nrtv: 24.2; QALT EM-T/H: 45.8 / 40.7; QMSum: 31.1 / 8.5 / 20.4; Qspr: 37.6; SumScr: 32.9 / 7.8 / 19.4 |
| multi-task-language-understanding-on-mmlu | UL2 20B (5-shot) | Average (%): 39.2 |
| natural-language-inference-on-rte | UL2 20B (0-shot) | Accuracy: 60.7% |
| natural-language-inference-on-rte | UL2 20B (fine-tuned) | Accuracy: 92.1% |
| question-answering-on-boolq | UL2 20B (0-shot) | Accuracy: 63.1 |
| question-answering-on-boolq | UL2 20B (fine-tuned) | Accuracy: 90.8 |
| question-answering-on-copa | UL2 20B (0-shot) | Accuracy: 85 |
| question-answering-on-copa | UL2 20B (fine-tuned) | Accuracy: 99 |
| word-sense-disambiguation-on-words-in-context | UL2 20B (fine-tuned) | Accuracy: 77.3 |
| word-sense-disambiguation-on-words-in-context | UL2 20B (0-shot) | Accuracy: 49.8 |