UL2: Unifying Language Learning Paradigms
Yi Tay; Mostafa Dehghani; Vinh Q. Tran; Xavier Garcia; Jason Wei; Xuezhi Wang; Hyung Won Chung; Siamak Shakeri; Dara Bahri; Tal Schuster; Huaixiu Steven Zheng; Denny Zhou; Neil Houlsby; Donald Metzler

Abstract
Existing pre-trained models are generally geared towards a particular class of problems. To date, there still seems to be no consensus on what the right architecture and pre-training setup should be. This paper presents a unified framework for pre-training models that are universally effective across datasets and setups. We begin by disentangling architectural archetypes from pre-training objectives -- two concepts that are commonly conflated. Next, we present a generalized and unified perspective for self-supervision in NLP, showing how different pre-training objectives can be cast as one another and how interpolating between objectives can be effective. We then propose Mixture-of-Denoisers (MoD), a pre-training objective that combines diverse pre-training paradigms. We further introduce a notion of mode switching, wherein downstream fine-tuning is associated with specific pre-training schemes. We conduct extensive ablative experiments comparing multiple pre-training objectives and find that our method pushes the Pareto frontier, outperforming T5- and GPT-like models across multiple diverse setups. By scaling our model up to 20B parameters, we achieve SOTA performance on 50 well-established supervised fine-tuning-based NLP tasks. Our model also achieves strong results at in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization. On 0-shot MMLU, UL2 20B outperforms T0 and T5 models. UL2 20B also works well with chain-of-thought prompting and reasoning, making it an appealing choice for research into reasoning at a small-to-medium scale of 20B parameters. Finally, we apply FLAN instruction tuning to the UL2 20B model, achieving MMLU and Big-Bench scores competitive with FLAN-PaLM 62B. We release Flax-based T5X checkpoints for UL2 20B and Flan-UL2 20B.
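The Mixture-of-Denoisers objective described above can be illustrated with a minimal sketch. UL2 mixes several denoising paradigms, each signaled by a mode token: regular short-span corruption (T5-like), extreme corruption with long spans or high corruption rates, and sequential prefix language modeling (GPT-like). The specific mode-token strings, span lengths, corruption rates, and uniform sampling below are illustrative assumptions, not the paper's released configuration.

```python
import random

# Hypothetical MoD configuration: (mode token, mean span length, corruption rate).
# The R/X/S taxonomy follows the paper; the numbers are illustrative assumptions.
DENOISERS = [
    ("[R]", 3, 0.15),    # R-denoising: short spans, low corruption (T5-like)
    ("[X]", 32, 0.50),   # X-denoising: long spans / heavy corruption
    ("[S]", None, None), # S-denoising: prefix language modeling (GPT-like)
]

def corrupt(tokens, mode, mean_span, rate, rng):
    """Build one (inputs, targets) denoising example for the given mode."""
    if mode == "[S]":
        # Prefix LM: split at a random point and predict the suffix.
        split = rng.randint(1, len(tokens) - 1)
        return [mode] + tokens[:split], tokens[split:]
    # Span corruption: replace sampled spans with sentinel markers in the
    # inputs; the targets list each sentinel followed by the dropped span.
    inputs, targets = [mode], []
    i, sentinel = 0, 0
    while i < len(tokens):
        if rng.random() < rate / mean_span:
            span = max(1, round(rng.gauss(mean_span, 1)))
            marker = f"<extra_id_{sentinel}>"
            inputs.append(marker)
            targets.append(marker)
            targets.extend(tokens[i:i + span])
            i += span
            sentinel += 1
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets

def mod_example(tokens, rng=None):
    """Sample one denoiser and corrupt one sequence with it."""
    rng = rng or random.Random()
    mode, mean_span, rate = rng.choice(DENOISERS)
    return corrupt(tokens, mode, mean_span, rate, rng)
```

The mode token prepended to the inputs is what enables the mode switching mentioned in the abstract: at fine-tuning or inference time, the same token can be supplied to condition the model on the most suitable pre-training paradigm.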
Benchmarks
| Benchmark | Model (setting) | Metrics |
|---|---|---|
| arithmetic-reasoning-on-gsm8k | UL2 20B (chain-of-thought) | Accuracy: 4.4; Parameters (B): 20 |
| arithmetic-reasoning-on-gsm8k | UL2 20B (0-shot) | Accuracy: 4.1; Parameters (B): 20 |
| common-sense-reasoning-on-arc-challenge | UL2 20B (chain-of-thought + self-consistency) | Accuracy: 49.5 |
| common-sense-reasoning-on-arc-challenge | UL2 20B (zero-shot) | Accuracy: 29.8 |
| common-sense-reasoning-on-arc-challenge | UL2 20B (chain-of-thought) | Accuracy: 42.9 |
| common-sense-reasoning-on-arc-easy | UL2 20B (0-shot) | Accuracy: 32.2 |
| common-sense-reasoning-on-arc-easy | UL2 20B (chain-of-thought + self-consistency) | Accuracy: 69.8 |
| common-sense-reasoning-on-arc-easy | UL2 20B (chain-of-thought) | Accuracy: 38.4 |
| common-sense-reasoning-on-commonsenseqa | UL2 20B (chain-of-thought) | Accuracy: 51.4 |
| common-sense-reasoning-on-commonsenseqa | UL2 20B (zero-shot) | Accuracy: 34.2 |
| common-sense-reasoning-on-commonsenseqa | UL2 20B (chain-of-thought + self-consistency) | Accuracy: 55.7 |
| coreference-resolution-on-winograd-schema | UL2 20B (fine-tuned) | Accuracy: 98.1 |
| coreference-resolution-on-winograd-schema | UL2 20B (0-shot) | Accuracy: 79.9 |
| long-range-modeling-on-scrolls | UL2 20B | CNLI: 88.7 |
| long-range-modeling-on-scrolls | UL2 | Avg.: 37.87; GovRep: 53.6 / 26.1 / 28.8; Nrtv: 24.2; QALT EM-T/H: 45.8 / 40.7; QMSum: 31.1 / 8.5 / 20.4; Qspr: 37.6; SumScr: 32.9 / 7.8 / 19.4 |
| multi-task-language-understanding-on-mmlu | UL2 20B (5-shot) | Average (%): 39.2 |
| natural-language-inference-on-rte | UL2 20B (0-shot) | Accuracy: 60.7% |
| natural-language-inference-on-rte | UL2 20B (fine-tuned) | Accuracy: 92.1% |
| question-answering-on-boolq | UL2 20B (0-shot) | Accuracy: 63.1 |
| question-answering-on-boolq | UL2 20B (fine-tuned) | Accuracy: 90.8 |
| question-answering-on-copa | UL2 20B (0-shot) | Accuracy: 85 |
| question-answering-on-copa | UL2 20B (fine-tuned) | Accuracy: 99 |
| word-sense-disambiguation-on-words-in-context | UL2 20B (fine-tuned) | Accuracy: 77.3 |
| word-sense-disambiguation-on-words-in-context | UL2 20B (0-shot) | Accuracy: 49.8 |