
Abstract
Existing pre-trained models are typically designed for a particular class of problems, and to date there is still no consensus on what the right architecture and pre-training setup should be. This paper presents a unified framework for pre-trained models that is universally effective across datasets and setups. We begin by disentangling architectural archetypes from pre-training objectives, two concepts that are commonly conflated. Next, we present a generalized and unified perspective on self-supervision in NLP, showing how different pre-training objectives can be cast as one another and how interpolating between objectives can be effective. We then propose Mixture-of-Denoisers (MoD), a pre-training objective that combines diverse pre-training paradigms. We further introduce the notion of mode switching, wherein downstream fine-tuning is associated with a specific pre-training scheme. We conduct extensive ablation experiments comparing multiple pre-training objectives and find that our method pushes the Pareto frontier by outperforming T5- and GPT-like models across multiple diverse setups. By scaling the model up to 20 billion parameters, we achieve state-of-the-art performance on 50 well-established NLP tasks based on supervised fine-tuning. Our model also achieves strong results in in-context learning, outperforming the 175B-parameter GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization. On zero-shot MMLU, UL2 20B outperforms T0 and T5 models. UL2 20B also works well with chain-of-thought prompting and reasoning, making it an appealing choice for research on reasoning at a small-to-medium scale of 20B parameters. Finally, we apply FLAN instruction tuning to the UL2 20B model, achieving MMLU and Big-Bench scores competitive with FLAN-PaLM 62B. We release Flax-based T5X checkpoints for UL2 20B and Flan-UL2 20B.

Keywords: pre-trained models, architectural archetypes, pre-training objectives, self-supervision, Mixture-of-Denoisers (MoD), mode switching, Pareto frontier, natural language processing (NLP), in-context learning, SuperGLUE, MMLU, chain-of-thought prompting, reasoning, FLAN instruction tuning, Flax, T5X checkpoints
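The core idea behind MoD and mode switching can be illustrated in a few lines. Below is a minimal, self-contained Python sketch, not the paper's implementation: the denoiser names ([R], [S], [X]) and their rough characters (regular, sequential, extreme) follow the paper, while the mixture sampling, span-length distribution, and exact hyperparameters shown here are illustrative assumptions.

```python
import random

# Illustrative denoiser settings, loosely following the paper's description:
# [R] regular span corruption (short spans, low corruption rate),
# [X] extreme denoising (long spans and/or high corruption rate),
# [S] sequential denoising (prefix-LM style: corrupt the tail).
# Exact values and mixture weights here are assumptions, not the paper's.
DENOISERS = {
    "[R]": dict(mean_span=3, corrupt_rate=0.15, sequential=False),
    "[X]": dict(mean_span=32, corrupt_rate=0.50, sequential=False),
    "[S]": dict(mean_span=None, corrupt_rate=0.25, sequential=True),
}

def corrupt(tokens, mean_span, corrupt_rate, sequential):
    """Return (inputs, targets) with corrupted spans replaced by sentinels."""
    n = len(tokens)
    if sequential:
        # S-denoiser: mask a single suffix, as in prefix language modeling.
        split = max(1, int(n * (1 - corrupt_rate)))
        return (tokens[:split] + ["<extra_id_0>"],
                ["<extra_id_0>"] + tokens[split:])
    inputs, targets, i, sid = [], [], 0, 0
    while i < n:
        # Start a corrupted span with probability rate/mean_span so that
        # roughly corrupt_rate of all tokens end up masked.
        if random.random() < corrupt_rate / mean_span:
            span = max(1, round(random.expovariate(1 / mean_span)))
            sentinel = f"<extra_id_{sid}>"
            inputs.append(sentinel)
            targets += [sentinel] + tokens[i:i + span]
            i, sid = i + span, sid + 1
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets

def mod_example(tokens):
    """Sample one denoiser and prepend its paradigm token (mode switching)."""
    mode = random.choice(list(DENOISERS))
    inputs, targets = corrupt(tokens, **DENOISERS[mode])
    return [mode] + inputs, targets

print(mod_example("the quick brown fox jumps over the lazy dog".split()))
```

At fine-tuning or inference time, the same paradigm token is prepended to the input to "switch" the model into the matching pre-training scheme.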
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| arithmetic-reasoning-on-gsm8k | UL2 20B (chain-of-thought) | Accuracy: 4.4; Parameters (Billion): 20 |
| arithmetic-reasoning-on-gsm8k | UL2 20B (0-shot) | Accuracy: 4.1; Parameters (Billion): 20 |
| common-sense-reasoning-on-arc-challenge | UL2 20B (chain-of-thought + self-consistency) | Accuracy: 49.5 |
| common-sense-reasoning-on-arc-challenge | UL2 20B (zero-shot) | Accuracy: 29.8 |
| common-sense-reasoning-on-arc-challenge | UL2 20B (chain-of-thought) | Accuracy: 42.9 |
| common-sense-reasoning-on-arc-easy | UL2 20B (0-shot) | Accuracy: 32.2 |
| common-sense-reasoning-on-arc-easy | UL2 20B (chain-of-thought + self-consistency) | Accuracy: 69.8 |
| common-sense-reasoning-on-arc-easy | UL2 20B (chain-of-thought) | Accuracy: 38.4 |
| common-sense-reasoning-on-commonsenseqa | UL2 20B (chain-of-thought) | Accuracy: 51.4 |
| common-sense-reasoning-on-commonsenseqa | UL2 20B (zero-shot) | Accuracy: 34.2 |
| common-sense-reasoning-on-commonsenseqa | UL2 20B (chain-of-thought + self-consistency) | Accuracy: 55.7 |
| coreference-resolution-on-winograd-schema | UL2 20B (fine-tuned) | Accuracy: 98.1 |
| coreference-resolution-on-winograd-schema | UL2 20B (0-shot) | Accuracy: 79.9 |
| long-range-modeling-on-scrolls | UL2 20B | CNLI: 88.7 |
| long-range-modeling-on-scrolls | UL2 | Avg.: 37.87; GovRep: 53.6 / 26.1 / 28.8; Nrtv: 24.2; QALT EM-T/H: 45.8 / 40.7; QMSum: 31.1 / 8.5 / 20.4; Qspr: 37.6; SumScr: 32.9 / 7.8 / 19.4 |
| multi-task-language-understanding-on-mmlu | UL2 20B (5-shot) | Average (%): 39.2 |
| natural-language-inference-on-rte | UL2 20B (0-shot) | Accuracy: 60.7% |
| natural-language-inference-on-rte | UL2 20B (fine-tuned) | Accuracy: 92.1% |
| question-answering-on-boolq | UL2 20B (0-shot) | Accuracy: 63.1 |
| question-answering-on-boolq | UL2 20B (fine-tuned) | Accuracy: 90.8 |
| question-answering-on-copa | UL2 20B (0-shot) | Accuracy: 85 |
| question-answering-on-copa | UL2 20B (fine-tuned) | Accuracy: 99 |
| word-sense-disambiguation-on-words-in-context | UL2 20B (fine-tuned) | Accuracy: 77.3 |
| word-sense-disambiguation-on-words-in-context | UL2 20B (0-shot) | Accuracy: 49.8 |
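Several rows above use chain-of-thought prompting. The sketch below shows that style of prompting against the released checkpoint, assuming the Hugging Face conversion at "google/ul2" (the official release is the Flax-based T5X checkpoints). The "[S2S]" mode token follows the public model card, and the prompt wording and generation settings are illustrative assumptions rather than the paper's exact evaluation setup.

```python
# A minimal chain-of-thought prompting sketch. Assumptions: the Hugging Face
# conversion "google/ul2" exists on the Hub, and "[S2S]" selects the
# sequential (prefix-LM) mode per the model card. A 20B model needs roughly
# 40 GB of accelerator memory in bfloat16; device_map="auto" requires the
# `accelerate` package.
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/ul2")
model = T5ForConditionalGeneration.from_pretrained(
    "google/ul2", torch_dtype=torch.bfloat16, device_map="auto")

# GSM8K-style arithmetic question with a step-by-step cue, mirroring the
# chain-of-thought rows in the table above.
prompt = (
    "[S2S] Q: A robe takes 2 bolts of blue fiber and half that much white "
    "fiber. How many bolts in total does it take? "
    "A: Let's think step by step."
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```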