
Abstract

All-MLP (multi-layer perceptron) architectures have drawn increasing attention recently as an alternative to attention-based models. In NLP, recent work such as gMLP shows that all-MLPs can already match Transformers in language modeling, but still lag behind on downstream tasks. In this work, the authors analyze the limitations of MLPs in expressiveness and propose sparsely-activated MLPs with mixture-of-experts (MoE) in both the feature and input (token) dimensions. Such a sparse all-MLP significantly increases model capacity and expressiveness while keeping the compute constant. Two routing strategies are designed to address key challenges of incorporating conditional computation. Experiments show that the proposed sparse all-MLP improves language modeling perplexity and achieves up to 2x improvement in training efficiency compared to Transformer-based MoE models (GShard, Switch Transformer, Base Layers, and HASH Layers), as well as dense Transformers and all-MLPs. Finally, the model's zero-shot in-context learning performance is evaluated on six downstream tasks, where it surpasses both Transformer-based MoE models and dense Transformers.
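The paper defines its own routing strategies in detail; purely as a rough illustration of the general sparsely-activated MoE idea the abstract describes (all names, shapes, and the top-1 gating choice below are assumptions, not the paper's exact formulation), a switch-style MoE feed-forward layer over the feature dimension can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(0)

def moe_ffn(x, w_gate, experts):
    """Top-1 ("switch"-style) MoE feed-forward: each token is routed to a
    single expert MLP, so compute per token stays roughly constant while
    total parameter count grows with the number of experts.

    x:       (tokens, d_model) input activations
    w_gate:  (d_model, n_experts) router weights
    experts: list of (w_in, w_out) pairs, one small MLP per expert
    """
    logits = x @ w_gate                        # (tokens, n_experts)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    choice = probs.argmax(axis=-1)             # top-1 expert per token
    y = np.zeros_like(x)
    for e, (w_in, w_out) in enumerate(experts):
        idx = np.where(choice == e)[0]
        if idx.size == 0:
            continue                           # expert received no tokens
        h = np.maximum(x[idx] @ w_in, 0.0)     # expert MLP with ReLU
        # scale by the gate probability so routing stays differentiable
        y[idx] = probs[idx, e:e + 1] * (h @ w_out)
    return y

d_model, d_ff, n_experts, tokens = 8, 16, 4, 32
x = rng.standard_normal((tokens, d_model))
w_gate = rng.standard_normal((d_model, n_experts))
experts = [(rng.standard_normal((d_model, d_ff)),
            rng.standard_normal((d_ff, d_model))) for _ in range(n_experts)]
y = moe_ffn(x, w_gate, experts)
print(y.shape)  # (32, 8)
```

The sMLP model additionally applies this kind of expert routing in the token dimension (where its deterministic variant assigns chunks without a learned gate), which the single-gate sketch above does not capture.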
Benchmarks

| Benchmark | Method | Metric |
|---|---|---|
| common-sense-reasoning-on-record | Switch Transformer 9B (0-shot) | EM: 79.9 |
| common-sense-reasoning-on-record | Base Layers 10B (0-shot) | EM: 60.7 |
| common-sense-reasoning-on-record | Gshard 9B (0-shot) | EM: 72.4 |
| common-sense-reasoning-on-record | HASH Layers 10B (0-shot) | EM: 67.2 |
| common-sense-reasoning-on-record | sMLP – deterministic 9.4B (0-shot) | EM: 73.4 |
| common-sense-reasoning-on-winogrande | Switch Transformer 9B (0-shot) | Accuracy: 53.4 |
| common-sense-reasoning-on-winogrande | Base Layers 10B (0-shot) | Accuracy: 51 |
| common-sense-reasoning-on-winogrande | HASH Layers 10B (0-shot) | Accuracy: 51.7 |
| common-sense-reasoning-on-winogrande | sMLP – deterministic 9.4B (0-shot) | Accuracy: 54.3 |
| common-sense-reasoning-on-winogrande | Gshard 9B (0-shot) | Accuracy: 51.1 |
| question-answering-on-copa | HASH Layers 10B (0-shot) | Accuracy: 64 |
| question-answering-on-copa | Switch Transformer 9B (0-shot) | Accuracy: 75 |
| question-answering-on-copa | Base Layers 10B (0-shot) | Accuracy: 63 |
| question-answering-on-copa | Gshard 9B (0-shot) | Accuracy: 76 |
| question-answering-on-copa | sMLP – deterministic 9.4B (0-shot) | Accuracy: 79 |
| question-answering-on-piqa | HASH Layers 10B (0-shot) | Accuracy: 63.8 |
| question-answering-on-piqa | Base Layers 10B (0-shot) | Accuracy: 63.8 |
| question-answering-on-piqa | sMLP – deterministic 9.4B (0-shot) | Accuracy: 73 |
| question-answering-on-piqa | Gshard 9B (0-shot) | Accuracy: 68.1 |
| question-answering-on-storycloze | Switch Transformer 9B (0-shot) | Accuracy: 73.3 |
| question-answering-on-storycloze | Gshard 9B (0-shot) | Accuracy: 67.9 |
| question-answering-on-storycloze | sMLP – deterministic 9.4B (0-shot) | Accuracy: 74.7 |
| question-answering-on-storycloze | HASH Layers 10B (0-shot) | Accuracy: 64.7 |
| question-answering-on-storycloze | Base Layers 10B (0-shot) | Accuracy: 61.4 |