3 months ago

Efficient Language Modeling with Sparse all-MLP

Ping Yu, Mikel Artetxe, Myle Ott, Sam Shleifer, Hongyu Gong, Ves Stoyanov, Xian Li

Abstract

All-MLP architectures have attracted increasing interest as an alternative to attention-based models. In NLP, recent work such as gMLP shows that all-MLPs can match Transformers in language modeling but still lag behind on downstream tasks. In this work, we analyze the limitations of MLPs in expressiveness and propose sparsely activated MLPs with mixture-of-experts (MoEs) in both the feature and input (token) dimensions. Such sparse all-MLPs significantly increase model capacity and expressiveness while keeping the compute constant. We address critical challenges in incorporating conditional computation with two routing strategies. The proposed sparse all-MLP improves language modeling perplexity and obtains up to a 2$\times$ improvement in training efficiency compared to Transformer-based MoEs (GShard, Switch Transformer, Base Layers, and HASH Layers) as well as dense Transformers and all-MLPs. Finally, we evaluate its zero-shot in-context learning performance on six downstream tasks and find that it surpasses Transformer-based MoEs and dense Transformers.
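To make the "more capacity at constant compute" claim concrete, here is a minimal sketch of a sparsely activated MLP layer. It uses generic top-1 learned gating for illustration and is not the paper's two routing strategies, which the abstract does not specify. All names (`top1_moe`, `expert_mlp`, `gate_w`, the toy sizes) are hypothetical.

```python
import math
import random

def matvec(w, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    e = [math.exp(s - m) for s in z]
    total = sum(e)
    return [s / total for s in e]

def expert_mlp(params, v):
    """Two-layer ReLU MLP: w2 @ relu(w1 @ v)."""
    w1, w2 = params
    h = [max(a, 0.0) for a in matvec(w1, v)]
    return matvec(w2, h)

def top1_moe(tokens, gate_w, experts):
    """Route each token to its single highest-scoring expert (assumed gating)."""
    outputs, choices = [], []
    for v in tokens:
        probs = softmax(matvec(gate_w, v))
        e = max(range(len(experts)), key=probs.__getitem__)
        y = expert_mlp(experts[e], v)  # only one expert runs per token
        outputs.append([probs[e] * a for a in y])
        choices.append(e)
    return outputs, choices

def rand_matrix(rng, rows, cols):
    """Small random matrix; values are arbitrary toy weights."""
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

# Toy sizes, chosen only for illustration.
rng = random.Random(0)
d, hidden, n_experts = 4, 8, 3
gate_w = rand_matrix(rng, n_experts, d)
experts = [(rand_matrix(rng, hidden, d), rand_matrix(rng, d, hidden))
           for _ in range(n_experts)]
tokens = rand_matrix(rng, 5, d)
outputs, choices = top1_moe(tokens, gate_w, experts)

# Expert parameters grow with the number of experts,
# but each token only touches one expert's weights.
total_params = n_experts * 2 * d * hidden
active_params = 2 * d * hidden
```

In this sketch, adding experts raises `total_params` linearly while `active_params` per token stays fixed, which is the sense in which a sparse layer adds capacity without adding per-token compute.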

Benchmarks

Benchmark	Methodology	Metrics
common-sense-reasoning-on-record	Switch Transformer 9B	EM: 79.9
common-sense-reasoning-on-record	Base Layers 10B (0-shot)	EM: 60.7
common-sense-reasoning-on-record	GShard 9B	EM: 72.4
common-sense-reasoning-on-record	HASH Layers 10B (0-shot)	EM: 67.2
common-sense-reasoning-on-record	sMLP – deterministic 9.4B (0-shot)	EM: 73.4
common-sense-reasoning-on-winogrande	Switch Transformer 9B (0-shot)	Accuracy: 53.4
common-sense-reasoning-on-winogrande	Base Layers 10B (0-shot)	Accuracy: 51
common-sense-reasoning-on-winogrande	HASH Layers 10B (0-shot)	Accuracy: 51.7
common-sense-reasoning-on-winogrande	sMLP – deterministic 9.4B (0-shot)	Accuracy: 54.3
common-sense-reasoning-on-winogrande	GShard 9B (0-shot)	Accuracy: 51.1
question-answering-on-copa	HASH Layers 10B (0-shot)	Accuracy: 64
question-answering-on-copa	Switch Transformer 9B	Accuracy: 75
question-answering-on-copa	Base Layers 10B (0-shot)	Accuracy: 63
question-answering-on-copa	GShard 9B	Accuracy: 76
question-answering-on-copa	sMLP – deterministic 9.4B (0-shot)	Accuracy: 79
question-answering-on-piqa	HASH Layers 10B (0-shot)	Accuracy: 63.8
question-answering-on-piqa	Base Layers 10B (0-shot)	Accuracy: 63.8
question-answering-on-piqa	sMLP – deterministic 9.4B (0-shot)	Accuracy: 73
question-answering-on-piqa	GShard 9B	Accuracy: 68.1
question-answering-on-storycloze	Switch Transformer 9B	Accuracy: 73.3
question-answering-on-storycloze	GShard 9B	Accuracy: 67.9
question-answering-on-storycloze	sMLP – deterministic 9.4B (0-shot)	Accuracy: 74.7
question-answering-on-storycloze	HASH Layers 10B (0-shot)	Accuracy: 64.7
question-answering-on-storycloze	Base Layers 10B (0-shot)	Accuracy: 61.4
