Efficient Language Modeling with Sparse all-MLP

Ping Yu, Mikel Artetxe, Myle Ott, Sam Shleifer, Hongyu Gong, Ves Stoyanov, Xian Li

Abstract

All-MLP architectures have attracted increasing interest as an alternative to attention-based models. In NLP, recent work like gMLP shows that all-MLPs can match Transformers in language modeling, but still lag behind in downstream tasks. In this work, we analyze the limitations of MLPs in expressiveness, and propose sparsely activated MLPs with mixture-of-experts (MoEs) in both the feature and input (token) dimensions. Such sparse all-MLPs significantly increase model capacity and expressiveness while keeping the compute constant. We address critical challenges in incorporating conditional computation with two routing strategies. The proposed sparse all-MLP improves language modeling perplexity and obtains up to 2× improvement in training efficiency compared to both Transformer-based MoEs (GShard, Switch Transformer, Base Layers, and HASH Layers) as well as dense Transformers and all-MLPs. Finally, we evaluate its zero-shot in-context learning performance on six downstream tasks, and find that it surpasses Transformer-based MoEs and dense Transformers.
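The core mechanism the abstract describes — sparsely activating one MLP expert per token, so that model capacity grows with the number of experts while per-token compute stays roughly constant — can be illustrated with a minimal NumPy sketch of top-1 (switch-style) token routing. This is an illustrative sketch only, not the paper's implementation; all function names, shapes, and the gating scheme shown here are assumptions for exposition.

```python
import numpy as np

def top1_moe_routing(tokens, expert_weights, gate_weights):
    """Illustrative top-1 token routing over MLP experts.

    tokens:         (n_tokens, d_model) input activations
    expert_weights: list of (d_model, d_model) matrices, one per expert
    gate_weights:   (d_model, n_experts) learned router projection
    """
    # Router: softmax over expert logits, then pick one expert per token.
    logits = tokens @ gate_weights                        # (n_tokens, n_experts)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    choice = probs.argmax(axis=-1)                        # top-1 expert index per token

    out = np.zeros_like(tokens)
    for e, w in enumerate(expert_weights):
        mask = choice == e
        if not mask.any():
            continue
        # Each token passes through exactly one expert, so FLOPs per token
        # stay constant as the number of experts (total capacity) grows.
        out[mask] = (tokens[mask] @ w) * probs[mask, e:e + 1]
    return out
```

Real MoE layers add load-balancing losses and capacity limits so tokens spread evenly across experts; the "deterministic" sMLP variant in the benchmarks below instead assigns tokens without a learned gate, avoiding the load-balancing problem by construction.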

Benchmarks

| Benchmark | Method | Metric |
|---|---|---|
| Common Sense Reasoning on ReCoRD | Switch Transformer 9B | EM: 79.9 |
| Common Sense Reasoning on ReCoRD | Base Layers 10B (0-shot) | EM: 60.7 |
| Common Sense Reasoning on ReCoRD | GShard 9B | EM: 72.4 |
| Common Sense Reasoning on ReCoRD | HASH Layers 10B (0-shot) | EM: 67.2 |
| Common Sense Reasoning on ReCoRD | sMLP – deterministic 9.4B (0-shot) | EM: 73.4 |
| Common Sense Reasoning on WinoGrande | Switch Transformer 9B (0-shot) | Accuracy: 53.4 |
| Common Sense Reasoning on WinoGrande | Base Layers 10B (0-shot) | Accuracy: 51 |
| Common Sense Reasoning on WinoGrande | HASH Layers 10B (0-shot) | Accuracy: 51.7 |
| Common Sense Reasoning on WinoGrande | sMLP – deterministic 9.4B (0-shot) | Accuracy: 54.3 |
| Common Sense Reasoning on WinoGrande | GShard 9B (0-shot) | Accuracy: 51.1 |
| Question Answering on COPA | HASH Layers 10B (0-shot) | Accuracy: 64 |
| Question Answering on COPA | Switch Transformer 9B | Accuracy: 75 |
| Question Answering on COPA | Base Layers 10B (0-shot) | Accuracy: 63 |
| Question Answering on COPA | GShard 9B | Accuracy: 76 |
| Question Answering on COPA | sMLP – deterministic 9.4B (0-shot) | Accuracy: 79 |
| Question Answering on PIQA | HASH Layers 10B (0-shot) | Accuracy: 63.8 |
| Question Answering on PIQA | Base Layers 10B (0-shot) | Accuracy: 63.8 |
| Question Answering on PIQA | sMLP – deterministic 9.4B (0-shot) | Accuracy: 73 |
| Question Answering on PIQA | GShard 9B | Accuracy: 68.1 |
| Question Answering on StoryCloze | Switch Transformer 9B | Accuracy: 73.3 |
| Question Answering on StoryCloze | GShard 9B | Accuracy: 67.9 |
| Question Answering on StoryCloze | sMLP – deterministic 9.4B (0-shot) | Accuracy: 74.7 |
| Question Answering on StoryCloze | HASH Layers 10B (0-shot) | Accuracy: 64.7 |
| Question Answering on StoryCloze | Base Layers 10B (0-shot) | Accuracy: 61.4 |
