Pay Attention when Required

Swetha Mandava, Szymon Migacz, Alex Fit-Florea

Abstract

Transformer-based models consist of interleaved feed-forward blocks, which capture content meaning, and relatively more expensive self-attention blocks, which capture context meaning. In this paper, we explored trade-offs and the ordering of these blocks to improve upon the current Transformer architecture and proposed the PAR Transformer. It needs 35% lower compute time than Transformer-XL, achieved by replacing ~63% of the self-attention blocks with feed-forward blocks, while retaining perplexity on the WikiText-103 language modelling benchmark. We further validated our results on the text8 and enwiki8 datasets, as well as on the BERT model.
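
To make the idea concrete, the sketch below builds a Transformer stack from an explicit block pattern rather than strictly alternating self-attention and feed-forward blocks, so a PAR-style model can use far fewer attention blocks than feed-forward blocks. This is a minimal illustration under stated assumptions, not the authors' implementation: the class names, the pattern strings, and the hyperparameters (d_model, n_heads, d_ff) are illustrative choices only.

```python
# Minimal sketch of the PAR idea: compose a Transformer from an explicit
# block pattern instead of strictly interleaving attention and feed-forward
# blocks. All names and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn


class SelfAttentionBlock(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        # Pre-norm residual self-attention (context mixing across positions).
        h = self.norm(x)
        out, _ = self.attn(h, h, h, need_weights=False)
        return x + out


class FeedForwardBlock(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        # Position-wise feed-forward (content transformation, cheaper than attention).
        return x + self.ff(self.norm(x))


class PARStyleTransformer(nn.Module):
    """Stack blocks according to a pattern string, e.g. 'sffsffff'.

    's' = self-attention block, 'f' = feed-forward block. A baseline
    Transformer layer corresponds to repeating 'sf'; a PAR-style model
    keeps only a small fraction of 's' blocks.
    """

    def __init__(self, pattern, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        blocks = []
        for c in pattern:
            if c == "s":
                blocks.append(SelfAttentionBlock(d_model, n_heads))
            elif c == "f":
                blocks.append(FeedForwardBlock(d_model, d_ff))
            else:
                raise ValueError(f"unknown block type: {c}")
        self.blocks = nn.ModuleList(blocks)

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x


# Usage: a hypothetical 8-block stack with only 2 attention blocks, compared
# with a strictly interleaved baseline of the same depth.
x = torch.randn(4, 128, 512)            # (batch, sequence, d_model)
par_model = PARStyleTransformer("sffsffff")
baseline = PARStyleTransformer("sfsfsfsf")
print(par_model(x).shape, baseline(x).shape)
```

Because feed-forward blocks are cheaper than self-attention, shifting the block mix toward 'f' reduces compute; the paper's contribution is showing that such patterns can be chosen without hurting perplexity.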

Benchmarks

Benchmark | Methodology | Metrics
language-modelling-on-enwiki8-1 | PAR Transformer 24B | Bit per Character (BPC): 1.11
language-modelling-on-text8 | PAR Transformer 24B | Bit per Character (BPC): 1.18
language-modelling-on-wikitext-103 | PAR Transformer Base | Test perplexity: 22.7
language-modelling-on-wikitext-103 | PAR Transformer Large | Test perplexity: 18.4
sentiment-analysis-on-sst-2-binary | PAR BERT Base | Accuracy: 91.6
