Adaptive Attention Span in Transformers

Sainbayar Sukhbaatar; Edouard Grave; Piotr Bojanowski; Armand Joulin

Abstract

We propose a novel self-attention mechanism that can learn its optimal attention span. This allows us to significantly extend the maximum context size used in Transformers while maintaining control over their memory footprint and computational time. We show the effectiveness of our approach on the task of character-level language modeling, where we achieve state-of-the-art performance on text8 and enwiki8 using a maximum context of 8k characters.
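
The core of the method is a soft masking function applied to each head's attention weights: a head learns a span parameter z, and a key at distance x from the query is weighted by m_z(x) = min(max((R + z - x)/R, 0), 1), where R controls how soft the edge of the mask is. Below is a minimal PyTorch sketch of this masking step, assuming post-softmax attention weights of shape (batch, heads, queries, keys); the names AdaptiveSpanMask, max_span, and ramp are illustrative, not the official API (see facebookresearch/adaptive-span for the reference implementation).

```python
import torch
import torch.nn as nn


class AdaptiveSpanMask(nn.Module):
    """Soft span mask m_z(x) = clamp((R + z - x) / R, 0, 1).

    z is a learnable span per head (parameterised here as a fraction of
    max_span), x is the distance between the query and a key, and R ("ramp")
    controls how soft the mask edge is. Names and defaults are illustrative;
    see facebookresearch/adaptive-span for the official code.
    """

    def __init__(self, n_heads: int, max_span: int, ramp: int = 32):
        super().__init__()
        self.max_span = max_span
        self.ramp = ramp
        # one learnable span fraction per head, initialised to 0
        self.span_frac = nn.Parameter(torch.zeros(n_heads, 1, 1))

    def forward(self, attn: torch.Tensor) -> torch.Tensor:
        # attn: softmax attention weights, shape (batch, heads, queries, keys),
        # where the last key position is the most recent one.
        key_len = attn.size(-1)
        # distance of each key from the current query (most recent key -> 1)
        distance = torch.arange(
            key_len, 0, -1, device=attn.device, dtype=attn.dtype
        )
        z = self.span_frac.clamp(0, 1) * self.max_span  # current span, per head
        mask = ((self.ramp + z - distance) / self.ramp).clamp(0, 1)
        attn = attn * mask  # broadcasts over batch and query dimensions
        # renormalise so each query's weights still sum to 1
        return attn / (attn.sum(dim=-1, keepdim=True) + 1e-8)

    def span_penalty(self) -> torch.Tensor:
        # L1 penalty on the learned spans, added to the LM loss to keep spans small
        return self.span_frac.clamp(0, 1).mean() * self.max_span
```

In the paper, an L1 penalty on the learned spans is added to the language-modeling loss, which is what allows most heads to keep short spans while only a few grow toward the 8k-character maximum.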

Code Repositories

prajjwal1/adaptive_transformer (PyTorch)
JoeRoussy/adaptive-attention-in-cv (PyTorch)
facebookresearch/adaptive-span (official, PyTorch)
prajjwal1/fluence (PyTorch)
ofirpress/sandwich_transformer (PyTorch)

Benchmarks

Benchmark | Methodology | Bit per Character (BPC) | Number of params
language-modelling-on-enwiki8 | Transformer (12 layers, 8k adaptive span) | 1.02 | 39M
language-modelling-on-enwiki8 | Transformer (24 layers, 8k adaptive span) | 0.98 | 209M
language-modelling-on-text8 | 12L Transformer + 8K adaptive span | 1.11 | 38M
language-modelling-on-text8 | 24L Transformer + 8K adaptive span | 1.07 | 209M
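
Bit per Character (BPC) in the table above is the character-level cross-entropy expressed in base 2. A minimal conversion from a mean loss in nats per character (the 0.707 figure below is purely illustrative):

```python
import math

def bits_per_character(nll_nats_per_char: float) -> float:
    """Convert mean cross-entropy in nats/char to bits/char (BPC)."""
    return nll_nats_per_char / math.log(2)

# illustrative: a mean loss of ~0.707 nats/char corresponds to ~1.02 BPC
print(round(bits_per_character(0.707), 2))  # 1.02
```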
