Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya

Abstract
Large Transformer models routinely achieve state-of-the-art results on a number of tasks but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of Transformers. For one, we replace dot-product attention by one that uses locality-sensitive hashing, changing its complexity from $O(L^2)$ to $O(L \log L)$, where $L$ is the length of the sequence. Furthermore, we use reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of $N$ times, where $N$ is the number of layers. The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences.
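Both techniques can be sketched in a few lines of NumPy. The first block below is a minimal, single-round illustration of LSH attention: queries and keys are shared (as in the paper), positions are hashed with random rotations, sorted by bucket, and attention is computed only within fixed-size chunks so no $L \times L$ score matrix is ever formed. The helper names (`lsh_hash`, `lsh_attention`) and the default hyperparameters are our own illustrative choices, not the paper's Trax implementation, which additionally uses multiple hash rounds, causal masking, and attention to the neighboring chunk.

```python
import numpy as np

def lsh_hash(x, n_buckets, rng):
    """Angular LSH: project vectors onto random directions and take the
    argmax over [xR; -xR], so nearby vectors tend to share a bucket."""
    d_model = x.shape[-1]
    random_rotations = rng.normal(size=(d_model, n_buckets // 2))
    rotated = x @ random_rotations                       # (seq_len, n_buckets // 2)
    return np.argmax(np.concatenate([rotated, -rotated], axis=-1), axis=-1)

def lsh_attention(qk, v, n_buckets=16, chunk_size=64, rng=None):
    """Single-round LSH attention sketch: hash, sort by bucket, attend
    within each chunk only, then undo the sort."""
    if rng is None:
        rng = np.random.default_rng(0)
    seq_len, d_model = qk.shape
    buckets = lsh_hash(qk, n_buckets, rng)
    order = np.argsort(buckets, kind="stable")           # group same-bucket positions
    qk_sorted, v_sorted = qk[order], v[order]

    out_sorted = np.zeros_like(v_sorted)
    for start in range(0, seq_len, chunk_size):
        sl = slice(start, start + chunk_size)
        q, k, val = qk_sorted[sl], qk_sorted[sl], v_sorted[sl]
        scores = q @ k.T / np.sqrt(d_model)              # (chunk, chunk), not (L, L)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out_sorted[sl] = weights @ val

    out = np.empty_like(out_sorted)
    out[order] = out_sorted                              # restore original positions
    return out

# Example usage with illustrative sizes.
rng = np.random.default_rng(42)
qk = rng.normal(size=(1024, 64))
v = rng.normal(size=(1024, 64))
out = lsh_attention(qk, v)
```

The reversible residual layers can be summarized even more briefly: each block's outputs determine its inputs exactly, so activations can be recomputed during the backward pass instead of being stored for every layer. The sketch below assumes arbitrary callables `F` and `G` standing in for the attention and feed-forward sublayers.

```python
def reversible_forward(x1, x2, F, G):
    """Reversible residual block: y1 = x1 + F(x2), y2 = x2 + G(y1)."""
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def reversible_invert(y1, y2, F, G):
    """Recover the block's inputs from its outputs, so per-layer
    activations need not be kept in memory during training."""
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2
```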
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| D4RL | Reformer | Average Reward: 63.9 |
| Image Generation on ImageNet 64x64 | Reformer (6 layers) | Bits per dim: 3.740 |
| Image Generation on ImageNet 64x64 | Reformer (12 layers) | Bits per dim: 3.710 |
| Language Modelling on WikiText-103 | Reformer 125M | Test perplexity: 26.0 |
| Open-Domain Question Answering on SearchQA | Locality-Sensitive Hashing | EM: 66.0 |
| Question Answering on Natural Questions (Long) | Locality-Sensitive Hashing | F1: 75.5 |
| Question Answering on Quasar-T | Locality-Sensitive Hashing | EM: 53.2 |