Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Timothy P. Lillicrap

Abstract
We present the Compressive Transformer, an attentive sequence model which compresses past memories for long-range sequence learning. We find the Compressive Transformer obtains state-of-the-art language modelling results in the WikiText-103 and Enwik8 benchmarks, achieving 17.1 ppl and 0.97 bpc respectively. We also find it can model high-frequency speech effectively and can be used as a memory mechanism for RL, demonstrated on an object matching task. To promote the domain of long-range sequence learning, we propose a new open-vocabulary language modelling benchmark derived from books, PG-19.
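The core idea is a two-level memory: a FIFO memory of recent activations, whose evicted entries are compressed (rather than discarded) into a longer-range compressed memory. The sketch below is a minimal, hypothetical illustration of that update step, not the paper's implementation: the function name `update_memories`, the tensor shapes, and the use of mean pooling as the compression function (one of several options the paper discusses) are all assumptions for illustration.

```python
import torch
import torch.nn.functional as F


def update_memories(memory, comp_memory, new_hidden, compression_rate=3):
    """Hypothetical sketch of a Compressive Transformer memory update.

    memory:      [mem_len, batch, d_model]  FIFO of recent hidden states
    comp_memory: [cmem_len, batch, d_model] compressed older memories
    new_hidden:  [seq_len, batch, d_model]  activations from the current segment
    """
    seq_len = new_hidden.size(0)

    # The oldest `seq_len` entries are evicted from the FIFO memory...
    evicted, kept = memory[:seq_len], memory[seq_len:]

    # ...and compressed by a factor `compression_rate`. Mean pooling is used
    # here as the simplest stand-in for the compression function.
    # [seq, batch, d] -> [batch, d, seq] so pooling runs over the time axis.
    compressed = F.avg_pool1d(
        evicted.permute(1, 2, 0),
        kernel_size=compression_rate,
        stride=compression_rate,
    ).permute(2, 0, 1)

    # New activations enter the FIFO memory; compressed chunks are appended
    # to the compressed memory (truncation to fixed lengths omitted).
    memory = torch.cat([kept, new_hidden], dim=0)
    comp_memory = torch.cat([comp_memory, compressed], dim=0)
    return memory, comp_memory
```

Attention at each layer then attends over the concatenation of the compressed memory, the FIFO memory, and the current segment, which is what extends the effective context beyond a plain Transformer-XL-style memory of the same size.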
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| language-modelling-on-enwiki8 | Compressive Transformer (24 layers) | Bits per character (BPC): 0.97; Parameters: 277M |
| language-modelling-on-hutter-prize | Compressive Transformer | Bits per character (BPC): 0.97 |
| language-modelling-on-wikitext-103 | Compressive Transformer (18L, M=1024) | Test perplexity: 17.1; Validation perplexity: 16.0 |
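Both metrics in the table are monotone transforms of the model's per-token cross-entropy loss: bits per character divides a per-character loss in nats by ln 2, and perplexity exponentiates a per-word loss. The snippet below shows the conversions; the specific loss values used are back-calculated from the reported 0.97 BPC and 17.1 perplexity and are illustrative only.

```python
import math


def bits_per_character(loss_nats: float) -> float:
    """Convert a per-character cross-entropy loss in nats to BPC."""
    return loss_nats / math.log(2)


def perplexity(loss_nats: float) -> float:
    """Convert a per-word cross-entropy loss in nats to perplexity."""
    return math.exp(loss_nats)


# ~0.672 nats/char corresponds to ~0.97 BPC (Enwik8);
# ~2.839 nats/word corresponds to ~17.1 perplexity (WikiText-103).
print(bits_per_character(0.672))  # ~0.97
print(perplexity(2.839))          # ~17.1
```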