Anonymous
Abstract
We propose a novel method to sparsify attention in the Transformer model by learning to select the most-informative token representations during the training process, thus focusing on the task-specific parts of an input. The quadratic time and memory complexity is reduced to sublinear thanks to a robust trainable top-$k$ operator. Our experiments on a challenging long document summarization task show that even our simple baseline performs comparably to the current SOTA, and with trainable pooling we can retain its top quality, while being $1.8\times$ faster during training, $4.5\times$ faster during inference, and up to $13\times$ more computationally efficient in the decoder.
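The abstract describes selecting the most-informative tokens with a trainable top-$k$ operator but does not include code. Below is a minimal PyTorch sketch of one way such a pooling layer could be built; the class name `TopKTokenPooling`, the linear scorer, the sigmoid rescaling used to pass gradients through the hard selection, and all hyperparameters are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class TopKTokenPooling(nn.Module):
    """Keep the k highest-scoring token representations of a sequence.

    Hypothetical sketch: a linear scorer rates each token, the top-k tokens
    are selected, and the kept representations are scaled by sigmoid(score)
    so the scorer still receives gradients despite the hard selection.
    """

    def __init__(self, hidden_size: int, k: int):
        super().__init__()
        self.scorer = nn.Linear(hidden_size, 1)
        self.k = k

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size)
        scores = self.scorer(hidden_states).squeeze(-1)        # (batch, seq_len)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)    # hard top-k selection
        topk_idx, order = topk_idx.sort(dim=-1)                # restore original token order
        topk_scores = topk_scores.gather(-1, order)
        gathered = hidden_states.gather(
            1, topk_idx.unsqueeze(-1).expand(-1, -1, hidden_states.size(-1))
        )                                                      # (batch, k, hidden_size)
        # Rescale by sigmoid(score) so gradients flow back to the scorer.
        return gathered * torch.sigmoid(topk_scores).unsqueeze(-1)

# Usage: pool a 4096-token document down to 512 tokens before the decoder.
pooling = TopKTokenPooling(hidden_size=768, k=512)
x = torch.randn(2, 4096, 768)
print(pooling(x).shape)  # torch.Size([2, 512, 768])
```

Passing only the pooled tokens to the decoder is what yields the sublinear decoder cost mentioned in the abstract, since cross-attention then operates over $k$ rather than the full sequence length.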
Benchmarks
| Benchmark | Methodology | ROUGE-1 | ROUGE-2 |
|---|---|---|---|
| document-summarization-on-arxiv | DeepPyramidion | 47.15 | – |
| document-summarization-on-arxiv-summarization | DeepPyramidion | – | 19.99 |
| text-summarization-on-arxiv | DeepPyramidion | 47.15 | 19.99 |
| text-summarization-on-arxiv | Blockwise (baseline) | 46.85 | 19.39 |