Multi-branch Attentive Transformer

Yang Fan, Shufang Xie, Yingce Xia, Lijun Wu, Tao Qin, Xiang-Yang Li, Tie-Yan Liu


Abstract

While the multi-branch architecture is one of the key ingredients in the success of many computer vision models, it has not been well investigated in natural language processing, especially for sequence learning tasks. In this work, we propose a simple yet effective variant of the Transformer called the multi-branch attentive Transformer (briefly, MAT), in which the attention layer is the average of multiple branches and each branch is an independent multi-head attention layer. We leverage two techniques to regularize training: drop-branch, which randomly drops individual branches during training, and proximal initialization, which uses a pre-trained Transformer model to initialize the multiple branches. Experiments on machine translation, code generation and natural language understanding demonstrate that this simple variant of the Transformer brings significant improvements. Our code is available at https://github.com/HA-Transformer.
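To make the described layer concrete, below is a minimal PyTorch sketch of the ideas in the abstract: several independent multi-head attention branches whose outputs are averaged, with drop-branch applied during training and a proximal-initialization helper that copies pre-trained attention weights into every branch. The class and function names (MultiBranchAttention, proximal_init), the branch count, and the exact drop-branch scheme (zeroing a branch and averaging) are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn


class MultiBranchAttention(nn.Module):
    """Average of independent multi-head attention branches (illustrative sketch)."""

    def __init__(self, embed_dim: int, num_heads: int,
                 num_branches: int = 2, drop_branch_p: float = 0.1):
        super().__init__()
        # Each branch is an independent multi-head attention layer.
        self.branches = nn.ModuleList(
            nn.MultiheadAttention(embed_dim, num_heads) for _ in range(num_branches)
        )
        self.drop_branch_p = drop_branch_p

    def forward(self, query, key, value):
        outputs = []
        for branch in self.branches:
            out, _ = branch(query, key, value)
            # Drop-branch: randomly drop individual branches while training.
            # (Zeroing the dropped branch and averaging is an assumed detail.)
            if self.training and torch.rand(()).item() < self.drop_branch_p:
                out = torch.zeros_like(out)
            outputs.append(out)
        # The attention layer output is the average over the branches.
        return torch.stack(outputs, dim=0).mean(dim=0)


def proximal_init(mba: MultiBranchAttention,
                  pretrained_attn: nn.MultiheadAttention) -> None:
    """Proximal initialization (sketch): copy weights of a pre-trained
    single-branch attention layer into every branch before fine-tuning."""
    for branch in mba.branches:
        branch.load_state_dict(pretrained_attn.state_dict())

As a usage example, one could replace each attention sublayer of a standard Transformer block with MultiBranchAttention and, if a pre-trained Transformer is available, call proximal_init on each layer before continuing training.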


Benchmarks

Benchmark                                       Methodology   Metrics
machine-translation-on-iwslt2014-german         MAT           BLEU score: 36.22
machine-translation-on-wmt2014-english-german   MAT           SacreBLEU: 29.9
