Sequence-Level Knowledge Distillation

Yoon Kim; Alexander M. Rush


Abstract

Neural machine translation (NMT) offers a novel alternative formulation of translation that is potentially simpler than statistical approaches. However, to reach competitive performance, NMT models need to be exceedingly large. In this paper we consider applying knowledge distillation approaches (Bucila et al., 2006; Hinton et al., 2015), which have proven successful for reducing the size of neural models in other domains, to the problem of NMT. We demonstrate that standard knowledge distillation applied to word-level prediction can be effective for NMT, and also introduce two novel sequence-level versions of knowledge distillation that further improve performance and, somewhat surprisingly, seem to eliminate the need for beam search (even when applied on the original teacher model). Our best student model runs 10 times faster than its state-of-the-art teacher with little loss in performance. It is also significantly better than a baseline model trained without knowledge distillation: by 4.2/1.7 BLEU with greedy decoding/beam search. Applying weight pruning on top of knowledge distillation results in a student model that has 13 times fewer parameters than the original teacher model, with a decrease of only 0.4 BLEU.
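The word-level distillation mentioned in the abstract trains the student to match the teacher's full output distribution at every time step, rather than only the one-hot reference token. The following is a minimal NumPy sketch of that idea; the function names, the temperature parameter, and the toy dimensions are illustrative assumptions, not taken from the paper (the sequence-level variants instead train the student on the teacher's beam-search outputs as hard targets).

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax over the vocabulary axis."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def word_kd_loss(student_logits, teacher_logits, temperature=1.0):
    """Word-level KD: cross-entropy of the student against the teacher's
    full next-token distribution at each time step, averaged over steps."""
    t_probs = softmax(teacher_logits / temperature)
    s_logp = np.log(softmax(student_logits / temperature))
    return float(-(t_probs * s_logp).sum(axis=-1).mean())

# Toy example: a sequence of 3 time steps over a vocabulary of 5 tokens.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(3, 5))
student = rng.normal(size=(3, 5))
loss = word_kd_loss(student, teacher)
```

By Gibbs' inequality the loss is minimized (equal to the teacher's entropy) when the student reproduces the teacher distribution exactly, which is what makes it a soft-target training signal.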

Code Repositories

harvardnlp/nmt-android (official, PyTorch)
harvardnlp/seq2seq-attn (official, PyTorch)
ictnlp/Seq-NAT (PyTorch)
facebookresearch/stopes (PyTorch)

Benchmarks

Benchmark: Machine Translation on IWSLT 2015 Thai-English
Methodology: Seq-KD + Seq-Inter + Word-KD
BLEU score: 14.2

Benchmark: Machine Translation on WMT 2014 English-German
Methodology: Seq-KD + Seq-Inter + Word-KD
BLEU score: 18.5
