Stephen Merity; Nitish Shirish Keskar; Richard Socher

Abstract
Recurrent neural networks (RNNs), such as long short-term memory networks (LSTMs), serve as a fundamental building block for many sequence learning tasks, including machine translation, language modeling, and question answering. In this paper, we consider the specific problem of word-level language modeling and investigate strategies for regularizing and optimizing LSTM-based models. We propose the weight-dropped LSTM, which uses DropConnect on hidden-to-hidden weights as a form of recurrent regularization. Further, we introduce NT-ASGD, a variant of the averaged stochastic gradient method, wherein the averaging trigger is determined using a non-monotonic condition as opposed to being tuned by the user. Using these and other regularization strategies, we achieve state-of-the-art word-level perplexities on two data sets: 57.3 on Penn Treebank and 65.8 on WikiText-2. In exploring the effectiveness of a neural cache in conjunction with our proposed model, we achieve an even lower state-of-the-art perplexity of 52.8 on Penn Treebank and 52.0 on WikiText-2.
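The two ideas named in the abstract can be illustrated compactly. The first sketch below shows the weight-dropped idea: DropConnect zeroes individual entries of the hidden-to-hidden weight matrix (rather than activations), with one mask sampled per forward pass and reused across all timesteps of the recurrence. This is a minimal, illustrative cell written for clarity, not the authors' exact implementation; all class and parameter names here are our own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightDropLSTMCell(nn.Module):
    """Sketch of DropConnect on the hidden-to-hidden weights of one LSTM layer."""

    def __init__(self, input_size, hidden_size, weight_dropout=0.5):
        super().__init__()
        self.hidden_size = hidden_size
        self.weight_dropout = weight_dropout
        self.weight_ih = nn.Parameter(torch.randn(4 * hidden_size, input_size) * 0.1)
        self.weight_hh = nn.Parameter(torch.randn(4 * hidden_size, hidden_size) * 0.1)
        self.bias = nn.Parameter(torch.zeros(4 * hidden_size))

    def forward(self, x, state=None):
        # x: (batch, seq_len, input_size)
        batch, seq_len, _ = x.shape
        h = x.new_zeros(batch, self.hidden_size) if state is None else state[0]
        c = x.new_zeros(batch, self.hidden_size) if state is None else state[1]

        # DropConnect: drop individual hidden-to-hidden *weights*, not activations.
        # One mask per forward pass, shared across every timestep of the sequence.
        w_hh = F.dropout(self.weight_hh, p=self.weight_dropout, training=self.training)

        outputs = []
        for t in range(seq_len):
            gates = x[:, t] @ self.weight_ih.t() + h @ w_hh.t() + self.bias
            i, f, g, o = gates.chunk(4, dim=1)
            c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
            h = torch.sigmoid(o) * torch.tanh(c)
            outputs.append(h)
        return torch.stack(outputs, dim=1), (h, c)
```

The second sketch illustrates the NT-ASGD trigger: train with plain SGD, track a validation metric, and switch to averaged SGD once the metric stops improving relative to the best value seen at least `n` checks ago. The helper name and the choice of `n` are illustrative assumptions; the switch to `torch.optim.ASGD` is one convenient way to start Polyak-style averaging in PyTorch.

```python
import torch

def nt_asgd_should_average(val_losses, n=5):
    # Non-monotonic trigger (sketch): begin averaging when the latest validation
    # loss is no better than the best loss observed at least n evaluations ago.
    if len(val_losses) <= n:
        return False
    return val_losses[-1] > min(val_losses[:-n])

# Illustrative usage inside a training loop (model and lr are assumed to exist):
# optimizer = torch.optim.SGD(model.parameters(), lr=lr)
# val_losses = []
# for epoch in range(epochs):
#     train_one_epoch(model, optimizer)          # hypothetical helper
#     val_losses.append(evaluate(model))          # hypothetical helper
#     if isinstance(optimizer, torch.optim.SGD) and nt_asgd_should_average(val_losses):
#         optimizer = torch.optim.ASGD(model.parameters(), lr=lr, t0=0, lambd=0.0)
```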
Benchmarks
| Benchmark | Method | Params | Validation perplexity | Test perplexity |
|---|---|---|---|---|
| language-modelling-on-penn-treebank-word | AWD-LSTM + continuous cache pointer | 24M | 53.9 | 52.8 |
| language-modelling-on-penn-treebank-word | AWD-LSTM | 24M | 60.0 | 57.3 |
| language-modelling-on-wikitext-2 | AWD-LSTM + continuous cache pointer | 33M | 53.8 | 52.0 |
| language-modelling-on-wikitext-2 | AWD-LSTM | 33M | 68.6 | 65.8 |