Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT
Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W. Mahoney, Kurt Keutzer

Abstract
Transformer-based architectures have become the de facto models for a range of Natural Language Processing tasks. In particular, BERT-based models achieve significant accuracy gains on the GLUE tasks, CoNLL-03, and SQuAD. However, BERT-based models have a prohibitive memory footprint and latency, so deploying them in resource-constrained environments is challenging. In this work, we perform an extensive analysis of fine-tuned BERT models using second-order Hessian information, and we use our results to propose a novel method for quantizing BERT models to ultra-low precision. In particular, we propose a new group-wise quantization scheme, and we use a Hessian-based mixed-precision method to compress the model further. We extensively test our proposed method on the BERT downstream tasks SST-2, MNLI, CoNLL-03, and SQuAD. We achieve performance comparable to the baseline with at most $2.3\%$ degradation, even with ultra-low precision quantization down to 2 bits, corresponding to up to $13\times$ compression of the model parameters and up to $4\times$ compression of the embedding table as well as activations. Among all tasks, we observe the highest performance loss for BERT fine-tuned on SQuAD. Through the Hessian-based analysis as well as visualization, we show that this is related to the fact that the current training/fine-tuning strategy for BERT does not converge on SQuAD.
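To make the two ingredients concrete, below is a minimal PyTorch sketch, not the authors' released code: `quantize_groupwise` applies symmetric uniform quantization with a separate scale per group of rows, and `top_hessian_eigenvalue` estimates the top Hessian eigenvalue by power iteration over Hessian-vector products, the kind of sensitivity signal that can guide assigning lower bit-widths to less sensitive layers. The function names, the row-wise grouping, and the choice of a per-group max-abs scale are illustrative assumptions.

```python
# Illustrative sketch only; names and grouping convention are assumptions,
# not the Q-BERT reference implementation.
import torch


def quantize_groupwise(w: torch.Tensor, num_bits: int, num_groups: int) -> torch.Tensor:
    """Symmetric uniform quantization with one scale per group of rows.

    Assumes w.shape[0] is divisible by num_groups. A finer grouping gives
    each group its own dynamic range, which helps at ultra-low precision.
    """
    rows_per_group = w.shape[0] // num_groups
    qmax = 2 ** (num_bits - 1) - 1  # e.g. 1 for 2-bit, 7 for 4-bit
    out = torch.empty_like(w)
    for g in range(num_groups):
        chunk = w[g * rows_per_group:(g + 1) * rows_per_group]
        scale = chunk.abs().max() / qmax  # per-group scale, not per-matrix
        q = torch.clamp(torch.round(chunk / scale), -qmax, qmax)
        out[g * rows_per_group:(g + 1) * rows_per_group] = q * scale
    return out


def top_hessian_eigenvalue(loss: torch.Tensor, params: list, iters: int = 20) -> float:
    """Power iteration on the Hessian of `loss` w.r.t. `params`.

    Uses Hessian-vector products via double backprop, so the Hessian is
    never materialized. `params` must require grad.
    """
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    eig = 0.0
    for _ in range(iters):
        norm = torch.sqrt(sum((x * x).sum() for x in v))
        v = [x / norm for x in v]
        # Hessian-vector product: differentiate (grad . v) w.r.t. params.
        gv = sum((g * x).sum() for g, x in zip(grads, v))
        hv = torch.autograd.grad(gv, params, retain_graph=True)
        # Rayleigh quotient v^T H v (v is unit-norm here).
        eig = sum((h * x).sum() for h, x in zip(hv, v)).item()
        v = [h.detach() for h in hv]
    return eig
```

Layers whose loss landscape has large top eigenvalues are more sensitive to perturbation and would keep more bits, while flatter layers can be pushed toward 2-3 bits.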
Benchmarks
| Benchmark | Method | Metric |
|---|---|---|
| Linguistic Acceptability on CoLA | Q-BERT (Shen et al., 2020) | Accuracy: 65.1 |
| Natural Language Inference on MultiNLI | Q-BERT (Shen et al., 2020) | Matched: 87.8 |
| Natural Language Inference on QNLI | Q-BERT (Shen et al., 2020) | Accuracy: 93.0 |
| Natural Language Inference on RTE | Q-BERT (Shen et al., 2020) | Accuracy: 84.7 |
| Semantic Textual Similarity on MRPC | Q-BERT (Shen et al., 2020) | Accuracy: 88.2 |
| Semantic Textual Similarity on STS-Benchmark | Q-BERT (Shen et al., 2020) | Pearson Correlation: 0.911 |
| Sentiment Analysis on SST-2 (Binary) | Q-BERT (Shen et al., 2020) | Accuracy: 94.8 |