Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT
Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W. Mahoney, Kurt Keutzer

Abstract
Transformer-based architectures have become the de facto models for a range of Natural Language Processing tasks. In particular, BERT-based models achieve significant accuracy gains on the GLUE tasks, CoNLL-03, and SQuAD. However, BERT-based models have a prohibitive memory footprint and latency, so deploying them in resource-constrained environments is challenging. In this work, we perform an extensive analysis of fine-tuned BERT models using second-order Hessian information, and we use our results to propose a novel method for quantizing BERT models to ultra-low precision. In particular, we propose a new group-wise quantization scheme, and we use a Hessian-based mixed-precision method to compress the model further. We extensively test our proposed method on the BERT downstream tasks SST-2, MNLI, CoNLL-03, and SQuAD. We achieve performance comparable to the baseline with at most $2.3\%$ degradation, even with ultra-low precision quantization down to 2 bits, corresponding to up to $13\times$ compression of the model parameters and up to $4\times$ compression of the embedding table as well as activations. Among all tasks, we observe the highest performance loss for BERT fine-tuned on SQuAD. Through the Hessian-based analysis as well as visualization, we show that this is related to the fact that the current training/fine-tuning strategy for BERT does not converge on SQuAD.
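To make the two ingredients concrete, below is a minimal PyTorch sketch, not the authors' released code: `quantize_groupwise` applies symmetric uniform quantization with a separate scale per group of rows, and `top_hessian_eigenvalue` estimates the top Hessian eigenvalue by power iteration over Hessian-vector products, the kind of sensitivity signal that can guide assigning lower bit-widths to less sensitive layers. The function names, the row-wise grouping, and the choice of a per-group max-abs scale are illustrative assumptions.

```python
# Illustrative sketch only; names and grouping convention are assumptions,
# not the Q-BERT reference implementation.
import torch


def quantize_groupwise(w: torch.Tensor, num_bits: int, num_groups: int) -> torch.Tensor:
    """Symmetric uniform quantization with one scale per group of rows.

    Assumes w.shape[0] is divisible by num_groups. A finer grouping gives
    each group its own dynamic range, which helps at ultra-low precision.
    """
    rows_per_group = w.shape[0] // num_groups
    qmax = 2 ** (num_bits - 1) - 1  # e.g. 1 for 2-bit, 7 for 4-bit
    out = torch.empty_like(w)
    for g in range(num_groups):
        chunk = w[g * rows_per_group:(g + 1) * rows_per_group]
        scale = chunk.abs().max() / qmax  # per-group scale, not per-matrix
        q = torch.clamp(torch.round(chunk / scale), -qmax, qmax)
        out[g * rows_per_group:(g + 1) * rows_per_group] = q * scale
    return out


def top_hessian_eigenvalue(loss: torch.Tensor, params: list, iters: int = 20) -> float:
    """Power iteration on the Hessian of `loss` w.r.t. `params`.

    Uses Hessian-vector products via double backprop, so the Hessian is
    never materialized. `params` must require grad.
    """
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    eig = 0.0
    for _ in range(iters):
        norm = torch.sqrt(sum((x * x).sum() for x in v))
        v = [x / norm for x in v]
        # Hessian-vector product: differentiate (grad . v) w.r.t. params.
        gv = sum((g * x).sum() for g, x in zip(grads, v))
        hv = torch.autograd.grad(gv, params, retain_graph=True)
        # Rayleigh quotient v^T H v (v is unit-norm here).
        eig = sum((h * x).sum() for h, x in zip(hv, v)).item()
        v = [h.detach() for h in hv]
    return eig
```

Layers whose loss landscape has large top eigenvalues are more sensitive to perturbation and would keep more bits, while flatter layers can be pushed toward 2-3 bits.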
Benchmarks
| Benchmark | Method | Metric |
|---|---|---|
| Linguistic Acceptability on CoLA | Q-BERT (Shen et al., 2020) | Accuracy: 65.1 |
| Natural Language Inference on MultiNLI | Q-BERT (Shen et al., 2020) | Matched: 87.8 |
| Natural Language Inference on QNLI | Q-BERT (Shen et al., 2020) | Accuracy: 93.0 |
| Natural Language Inference on RTE | Q-BERT (Shen et al., 2020) | Accuracy: 84.7 |
| Semantic Textual Similarity on MRPC | Q-BERT (Shen et al., 2020) | Accuracy: 88.2 |
| Semantic Textual Similarity on STS-Benchmark | Q-BERT (Shen et al., 2020) | Pearson Correlation: 0.911 |
| Sentiment Analysis on SST-2 (Binary) | Q-BERT (Shen et al., 2020) | Accuracy: 94.8 |