Q8BERT: Quantized 8Bit BERT
Ofir Zafrir, Guy Boudoukh, Peter Izsak, Moshe Wasserblat

Abstract
Recently, pre-trained Transformer-based language models such as BERT and GPT have shown great improvements in many Natural Language Processing (NLP) tasks. However, these models contain a large number of parameters, and the emergence of even larger and more accurate models such as GPT-2 and Megatron suggests a trend toward ever-larger pre-trained Transformer models. Using these large models in production environments is a complex task requiring large amounts of compute, memory, and power. In this work we show how to perform quantization-aware training during the fine-tuning phase of BERT in order to compress BERT by $4\times$ with minimal accuracy loss. Furthermore, the produced quantized model can accelerate inference when run on hardware that supports 8-bit integer operations.
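To make the idea concrete, below is a minimal sketch of quantization-aware training via "fake quantization" during fine-tuning: weights and activations are rounded to 8-bit integer levels in the forward pass, while gradients flow through a straight-through estimator. The names `FakeQuantLinear` and `fake_quantize` are illustrative rather than the paper's released code, and this simplified version derives the activation range from the current tensor's maximum instead of a running statistic.

```python
# Sketch of quantization-aware fine-tuning with fake quantization (PyTorch).
# Assumption: names and the simple per-tensor symmetric scheme are illustrative.
import torch
import torch.nn as nn


def fake_quantize(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Symmetric linear quantization to `num_bits` integer levels, then
    de-quantization back to float so the rest of the network stays in FP32."""
    qmax = 2 ** (num_bits - 1) - 1                       # 127 for 8 bits
    scale = x.detach().abs().max().clamp(min=1e-8) / qmax
    x_q = torch.clamp(torch.round(x / scale), -qmax, qmax) * scale
    # Straight-through estimator: forward pass uses x_q, backward uses identity.
    return x + (x_q - x).detach()


class FakeQuantLinear(nn.Linear):
    """nn.Linear whose weights and inputs are fake-quantized to 8 bits."""

    def forward(self, input: torch.Tensor) -> torch.Tensor:
        w_q = fake_quantize(self.weight)
        x_q = fake_quantize(input)
        return nn.functional.linear(x_q, w_q, self.bias)


# Usage sketch: replace the fully connected projections of the model being
# fine-tuned (e.g. BERT's attention and feed-forward layers) with
# FakeQuantLinear and fine-tune as usual; the resulting weights can then be
# exported as true INT8 for 8-bit-capable inference hardware.
layer = FakeQuantLinear(768, 768)
out = layer(torch.randn(4, 768))
```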
Benchmarks
| Benchmark | Model | Metric |
|---|---|---|
| Linguistic Acceptability on CoLA | Q8BERT (Zafrir et al., 2019) | Accuracy: 65.0 |
| Natural Language Inference on MultiNLI | Q8BERT (Zafrir et al., 2019) | Matched Accuracy: 85.6 |
| Natural Language Inference on QNLI | Q8BERT (Zafrir et al., 2019) | Accuracy: 93.0 |
| Natural Language Inference on RTE | Q8BERT (Zafrir et al., 2019) | Accuracy: 84.8 |
| Semantic Textual Similarity on MRPC | Q8BERT (Zafrir et al., 2019) | Accuracy: 89.7 |
| Semantic Textual Similarity on STS-B | Q8BERT (Zafrir et al., 2019) | Pearson Correlation: 0.911 |
| Sentiment Analysis on SST-2 (Binary) | Q8BERT (Zafrir et al., 2019) | Accuracy: 94.7 |