DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
Victor Sanh; Lysandre Debut; Julien Chaumond; Thomas Wolf

Abstract
As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models on the edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performance on a wide range of tasks, like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by 40% while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pre-training, we introduce a triple loss combining language modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train, and we demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device study.
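The abstract describes a triple training objective: a soft-target distillation loss against the teacher, a masked language modeling loss against the ground-truth tokens, and a cosine-distance loss aligning student and teacher hidden states. The sketch below is a hedged PyTorch illustration of how such a combined loss could be assembled; the function name, the loss weights (`alpha_ce`, `alpha_mlm`, `alpha_cos`) and the temperature value are illustrative assumptions, not the paper's exact implementation.

```python
# Hedged sketch of a triple distillation objective, assuming a PyTorch setting.
import torch
import torch.nn.functional as F

def distilbert_triple_loss(
    student_logits,   # (batch, seq_len, vocab) student MLM logits
    teacher_logits,   # (batch, seq_len, vocab) teacher MLM logits (computed under torch.no_grad())
    student_hidden,   # (batch, seq_len, dim) student final hidden states
    teacher_hidden,   # (batch, seq_len, dim) teacher final hidden states
    mlm_labels,       # (batch, seq_len) target token ids, -100 at unmasked positions
    temperature=2.0,  # assumed softmax temperature (illustrative)
    alpha_ce=5.0,     # assumed weight of the distillation loss (illustrative)
    alpha_mlm=2.0,    # assumed weight of the masked-LM loss (illustrative)
    alpha_cos=1.0,    # assumed weight of the cosine loss (illustrative)
):
    mask = mlm_labels != -100   # restrict the losses to masked positions

    # 1) Distillation loss: KL divergence between temperature-softened
    #    student and teacher distributions over the vocabulary.
    s = F.log_softmax(student_logits[mask] / temperature, dim=-1)
    t = F.softmax(teacher_logits[mask] / temperature, dim=-1)
    loss_ce = F.kl_div(s, t, reduction="batchmean") * temperature**2

    # 2) Masked language modeling loss against the ground-truth token ids.
    loss_mlm = F.cross_entropy(
        student_logits.reshape(-1, student_logits.size(-1)),
        mlm_labels.reshape(-1),
        ignore_index=-100,
    )

    # 3) Cosine embedding loss pulling student hidden states toward the teacher's.
    sh, th = student_hidden[mask], teacher_hidden[mask]
    ones = torch.ones(sh.size(0), device=sh.device)  # target +1: maximize cosine similarity
    loss_cos = F.cosine_embedding_loss(sh, th, ones)

    return alpha_ce * loss_ce + alpha_mlm * loss_mlm + alpha_cos * loss_cos
```

In this sketch the teacher logits and hidden states would come from a frozen BERT model run without gradients; because the student keeps the teacher's hidden dimension, the cosine term can compare the two hidden states directly, without a projection.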
Code Repositories
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| linguistic-acceptability-on-cola | DistilBERT 66M | Accuracy: 49.1% |
| natural-language-inference-on-qnli | DistilBERT 66M | Accuracy: 90.2% |
| natural-language-inference-on-rte | DistilBERT 66M | Accuracy: 62.9% |
| natural-language-inference-on-wnli | DistilBERT 66M | Accuracy: 44.4% |
| question-answering-on-multitq | DistilBERT | Hits@1: 8.3, Hits@10: 48.4 |
| question-answering-on-quora-question-pairs | DistilBERT 66M | Accuracy: 89.2% |
| question-answering-on-squad11-dev | DistilBERT 66M | EM: 77.7, F1: 85.8 |
| semantic-textual-similarity-on-mrpc | DistilBERT 66M | Accuracy: 90.2% |
| semantic-textual-similarity-on-sts-benchmark | DistilBERT 66M | Pearson Correlation: 0.907 |
| sentiment-analysis-on-imdb | DistilBERT 66M | Accuracy: 92.82% |
| sentiment-analysis-on-sst-2-binary | DistilBERT 66M | Accuracy: 91.3% |
| task-1-grouping-on-ocw | DistilBERT (BASE) | # Correct Groups: 49 ± 4; # Solved Walls: 0 ± 0; Adjusted Mutual Information (AMI): 14.0 ± 0.3; Adjusted Rand Index (ARI): 11.3 ± 0.3; Fowlkes Mallows Score (FMS): 29.1 ± 0.2; Wasserstein Distance (WD): 86.7 ± 0.6 |
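For reference, here is a minimal, hedged sketch of how a fine-tuning setup like the ones behind these numbers might be initialized with the Hugging Face `transformers` library. The `distilbert-base-uncased` checkpoint name and `num_labels=2` (an SST-2-style binary task) are assumptions, and the classification head below is freshly initialized rather than one of the fine-tuned models listed above.

```python
# Minimal sketch: load the pre-trained DistilBERT checkpoint with a
# (randomly initialized) binary classification head and run one forward pass.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2  # e.g. SST-2-style binary sentiment
)

inputs = tokenizer("DistilBERT is smaller, faster and lighter.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))  # class probabilities from the untrained head
```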