DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
Victor Sanh; Lysandre Debut; Julien Chaumond; Thomas Wolf

Abstract
As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models on the edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performance on a wide range of tasks, like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by 40% while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pre-training, we introduce a triple loss combining language modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train, and we demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device study.
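The abstract describes a triple training objective: a soft-target distillation loss against the teacher, a masked language modeling loss against the ground-truth tokens, and a cosine-distance loss aligning student and teacher hidden states. The sketch below is a hedged PyTorch illustration of how such a combined loss could be assembled; the function name, the loss weights (`alpha_ce`, `alpha_mlm`, `alpha_cos`) and the temperature value are illustrative assumptions, not the paper's exact implementation.

```python
# Hedged sketch of a triple distillation objective, assuming a PyTorch setting.
import torch
import torch.nn.functional as F

def distilbert_triple_loss(
    student_logits,   # (batch, seq_len, vocab) student MLM logits
    teacher_logits,   # (batch, seq_len, vocab) teacher MLM logits (computed under torch.no_grad())
    student_hidden,   # (batch, seq_len, dim) student final hidden states
    teacher_hidden,   # (batch, seq_len, dim) teacher final hidden states
    mlm_labels,       # (batch, seq_len) target token ids, -100 at unmasked positions
    temperature=2.0,  # assumed softmax temperature (illustrative)
    alpha_ce=5.0,     # assumed weight of the distillation loss (illustrative)
    alpha_mlm=2.0,    # assumed weight of the masked-LM loss (illustrative)
    alpha_cos=1.0,    # assumed weight of the cosine loss (illustrative)
):
    mask = mlm_labels != -100   # restrict the losses to masked positions

    # 1) Distillation loss: KL divergence between temperature-softened
    #    student and teacher distributions over the vocabulary.
    s = F.log_softmax(student_logits[mask] / temperature, dim=-1)
    t = F.softmax(teacher_logits[mask] / temperature, dim=-1)
    loss_ce = F.kl_div(s, t, reduction="batchmean") * temperature**2

    # 2) Masked language modeling loss against the ground-truth token ids.
    loss_mlm = F.cross_entropy(
        student_logits.reshape(-1, student_logits.size(-1)),
        mlm_labels.reshape(-1),
        ignore_index=-100,
    )

    # 3) Cosine embedding loss pulling student hidden states toward the teacher's.
    sh, th = student_hidden[mask], teacher_hidden[mask]
    ones = torch.ones(sh.size(0), device=sh.device)  # target +1: maximize cosine similarity
    loss_cos = F.cosine_embedding_loss(sh, th, ones)

    return alpha_ce * loss_ce + alpha_mlm * loss_mlm + alpha_cos * loss_cos
```

In this sketch the teacher logits and hidden states would come from a frozen BERT model run without gradients; because the student keeps the teacher's hidden dimension, the cosine term can compare the two hidden states directly, without a projection.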
Code Repositories
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| linguistic-acceptability-on-cola | DistilBERT 66M | Accuracy: 49.1% |
| natural-language-inference-on-qnli | DistilBERT 66M | Accuracy: 90.2% |
| natural-language-inference-on-rte | DistilBERT 66M | Accuracy: 62.9% |
| natural-language-inference-on-wnli | DistilBERT 66M | Accuracy: 44.4% |
| question-answering-on-multitq | DistilBERT | Hits@1: 8.3, Hits@10: 48.4 |
| question-answering-on-quora-question-pairs | DistilBERT 66M | Accuracy: 89.2% |
| question-answering-on-squad11-dev | DistilBERT 66M | EM: 77.7, F1: 85.8 |
| semantic-textual-similarity-on-mrpc | DistilBERT 66M | Accuracy: 90.2% |
| semantic-textual-similarity-on-sts-benchmark | DistilBERT 66M | Pearson Correlation: 0.907 |
| sentiment-analysis-on-imdb | DistilBERT 66M | Accuracy: 92.82% |
| sentiment-analysis-on-sst-2-binary | DistilBERT 66M | Accuracy: 91.3% |
| task-1-grouping-on-ocw | DistilBERT (BASE) | # Correct Groups: 49 ± 4; # Solved Walls: 0 ± 0; Adjusted Mutual Information (AMI): 14.0 ± 0.3; Adjusted Rand Index (ARI): 11.3 ± 0.3; Fowlkes Mallows Score (FMS): 29.1 ± 0.2; Wasserstein Distance (WD): 86.7 ± 0.6 |
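For reference, here is a minimal, hedged sketch of how a fine-tuning setup like the ones behind these numbers might be initialized with the Hugging Face `transformers` library. The `distilbert-base-uncased` checkpoint name and `num_labels=2` (an SST-2-style binary task) are assumptions, and the classification head below is freshly initialized rather than one of the fine-tuned models listed above.

```python
# Minimal sketch: load the pre-trained DistilBERT checkpoint with a
# (randomly initialized) binary classification head and run one forward pass.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2  # e.g. SST-2-style binary sentiment
)

inputs = tokenizer("DistilBERT is smaller, faster and lighter.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))  # class probabilities from the untrained head
```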