A benchmark for toxic comment classification on Civil Comments dataset
Corentin Duchene; Henri Jamet; Pierre Guillaume; Reda Dehak

Abstract
Toxic comment detection on social media has proven to be essential for content moderation. This paper compares a wide set of different models on a highly skewed multi-label hate speech dataset. We consider inference time and several metrics to measure performance and bias in our comparison. We show that all BERT variants have similar performance regardless of the size, optimizations, or language used to pre-train the models. RNNs are much faster at inference than any of the BERT models. BiLSTM remains a good compromise between performance and inference time. RoBERTa with Focal Loss offers the best performance on bias metrics and AUROC. However, DistilBERT combines both a good AUROC and a low inference time. All models are affected by identity-association bias; BERT, RNN, and XLNet models are less sensitive to it than the CNN and Compact Convolutional Transformer models.
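The abstract highlights RoBERTa trained with Focal Loss as the strongest model on this skewed dataset. The paper's own training code is not reproduced here; the following is a minimal PyTorch sketch of a binary focal loss for multi-label toxicity tags, with illustrative `gamma` and `alpha` values that are not taken from the paper.

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Focal loss on top of BCE-with-logits for multi-label toxicity classification.

    logits, targets: (batch, num_labels) tensors, targets in {0, 1}.
    gamma down-weights easy examples; alpha rebalances positives vs. negatives.
    Both hyperparameter values are illustrative, not the paper's settings.
    """
    targets = targets.float()
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = targets * p + (1 - targets) * (1 - p)              # probability of the true class
    alpha_t = targets * alpha + (1 - targets) * (1 - alpha)  # per-label class weight
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()
```

Compared with plain binary cross-entropy (the "RoBERTa BCE" row below), the `(1 - p_t) ** gamma` factor shrinks the contribution of already well-classified examples, which matters when toxic labels are rare.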
Benchmarks
All results are reported on the toxic-comment-classification-on-civil benchmark (Civil Comments); "–" marks metrics not reported for that model.

| Model | AUROC | GMB Subgroup | GMB BPSN | GMB BNSP | Macro F1 | Micro F1 | Precision | Recall |
|---|---|---|---|---|---|---|---|---|
| BiLSTM | – | 0.8636 | – | – | – | 0.5115 | 0.3572 | – |
| Unfreeze Glove ResNet 44 | 0.966 | 0.8421 | 0.8493 | – | 0.4648 | 0.5958 | 0.4835 | 0.7759 |
| Compact Convolutional Transformer (CCT) | 0.9526 | 0.8133 | 0.8307 | 0.9447 | 0.3428 | 0.4874 | 0.3507 | 0.7983 |
| BiGRU | – | – | 0.8616 | – | – | – | – | – |
| Freeze Glove ResNet 44 | – | 0.8219 | 0.7876 | – | 0.4189 | 0.5591 | 0.4631 | 0.7053 |
| BERTweet | 0.979 | 0.878 | 0.8945 | 0.9603 | 0.3612 | 0.4928 | 0.3363 | 0.9216 |
| XLNet | – | 0.8689 | 0.8834 | 0.9597 | 0.3336 | 0.4586 | 0.3045 | 0.9254 |
| XLM RoBERTa | – | – | 0.8859 | – | – | 0.468 | 0.3135 | 0.923 |
| DistilBERT | 0.9804 | 0.8762 | 0.874 | 0.9644 | 0.3879 | 0.5115 | 0.3572 | 0.9001 |
| RoBERTa Focal Loss | 0.9818 | 0.8807 | 0.901 | 0.9581 | 0.4648 | 0.5524 | 0.4017 | 0.8839 |
| RoBERTa BCE | 0.9813 | 0.88 | 0.8901 | 0.9616 | 0.4749 | 0.5359 | 0.3836 | 0.8891 |
| Unfreeze Glove ResNet 56 | 0.9639 | 0.8487 | 0.8445 | – | 0.3778 | – | – | 0.8707 |
| HateBERT | 0.9791 | 0.8744 | 0.8915 | 0.9589 | 0.3679 | 0.4844 | 0.3297 | 0.9165 |
| AlBERT | 0.979 | 0.8734 | 0.8982 | 0.9499 | 0.3541 | 0.4845 | 0.3247 | 0.9104 |
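The GMB columns follow the Jigsaw "unintended bias" AUROC family: for each identity subgroup the Subgroup, BPSN, and BNSP AUROCs are computed, then each family is aggregated across identities with a generalized power mean. Below is a rough NumPy/scikit-learn sketch of that computation, assuming binary toxic labels, per-identity boolean masks, and the Jigsaw competition's p = -5 exponent; none of these details come from the paper's own code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def power_mean(values, p=-5.0):
    """Generalized (power) mean; p = -5 follows the Jigsaw unintended-bias metric."""
    values = np.asarray(values, dtype=float)
    return np.mean(values ** p) ** (1.0 / p)

def subgroup_auc(y_true, y_score, mask):
    """AUROC restricted to comments that mention a given identity."""
    return roc_auc_score(y_true[mask], y_score[mask])

def bpsn_auc(y_true, y_score, mask):
    """Background-positive / subgroup-negative AUROC: toxic comments outside the
    identity group vs. non-toxic comments inside it (low = over-flagging the identity)."""
    keep = (mask & (y_true == 0)) | (~mask & (y_true == 1))
    return roc_auc_score(y_true[keep], y_score[keep])

def bnsp_auc(y_true, y_score, mask):
    """Background-negative / subgroup-positive AUROC: non-toxic comments outside the
    identity group vs. toxic comments inside it (low = missing toxicity aimed at the identity)."""
    keep = (~mask & (y_true == 0)) | (mask & (y_true == 1))
    return roc_auc_score(y_true[keep], y_score[keep])

# Example aggregation for the "GMB Subgroup" column; identity_masks would map each
# identity term (e.g. "muslim", "female") to a boolean array over the test set.
# gmb_subgroup = power_mean(
#     [subgroup_auc(y_true, y_score, m) for m in identity_masks.values()]
# )
```

Because the power mean with a negative exponent is dominated by the worst-performing identity group, a high GMB value means the model holds up even on its weakest subgroup rather than only on average.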