
Abstract
Language model pretraining has led to significant performance gains, but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, and, as we will show, hyperparameter choices have a significant impact on the final results. We present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These results highlight the importance of previously overlooked design choices, and raise questions about the source of recently reported improvements. We release our models and code.
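The released checkpoints can be loaded through several of the repositories listed below. As a minimal sketch, this queries the pretrained model as a masked language model through the huggingface/transformers library; the `roberta-base` name is the checkpoint published on the Hugging Face hub, and the surrounding code is ordinary `transformers`/`torch` usage:

```python
# Minimal sketch: query the released RoBERTa checkpoint as a masked LM.
# Assumes `transformers` and `torch` are installed; "roberta-base" is the
# checkpoint name published on the Hugging Face hub.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")
model.eval()

# RoBERTa uses "<mask>" as its mask token.
text = "RoBERTa is pretrained with a masked <mask> modeling objective."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the masked position and read off the top-5 predictions.
mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
top_ids = logits[0, mask_pos].topk(5).indices[0]
print([tokenizer.decode(i).strip() for i in top_ids])
```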
Code Repositories
| Repository | Framework | Notes |
|---|---|---|
| hkuds/easyrec | pytorch | Mentioned in GitHub |
| SindhuMadi/FakeNewsDetection | | Mentioned in GitHub |
| expertailab/spaceqa | pytorch | Mentioned in GitHub |
| Karthik-Bhaskar/Context-Based-Question-Answering | tf | Mentioned in GitHub |
| lvyufeng/bert4ms | mindspore | |
| awslabs/mlm-scoring | mxnet | Mentioned in GitHub |
| haisongzhang/roberta-tiny-cased | | Mentioned in GitHub |
| common-english/bert-all | pytorch | Mentioned in GitHub |
| obi-ml-public/ehr_deidentification | | Mentioned in GitHub |
| pytorch/fairseq | pytorch | Official |
| benywon/ReCO | pytorch | Mentioned in GitHub |
| bluejurand/Kaggle_QA_Google_Labeling | tf | Mentioned in GitHub |
| UnknownGenie/altered-BERT-KPE | pytorch | Mentioned in GitHub |
| knuddj1/op_text | pytorch | Mentioned in GitHub |
| xiaoqian19940510/text-classification-surveys | pytorch | Mentioned in GitHub |
| znhy1024/protoco | pytorch | Mentioned in GitHub |
| CalumPerrio/WNUT-2020 | pytorch | Mentioned in GitHub |
| zfj1998/CodeBert-Code2Text | pytorch | Mentioned in GitHub |
| simon-benigeri/narrative-generation | pytorch | Mentioned in GitHub |
| dig-team/hanna-benchmark-asg | pytorch | Mentioned in GitHub |
| flexible-fl/flex-nlp | | Mentioned in GitHub |
| tighu20/Kaggle-Tweet-Sentiment-Extraction | tf | Mentioned in GitHub |
| salesforce/codet5 | pytorch | Mentioned in GitHub |
| ricaelum42/Contextual-Twitter-Sarcasm-Detection | pytorch | Mentioned in GitHub |
| musixmatchresearch/umberto | pytorch | Mentioned in GitHub |
| facebookresearch/anli | pytorch | Mentioned in GitHub |
| GeorgeLuImmortal/Hierarchical-BERT-Model-with-Limited-Labelled-Data | pytorch | Mentioned in GitHub |
| nguyenvulebinh/vietnamese-roberta | pytorch | Mentioned in GitHub |
| viethoang1512/kpa | pytorch | Mentioned in GitHub |
| knuddy/op_text | pytorch | Mentioned in GitHub |
| wzzzd/LM_NER | pytorch | Mentioned in GitHub |
| sdadas/polish-roberta | pytorch | Mentioned in GitHub |
| duanchi1230/NLP_Project_AI2_Reasoning_Challenge | pytorch | Mentioned in GitHub |
| Tencent/TurboTransformers | pytorch | Mentioned in GitHub |
| abdumaa/hiqualprop | pytorch | Mentioned in GitHub |
| devhemza/BERTweet_sentiment_analysis | pytorch | Mentioned in GitHub |
| eternityyw/tram-benchmark | | Mentioned in GitHub |
| huggingface/transformers | pytorch | Mentioned in GitHub |
| abhishekanand1710/noiseandbias | | Mentioned in GitHub |
| oneflow-inc/libai | | Mentioned in GitHub |
| bfopengradient/NLP_ROBERTA | | Mentioned in GitHub |
| clovaai/textual-kd-slu | pytorch | Mentioned in GitHub |
| xiaoqian19940510/text-classification- | pytorch | Mentioned in GitHub |
| aistairc/kirt_bert_on_abci | pytorch | Mentioned in GitHub |
| 2023-MindSpore-1/ms-code-163 | mindspore | |
| pisalore/roberta_results | pytorch | Mentioned in GitHub |
| bcaitech1/p2-klue-Heeseok-Jeong | pytorch | Mentioned in GitHub |
| G-4-R-Y/Tweet-Sentiment-Extraction | | Mentioned in GitHub |
| mthcom/hscore-dataset-pruning | pytorch | Mentioned in GitHub |
| MS-P3/code7/tree/main/xlm_roberta_xl | mindspore | |
| lashoun/hanna-benchmark-asg | pytorch | Mentioned in GitHub |
| kaushaltrivedi/fast-bert | pytorch | Mentioned in GitHub |
| octanove/shiba | pytorch | Mentioned in GitHub |
| traviscoan/cards | | Mentioned in GitHub |
| utterworks/fast-bert | pytorch | Mentioned in GitHub |
| zaradana/Fast_BERT | pytorch | Mentioned in GitHub |
| IndicoDataSolutions/finetune | tf | Mentioned in GitHub |
| brightmart/roberta_zh | tf | Mentioned in GitHub |
| few-shot-NER-benchmark/BaselineCode | pytorch | Mentioned in GitHub |
| ibm/vira-intent-discovery | | Mentioned in GitHub |
| blawok/named-entity-recognition | pytorch | Mentioned in GitHub |
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| common-sense-reasoning-on-commonsenseqa | RoBERTa-Large 355M | Accuracy: 72.1 |
| common-sense-reasoning-on-swag | RoBERTa | Test: 89.9 |
| document-image-classification-on-rvl-cdip | RoBERTa base | Accuracy: 90.06, Parameters: 125M |
| linguistic-acceptability-on-cola | RoBERTa (ensemble) | Accuracy: 67.8% |
| multi-task-language-understanding-on-mmlu | RoBERTa-base 125M (fine-tuned) | Average (%): 27.9 |
| natural-language-inference-on-anli-test | RoBERTa (Large) | A1: 72.4, A2: 49.8, A3: 44.4 |
| natural-language-inference-on-multinli | RoBERTa | Matched: 90.8 |
| natural-language-inference-on-multinli | RoBERTa (ensemble) | Mismatched: 90.2 |
| natural-language-inference-on-qnli | RoBERTa (ensemble) | Accuracy: 98.9% |
| natural-language-inference-on-rte | RoBERTa | Accuracy: 88.2% |
| natural-language-inference-on-rte | RoBERTa (ensemble) | Accuracy: 88.2% |
| natural-language-inference-on-wnli | RoBERTa (ensemble) | Accuracy: 89 |
| question-answering-on-piqa | RoBERTa-Large 355M | Accuracy: 79.4 |
| question-answering-on-quora-question-pairs | RoBERTa (ensemble) | Accuracy: 90.2% |
| question-answering-on-social-iqa | RoBERTa-Large 355M (fine-tuned) | Accuracy: 76.7 |
| question-answering-on-squad20 | RoBERTa (single model) | EM: 86.820, F1: 89.795 |
| question-answering-on-squad20-dev | RoBERTa (no data aug) | EM: 86.5, F1: 89.4 |
| reading-comprehension-on-race | RoBERTa | Accuracy: 83.2, Accuracy (High): 81.3, Accuracy (Middle): 86.5 |
| semantic-textual-similarity-on-mrpc | RoBERTa (ensemble) | Accuracy: 92.3% |
| semantic-textual-similarity-on-sts-benchmark | RoBERTa | Pearson Correlation: 0.922 |
| sentiment-analysis-on-sst-2-binary | RoBERTa (ensemble) | Accuracy: 96.7 |
| stock-market-prediction-on-astock | RoBERTa WWM Ext (News+Factors) | Accuracy: 62.49, F1-score: 62.54, Precision: 62.59, Recall: 62.51 |
| stock-market-prediction-on-astock | RoBERTa WWM Ext (News) | Accuracy: 61.34, F1-score: 61.48, Precision: 61.97, Recall: 61.32 |
| task-1-grouping-on-ocw | RoBERTa (LARGE) | # Correct Groups: 29 ± 3, # Solved Walls: 0 ± 0, Adjusted Mutual Information (AMI): 9.4 ± .4, Adjusted Rand Index (ARI): 8.4 ± .3, Fowlkes Mallows Score (FMS): 26.7 ± .2, Wasserstein Distance (WD): 88.4 ± .4 |
| text-classification-on-arxiv-10 | RoBERTa | Accuracy: 0.779 |
| type-prediction-on-manytypes4typescript | RoBERTa | Average Accuracy: 59.84, Average F1: 57.54, Average Precision: 57.45, Average Recall: 57.62 |
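The GLUE rows above come from fine-tuning the pretrained checkpoint once per task. Below is a hedged sketch of that recipe for a single task (SST-2), using the `transformers` Trainer and the `datasets` GLUE loader; the hyperparameter values are illustrative placeholders, not the paper's actual settings or search space:

```python
# Illustrative GLUE-style fine-tuning on SST-2 with the Trainer API.
# Assumes `transformers`, `datasets`, and `torch` are installed; the
# hyperparameters below are placeholders, not the paper's exact settings.
import numpy as np
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=2)

# SST-2 from the GLUE benchmark; fixed-length padding keeps the default
# data collator usable without a tokenizer-aware padding collator.
dataset = load_dataset("glue", "sst2")
encoded = dataset.map(
    lambda batch: tokenizer(batch["sentence"], truncation=True,
                            padding="max_length", max_length=128),
    batched=True,
)

def accuracy(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": float((np.argmax(logits, -1) == labels).mean())}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="roberta-sst2",
                           learning_rate=2e-5,  # placeholder values,
                           num_train_epochs=3,  # not the paper's sweep
                           per_device_train_batch_size=16),
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    compute_metrics=accuracy,
)
trainer.train()
print(trainer.evaluate())  # dev-set accuracy, comparable to the SST-2 row
```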