
Abstract
Language model (LM) pretraining can learn various kinds of knowledge from text corpora, which helps downstream tasks. However, existing methods such as BERT model only a single document at a time and fail to capture dependencies or knowledge that span across documents. In this work, we propose LinkBERT, an LM pretraining method that leverages links between documents (e.g., hyperlinks). Given a text corpus, we view it as a graph of documents and create LM inputs by placing linked documents in the same context. We then pretrain the LM with two joint self-supervised objectives: masked language modeling and our newly proposed document relation prediction. We show that LinkBERT outperforms BERT on various downstream tasks across two domains: the general domain (pretrained on Wikipedia with hyperlinks) and the biomedical domain (pretrained on PubMed with citation links). LinkBERT is especially effective for multi-hop reasoning and few-shot question answering (+5% absolute improvement on HotpotQA and TriviaQA), and our biomedical LinkBERT sets new states of the art on various BioNLP tasks (+7% on BioASQ and USMLE). We release our pretrained models, LinkBERT and BioLinkBERT, along with code and data, at https://github.com/michiyasunaga/LinkBERT.
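The input construction and the joint objective can be made concrete with a short sketch. The snippet below is a minimal illustration under stated assumptions, not the released training code: the document-graph dictionary format, the helper `sample_segment_pair`, the uniform sampling over the three segment sources, and the class `LinkBertPretrainer` are hypothetical names introduced for exposition. It shows document relation prediction (DRP) as a 3-way classification (contiguous, random, linked) over the pooled [CLS] representation, trained jointly with masked language modeling.

```python
import random
import torch.nn as nn
from transformers import BertForPreTraining

# 3-way DRP labels: is segment B contiguous to A, from a random
# document, or from a document linked to A's document?
RELATIONS = {"contiguous": 0, "random": 1, "linked": 2}

def sample_segment_pair(doc_graph, doc_id):
    """Draw (segment A, segment B, relation label) from a document graph.

    doc_graph: {doc_id: {"segments": [str, ...], "links": [doc_id, ...]}}
    (hypothetical format, for illustration only)
    """
    segs = doc_graph[doc_id]["segments"]
    i = random.randrange(max(len(segs) - 1, 1))
    seg_a = segs[i]
    choice = random.random()
    if choice < 1 / 3 and i + 1 < len(segs):      # contiguous segment
        return seg_a, segs[i + 1], RELATIONS["contiguous"]
    if choice < 2 / 3 and doc_graph[doc_id]["links"]:  # linked document
        nbr = random.choice(doc_graph[doc_id]["links"])
        return seg_a, random.choice(doc_graph[nbr]["segments"]), RELATIONS["linked"]
    other = random.choice([d for d in doc_graph if d != doc_id])  # random doc
    return seg_a, random.choice(doc_graph[other]["segments"]), RELATIONS["random"]


class LinkBertPretrainer(nn.Module):
    """Joint MLM + 3-way document relation prediction on a BERT encoder."""

    def __init__(self, name="bert-base-uncased"):
        super().__init__()
        base = BertForPreTraining.from_pretrained(name)
        self.bert = base.bert                  # shared Transformer encoder
        self.mlm_head = base.cls.predictions   # token-level MLM head
        self.drp_head = nn.Linear(base.config.hidden_size, len(RELATIONS))

    def forward(self, input_ids, attention_mask, token_type_ids,
                mlm_labels, drp_labels):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask,
                        token_type_ids=token_type_ids)
        mlm_logits = self.mlm_head(out.last_hidden_state)
        drp_logits = self.drp_head(out.pooler_output)  # pooled [CLS] state
        ce = nn.CrossEntropyLoss()  # default ignore_index=-100 skips unmasked tokens
        return (ce(mlm_logits.view(-1, mlm_logits.size(-1)), mlm_labels.view(-1))
                + ce(drp_logits, drp_labels))
```

The design point the sketch captures is that linked documents share one input context (segment A and segment B are packed into a single sequence), so masked language modeling can exploit cross-document evidence while the DRP head forces the model to distinguish linked from random document pairs.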
Code Repository
michiyasunaga/LinkBERT (official, PyTorch)
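For a quick start with the released checkpoints, the snippet below assumes the models are published on the HuggingFace Hub under the IDs `michiyasunaga/LinkBERT-large` and `michiyasunaga/BioLinkBERT-large`; check the repository README for the authoritative list of checkpoint names.

```python
from transformers import AutoModel, AutoTokenizer

# Assumed HuggingFace Hub ID; see the repository README for the
# exact names of the released LinkBERT / BioLinkBERT checkpoints.
MODEL_ID = "michiyasunaga/LinkBERT-large"  # or "michiyasunaga/BioLinkBERT-large"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

inputs = tokenizer("LinkBERT places linked documents in one context.",
                   return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, seq_len, hidden_size)
```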
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| document-classification-on-hoc | BioLinkBERT (large) | F1: 88.1; Micro F1: 84.87 |
| medical-relation-extraction-on-ddi-extraction | BioLinkBERT (large) | F1: 83.35 |
| named-entity-recognition-ner-on-bc5cdr | BioLinkBERT (large) | F1: 90.22 |
| named-entity-recognition-ner-on-jnlpba | BioLinkBERT (large) | F1: 80.06 |
| named-entity-recognition-ner-on-ncbi-disease | BioLinkBERT (large) | F1: 88.76 |
| named-entity-recognition-on-bc2gm | BioLinkBERT (large) | F1: 85.18 |
| named-entity-recognition-on-bc5cdr-chemical | BioLinkBERT (large) | F1: 94.04 |
| named-entity-recognition-on-bc5cdr-disease | BioLinkBERT (large) | F1: 86.39 |
| pico-on-ebm-pico | BioLinkBERT (base) | Macro F1 word level: 73.97 |
| pico-on-ebm-pico | BioLinkBERT (large) | Macro F1 word level: 74.19 |
| question-answering-on-bioasq | BioLinkBERT (base) | Accuracy: 91.4 |
| question-answering-on-bioasq | BioLinkBERT (large) | Accuracy: 94.8 |
| question-answering-on-blurb | BioLinkBERT (base) | Accuracy: 80.81 |
| question-answering-on-blurb | BioLinkBERT (large) | Accuracy: 83.5 |
| question-answering-on-medqa-usmle | BioLinkBERT (base) | Accuracy: 40.0 |
| question-answering-on-mrqa-2019 | LinkBERT (large) | Average F1: 81.0 |
| question-answering-on-newsqa | LinkBERT (large) | F1: 72.6 |
| question-answering-on-pubmedqa | BioLinkBERT (base) | Accuracy: 70.2 |
| question-answering-on-pubmedqa | BioLinkBERT (large) | Accuracy: 72.2 |
| question-answering-on-squad11 | LinkBERT (large) | EM: 87.45; F1: 92.7 |
| question-answering-on-triviaqa | LinkBERT (large) | F1: 78.2 |
| relation-extraction-on-chemprot | BioLinkBERT (large) | F1: 79.98; Micro F1: 79.98 |
| relation-extraction-on-ddi | BioLinkBERT (large) | F1: 83.35; Micro F1: 83.35 |
| relation-extraction-on-gad | BioLinkBERT (large) | F1: 84.90; Micro F1: 84.90 |
| semantic-similarity-on-biosses | BioLinkBERT (base) | Pearson Correlation: 0.9325 |
| semantic-similarity-on-biosses | BioLinkBERT (large) | Pearson Correlation: 0.9363 |
| text-classification-on-blurb | BioLinkBERT (base) | F1: 84.35 |
| text-classification-on-blurb | BioLinkBERT (large) | F1: 84.88 |