LinkBERT: Pretraining Language Models with Document Links

Michihiro Yasunaga; Jure Leskovec; Percy Liang

Abstract

Language model (LM) pretraining can learn various kinds of knowledge from text corpora, helping downstream tasks. However, existing methods such as BERT model a single document and do not capture dependencies or knowledge that span across documents. In this work, we propose LinkBERT, an LM pretraining method that leverages links between documents, e.g., hyperlinks. Given a text corpus, we view it as a graph of documents and create LM inputs by placing linked documents in the same context. We then pretrain the LM with two joint self-supervised objectives: masked language modeling and our new proposal, document relation prediction. We show that LinkBERT outperforms BERT on various downstream tasks across two domains: the general domain (pretrained on Wikipedia with hyperlinks) and the biomedical domain (pretrained on PubMed with citation links). LinkBERT is especially effective for multi-hop reasoning and few-shot QA (+5% absolute improvement on HotpotQA and TriviaQA), and our biomedical LinkBERT sets new state-of-the-art results on various BioNLP tasks (+7% on BioASQ and USMLE). We release our pretrained models, LinkBERT and BioLinkBERT, as well as code and data at https://github.com/michiyasunaga/LinkBERT.
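
To make the two objectives concrete, below is a minimal sketch of how linked-document inputs and the joint masked language modeling (MLM) plus document relation prediction (DRP) losses could fit together. It assumes a BERT-style encoder with a Hugging Face-like output interface; helper names such as build_segment_pair and the Doc/graph structures are illustrative, not the authors' actual code.

```python
# Sketch of LinkBERT-style pretraining: pair an anchor segment with a
# contiguous, random, or hyperlinked segment, then train MLM + DRP jointly.
# (Illustrative only; assumes a BERT-style encoder.)
import random
import torch.nn as nn

DRP_LABELS = {"contiguous": 0, "random": 1, "linked": 2}  # document relation classes

def build_segment_pair(doc_graph, anchor_doc):
    """Choose segment B for anchor segment A from the document graph."""
    seg_a = anchor_doc.segments[0]
    relation = random.choice(list(DRP_LABELS))
    if relation == "contiguous":
        seg_b = anchor_doc.segments[1]                                   # next segment, same doc
    elif relation == "linked":
        seg_b = random.choice(doc_graph.neighbors(anchor_doc)).segments[0]  # hyperlinked doc
    else:
        seg_b = random.choice(doc_graph.docs).segments[0]                # random doc
    return seg_a, seg_b, DRP_LABELS[relation]

class LinkBertPretrainingHeads(nn.Module):
    """Joint masked-language-modeling and document-relation-prediction heads."""
    def __init__(self, encoder, hidden_size, vocab_size):
        super().__init__()
        self.encoder = encoder
        self.mlm_head = nn.Linear(hidden_size, vocab_size)
        self.drp_head = nn.Linear(hidden_size, len(DRP_LABELS))

    def forward(self, input_ids, attention_mask, mlm_labels, drp_labels):
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        mlm_loss = nn.functional.cross_entropy(
            self.mlm_head(hidden).transpose(1, 2), mlm_labels, ignore_index=-100)
        drp_loss = nn.functional.cross_entropy(
            self.drp_head(hidden[:, 0]), drp_labels)  # [CLS] token classifies the relation
        return mlm_loss + drp_loss                    # the two objectives are trained jointly
```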

Code Repositories

michiyasunaga/LinkBERT (official implementation, PyTorch; mentioned on GitHub)
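
As a quick start, the released checkpoints can be loaded through Hugging Face Transformers. The model ID below (michiyasunaga/BioLinkBERT-large) is assumed from the repository's naming and should be checked against its README; any of the LinkBERT/BioLinkBERT checkpoints can be substituted.

```python
# Load a released LinkBERT checkpoint and encode a sentence
# (model ID assumed from the repository's documentation).
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("michiyasunaga/BioLinkBERT-large")
model = AutoModel.from_pretrained("michiyasunaga/BioLinkBERT-large")

inputs = tokenizer("Aspirin inhibits platelet aggregation.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence_length, hidden_size)
```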

Benchmarks

Benchmark | Methodology | Metrics
document-classification-on-hoc | BioLinkBERT (large) | F1: 88.1, Micro F1: 84.87
medical-relation-extraction-on-ddi-extraction | BioLinkBERT (large) | F1: 83.35
named-entity-recognition-ner-on-bc5cdr | BioLinkBERT (large) | F1: 90.22
named-entity-recognition-ner-on-jnlpba | BioLinkBERT (large) | F1: 80.06
named-entity-recognition-ner-on-ncbi-disease | BioLinkBERT (large) | F1: 88.76
named-entity-recognition-on-bc2gm | BioLinkBERT (large) | F1: 85.18
named-entity-recognition-on-bc5cdr-chemical | BioLinkBERT (large) | F1: 94.04
named-entity-recognition-on-bc5cdr-disease | BioLinkBERT (large) | F1: 86.39
pico-on-ebm-pico | BioLinkBERT (base) | Macro F1 (word level): 73.97
pico-on-ebm-pico | BioLinkBERT (large) | Macro F1 (word level): 74.19
question-answering-on-bioasq | BioLinkBERT (base) | Accuracy: 91.4
question-answering-on-bioasq | BioLinkBERT (large) | Accuracy: 94.8
question-answering-on-blurb | BioLinkBERT (base) | Accuracy: 80.81
question-answering-on-blurb | BioLinkBERT (large) | Accuracy: 83.5
question-answering-on-medqa-usmle | BioLinkBERT (base) | Accuracy: 40.0
question-answering-on-mrqa-2019 | LinkBERT (large) | Average F1: 81.0
question-answering-on-newsqa | LinkBERT (large) | F1: 72.6
question-answering-on-pubmedqa | BioLinkBERT (base) | Accuracy: 70.2
question-answering-on-pubmedqa | BioLinkBERT (large) | Accuracy: 72.2
question-answering-on-squad11 | LinkBERT (large) | EM: 87.45, F1: 92.7
question-answering-on-triviaqa | LinkBERT (large) | F1: 78.2
relation-extraction-on-chemprot | BioLinkBERT (large) | F1: 79.98, Micro F1: 79.98
relation-extraction-on-ddi | BioLinkBERT (large) | F1: 83.35, Micro F1: 83.35
relation-extraction-on-gad | BioLinkBERT (large) | F1: 84.90, Micro F1: 84.90
semantic-similarity-on-biosses | BioLinkBERT (base) | Pearson Correlation: 0.9325
semantic-similarity-on-biosses | BioLinkBERT (large) | Pearson Correlation: 0.9363
text-classification-on-blurb | BioLinkBERT (base) | F1: 84.35
text-classification-on-blurb | BioLinkBERT (large) | F1: 84.88
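
For reference, the EM and F1 figures reported for the extractive QA rows above (e.g., SQuAD 1.1) follow the standard SQuAD-style evaluation: exact match of the predicted span and token-level overlap F1. A simplified sketch (omitting the usual article and punctuation normalization):

```python
# Simplified SQuAD-style EM and token-level F1 for extractive QA answers.
from collections import Counter

def exact_match(prediction: str, gold: str) -> float:
    return float(prediction.strip().lower() == gold.strip().lower())

def token_f1(prediction: str, gold: str) -> float:
    pred_toks, gold_toks = prediction.lower().split(), gold.lower().split()
    common = Counter(pred_toks) & Counter(gold_toks)   # shared tokens with multiplicity
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the Eiffel Tower", "Eiffel Tower"))  # 0.8
```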
