Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing

Yu Gu; Robert Tinn; Hao Cheng; Michael Lucas; Naoto Usuyama; Xiaodong Liu; Tristan Naumann; Jianfeng Gao; Hoifung Poon

Abstract

Pretraining large neural language models, such as BERT, has led to impressive gains on many natural language processing (NLP) tasks. However, most pretraining efforts focus on general domain corpora, such as newswire and Web. A prevailing assumption is that even domain-specific pretraining can benefit by starting from general-domain language models. In this paper, we challenge this assumption by showing that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains over continual pretraining of general-domain language models. To facilitate this investigation, we compile a comprehensive biomedical NLP benchmark from publicly-available datasets. Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks, leading to new state-of-the-art results across the board. Further, in conducting a thorough evaluation of modeling choices, both for pretraining and task-specific fine-tuning, we discover that some common practices are unnecessary with BERT models, such as using complex tagging schemes in named entity recognition (NER). To help accelerate research in biomedical NLP, we have released our state-of-the-art pretrained and task-specific models for the community, and created a leaderboard featuring our BLURB benchmark (short for Biomedical Language Understanding & Reasoning Benchmark) at https://aka.ms/BLURB.
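
To make the remark about tagging schemes concrete, here is a minimal sketch of collapsing a prefix-based scheme into a simpler one. The abstract does not name the schemes compared, so BIO and IO are used here only as common examples; the function name `bio_to_io`, the sentence, and the entity types are illustrative assumptions, not taken from the paper.

```python
def bio_to_io(tags):
    """Collapse BIO-style tags (also works for BIOES-style prefixes) to IO tags."""
    io_tags = []
    for tag in tags:
        if tag == "O":
            io_tags.append("O")
        else:
            # Drop the position prefix (B-, I-, E-, S-) and keep only the entity type.
            _, entity_type = tag.split("-", 1)
            io_tags.append("I-" + entity_type)
    return io_tags


# Hypothetical example sentence and labels, invented for illustration.
tokens = ["Mutations", "in", "BRCA1", "cause", "breast", "cancer"]
bio = ["O", "O", "B-Gene", "O", "B-Disease", "I-Disease"]
io = bio_to_io(bio)

for token, bio_tag, io_tag in zip(tokens, bio, io):
    print(f"{token:10s} {bio_tag:10s} {io_tag}")
```

The IO scheme cannot separate two adjacent entities of the same type, which is the trade-off a simpler scheme accepts in exchange for a smaller label set.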

Code Repositories

rohanshad/cmr_transformer

pytorch

Mentioned in GitHub

bionlu-coling2024/biomed-ner-intent_detection

pytorch

Mentioned in GitHub

Benchmarks

Benchmark	Methodology	Metrics
document-classification-on-hoc	PubMedBERT uncased	Micro F1: 82.32
drug-drug-interaction-extraction-on-ddi	PubMedBERT	F1: 0.8236; Micro F1: 82.36
named-entity-recognition-ner-on-jnlpba	PubMedBERT uncased	F1: 79.1
named-entity-recognition-ner-on-ncbi-disease	PubMedBERT uncased	F1: 87.82
named-entity-recognition-on-bc2gm	PubMedBERT uncased	F1: 84.52
participant-intervention-comparison-outcome	PubMedBERT uncased	F1: 73.38
pico-on-ebm-pico	PubMedBERT uncased	Macro F1 word level: 73.38
question-answering-on-bioasq	PubMedBERT uncased	Accuracy: 87.56
question-answering-on-blurb	PubMedBERT (uncased; abstracts)	Accuracy: 71.7
question-answering-on-pubmedqa	PubMedBERT uncased	Accuracy: 55.84
relation-extraction-on-chemprot	PubMedBERT uncased	Micro F1: 77.24
relation-extraction-on-ddi	PubMedBERT uncased	Micro F1: 82.36
relation-extraction-on-gad	PubMedBERT uncased	Micro F1: 82.34
text-classification-on-blurb	PubMedBERT (uncased; abstracts)	F1: 82.32
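
The table above mixes micro- and macro-averaged F1. As a rough sketch of how the two differ, the snippet below computes both from per-class counts. The counts, class names, and function names are made up for illustration and are not results from the paper; the exact evaluation protocol for each dataset may also differ from this simplified version.

```python
def f1(tp, fp, fn):
    """F1 from true-positive, false-positive, and false-negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(counts):
    """counts maps each class label to a (tp, fp, fn) tuple."""
    # Macro: average the per-class F1 scores, so every class weighs the same.
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    # Micro: pool the counts across classes first, so frequent classes dominate.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = f1(tp, fp, fn)
    return micro, macro


# Hypothetical counts: one frequent, well-predicted class and one rare, hard class.
counts = {"frequent": (90, 10, 10), "rare": (5, 5, 15)}
micro, macro = micro_macro_f1(counts)
print(f"micro F1 = {micro:.4f}, macro F1 = {macro:.4f}")
```

With these invented counts the micro average is noticeably higher than the macro average, because the rare class contributes few pooled counts but a full share of the macro mean.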
