
DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome

Zhihan Zhou; Yanrong Ji; Weijian Li; Pratik Dutta; Ramana Davuluri; Han Liu


Abstract

Decoding the linguistic intricacies of the genome is a crucial problem in biology, and pre-trained foundation models such as DNABERT and Nucleotide Transformer have made significant strides in this area. Existing works have largely hinged on k-mers, fixed-length permutations of A, T, C, and G, as the tokens of the genome language, due to their simplicity. However, we argue that the computation and sample inefficiencies introduced by k-mer tokenization are primary obstacles in developing large genome foundation models. We provide conceptual and empirical insights into genome tokenization, building on which we propose to replace k-mer tokenization with Byte Pair Encoding (BPE), a statistics-based data compression algorithm that constructs tokens by iteratively merging the most frequently co-occurring genome segments in the corpus. We demonstrate that BPE not only overcomes the limitations of k-mer tokenization but also benefits from the computational efficiency of non-overlapping tokenization. Based on these insights, we introduce DNABERT-2, a refined genome foundation model that adopts an efficient tokenizer and employs multiple strategies to overcome input length constraints, reduce time and memory expenditure, and enhance model capability. Furthermore, we identify the absence of a comprehensive and standardized benchmark for genome understanding as another significant impediment to fair comparative analysis. In response, we propose the Genome Understanding Evaluation (GUE), a comprehensive multi-species genome classification dataset that amalgamates $36$ distinct datasets across $9$ tasks, with input lengths ranging from $70$ to $10000$. Through comprehensive experiments on the GUE benchmark, we demonstrate that DNABERT-2 achieves performance comparable to the state-of-the-art model with $21\times$ fewer parameters and approximately $92\times$ less GPU time in pre-training.
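As a concrete illustration of the merge procedure the abstract describes, below is a minimal, character-level sketch of BPE training on DNA strings. It is for intuition only and is not DNABERT-2's actual tokenizer (the paper trains BPE on the full multi-species pre-training corpus with a production implementation); the function name, toy corpus, and merge count here are all hypothetical.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Toy BPE over DNA strings: repeatedly merge the most frequent
    adjacent token pair across the corpus."""
    sequences = [list(seq) for seq in corpus]  # start from single nucleotides
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs across all sequences.
        pairs = Counter()
        for toks in sequences:
            pairs.update(zip(toks, toks[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        # Apply the merge everywhere, left to right, non-overlapping.
        for i, toks in enumerate(sequences):
            out, j = [], 0
            while j < len(toks):
                if j + 1 < len(toks) and (toks[j], toks[j + 1]) == best:
                    out.append(merged)
                    j += 2
                else:
                    out.append(toks[j])
                    j += 1
            sequences[i] = out
    return merges, sequences

merges, tokenized = train_bpe(["ATCGATCGTTATCG", "GGATCGATTT"], num_merges=5)
print(merges)     # learned merge rules, e.g. [('A', 'T'), ('AT', 'C'), ...]
print(tokenized)  # variable-length, non-overlapping tokens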

Code Repositories

magics-lab/dnabert_2 (official, PyTorch)
jimmylihui/genbench (PyTorch)
zhihan1996/dnabert_2 (official, PyTorch)
frederikkemarin/bend (PyTorch)
jimmylihui/OpenGenome (PyTorch)
jerryji1993/dnabert (PyTorch)
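
The released checkpoint can be loaded through HuggingFace Transformers. A minimal sketch, assuming the hub ID `zhihan1996/DNABERT-2-117M` used in the official repository's README; `trust_remote_code=True` is required because the checkpoint ships a custom model class.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Hub ID assumed from the official repository's README.
model_id = "zhihan1996/DNABERT-2-117M"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

dna = "ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC"
input_ids = tokenizer(dna, return_tensors="pt")["input_ids"]
hidden_states = model(input_ids)[0]  # shape: [1, num_tokens, 768]

# Mean-pool token embeddings into a single fixed-size sequence embedding.
embedding = torch.mean(hidden_states, dim=1)
print(embedding.shape)  # torch.Size([1, 768])
```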

Benchmarks

Benchmark                                       Methodology     Metric
core-promoter-detection-on-gue                  DNABERT-2-117M  MCC: 70.52
covid-variant-prediction-on-gue                 DNABERT-2-117M  Avg F1: 71.02
epigenetic-marks-prediction-on-gue              DNABERT-2-117M  MCC: 55.98
promoter-detection-on-gue                       DNABERT-2-117M  MCC: 84.21
splice-site-prediction-on-gue                   DNABERT-2-117M  MCC: 84.99
transcription-factor-binding-site-prediction    DNABERT-2-117M  MCC: 70.10
transcription-factor-binding-site-prediction-1  DNABERT-2-117M  MCC: 67.99
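
For reference, metrics like those in the table above can be computed from model predictions with scikit-learn. A minimal sketch with hypothetical labels and predictions (the arrays below are made up, not GUE data); treating "Avg F1" as macro-averaged F1 is an assumption.

```python
from sklearn.metrics import matthews_corrcoef, f1_score

# Hypothetical labels/predictions for a binary GUE-style task.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

mcc = matthews_corrcoef(y_true, y_pred)         # metric reported for most GUE tasks
f1 = f1_score(y_true, y_pred, average="macro")  # assumed reading of "Avg F1"
print(f"MCC: {mcc * 100:.2f}  F1: {f1 * 100:.2f}")
```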
