Carl Edwards; Tuan Lai; Kevin Ros; Garrett Honke; Kyunghyun Cho; Heng Ji

Abstract
We present MolT5, a self-supervised learning framework for pretraining models on a vast amount of unlabeled natural language text and molecule strings. MolT5 allows for new, useful, and challenging analogs of traditional vision-language tasks, such as molecule captioning and text-based de novo molecule generation (altogether: translation between molecules and language), which we explore for the first time. Because MolT5 pretrains models on single-modal data, it helps overcome data scarcity, a persistent shortcoming of the chemistry domain. Furthermore, we consider several metrics, including a new cross-modal embedding-based metric, to evaluate the tasks of molecule captioning and text-based molecule generation. Our results show that MolT5-based models are able to generate outputs, both molecules and captions, that are in many cases of high quality.
Benchmarks
molecule-captioning-on-chebi-20

| Model | BLEU-2 | BLEU-4 | METEOR | ROUGE-1 | ROUGE-2 | ROUGE-L | Text2Mol |
|---|---|---|---|---|---|---|---|
| MolT5-Small | 51.9 | 43.6 | 55.1 | 62.0 | 46.9 | 56.3 | 54.0 |
| MolT5-Base | 54.0 | 45.7 | 56.9 | 63.4 | 48.5 | 57.8 | 54.7 |
| MolT5-Large | 59.4 | 50.8 | 61.4 | 65.4 | 51.0 | 59.4 | 58.2 |

molecule-captioning-on-l-m-24

| Model | BLEU-2 | BLEU-4 | METEOR | ROUGE-1 | ROUGE-2 | ROUGE-L |
|---|---|---|---|---|---|---|
| MolT5-Small | 70.9 | 51.2 | 70.1 | 74.5 | 55.8 | 54.4 |
| MolT5-Base | 73.8 | 53.5 | 71.8 | 75.0 | 55.9 | 53.9 |
| MolT5-Large | 76.9 | 55.6 | 74.3 | 77.7 | 58.0 | 55.7 |

text-based-de-novo-molecule-generation-on

| Model | Parameters | BLEU | Exact Match | Levenshtein | MACCS FTS | RDK FTS | Morgan FTS | Frechet ChemNet Distance (FCD) | Text2Mol | Validity |
|---|---|---|---|---|---|---|---|---|---|---|
| MolT5-Small | 60M | 75.5 | 7.9 | 25.988 | 70.3 | 56.8 | 51.7 | 2.49 | 48.2 | 72.1 |
| MolT5-Base | 220M | 76.9 | 8.1 | 24.458 | 72.1 | 58.8 | 52.9 | 2.18 | 49.6 | 77.2 |
| MolT5-Large | 770M | 85.4 | 30.2 | 16.07 | 83.4 | 74.6 | 68.4 | 1.20 | 55.4 | 90.5 |
| MolT5-Large-HV | 770M | 81.0 | 31.4 | 16.758 | 87.2 | 78.6 | 72.2 | 0.44 | 59.0 | 99.6 |
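
Several of the molecule-generation metrics listed above compare a generated molecule string against a reference string. The sketch below is only an illustration of how such scores could be computed in plain Python, not the evaluation code behind these numbers. It assumes that Exact Match is a direct string comparison (a real evaluation may canonicalize molecules first), that Levenshtein is the mean character-level edit distance, and that a fingerprint Tanimoto similarity (FTS) operates on sets of "on" bit indices. The function names `levenshtein`, `tanimoto`, and `score` are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute each cost 1)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous: List[int] = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if char_a == char_b else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def tanimoto(bits_a: Set[int], bits_b: Set[int]) -> float:
    """Tanimoto similarity of two fingerprints given as sets of 'on' bit indices."""
    union = bits_a | bits_b
    if not union:
        return 1.0  # assumption: two empty fingerprints count as identical
    return len(bits_a & bits_b) / len(union)


def score(pairs: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Aggregate (reference, generated) string pairs into two summary scores.

    Assumption: exact match is reported as a percentage and Levenshtein
    as a mean distance per pair.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (reference, generated) pair")
    exact = sum(1 for reference, generated in pairs if reference == generated)
    total_edits = sum(levenshtein(reference, generated) for reference, generated in pairs)
    return {
        "exact_match": 100.0 * exact / len(pairs),
        "levenshtein": total_edits / len(pairs),
    }
```

Under these assumptions a higher Exact Match and a lower Levenshtein value both indicate generated strings closer to the references, which matches the direction of the trends in the table.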