5 months ago

MassSpecGym: A benchmark for the discovery and identification of molecules

Roman Bushuiev; Anton Bushuiev; Niek F. de Jonge; Adamo Young; Fleming Kretschmer; Raman Samusevich; Janne Heirman; Fei Wang; Luke Zhang; Kai Dührkop; Marcus Ludwig; Nils A. Haupt; Apurva Kalia; Corinna Brungs; Robin Schmid; Russell Greiner; Bo Wang; David S. Wishart; Li-Ping Liu; Juho Rousu; Wout Bittremieux; Hannes Rost; Tytus D. Mak; Soha Hassoun; Florian Huber; Justin J.J. van der Hooft; Michael A. Stravs; Sebastian Böcker; Josef Sivic; Tomáš Pluskal

Abstract

The discovery and identification of molecules in biological and environmental samples is crucial for advancing biomedical and chemical sciences. Tandem mass spectrometry (MS/MS) is the leading technique for high-throughput elucidation of molecular structures. However, decoding a molecular structure from its mass spectrum is exceptionally challenging, even when performed by human experts. As a result, the vast majority of acquired MS/MS spectra remain uninterpreted, thereby limiting our understanding of the underlying (bio)chemical processes. Despite decades of progress in machine learning applications for predicting molecular structures from MS/MS spectra, the development of new methods is severely hindered by the lack of standard datasets and evaluation protocols. To address this problem, we propose MassSpecGym -- the first comprehensive benchmark for the discovery and identification of molecules from MS/MS data. Our benchmark comprises the largest publicly available collection of high-quality labeled MS/MS spectra and defines three MS/MS annotation challenges: de novo molecular structure generation, molecule retrieval, and spectrum simulation. It includes new evaluation metrics and a generalization-demanding data split, therefore standardizing the MS/MS annotation tasks and rendering the problem accessible to the broad machine learning community. MassSpecGym is publicly available at https://github.com/pluskal-lab/MassSpecGym.
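
As an illustration of how the molecule retrieval challenge can be scored, the following is a minimal sketch of a hit rate at k: the share of query spectra whose true molecule appears among the k highest-ranked candidates. It is not taken from the MassSpecGym code base. The function name, the toy identifiers, and the choice to express the result as a percentage are assumptions made for this example.

```python
def hit_rate_at_k(ranked_candidates, true_ids, k):
    """Percentage of queries whose true molecule is among the top-k candidates.

    ranked_candidates: one list per query, sorted best-first (hypothetical IDs).
    true_ids: the correct molecule ID for each query.
    """
    if len(ranked_candidates) != len(true_ids):
        raise ValueError("need exactly one ranked candidate list per query")
    if not true_ids:
        raise ValueError("need at least one query")
    hits = sum(
        1
        for ranking, truth in zip(ranked_candidates, true_ids)
        if truth in ranking[:k]
    )
    return 100.0 * hits / len(true_ids)


# Toy data: three query spectra, each with candidates sorted best-first.
rankings = [
    ["mol_a", "mol_b", "mol_c"],  # true molecule ranked first
    ["mol_d", "mol_e", "mol_f"],  # true molecule ranked third
    ["mol_g", "mol_h", "mol_i"],  # true molecule not among the candidates
]
truths = ["mol_a", "mol_f", "mol_x"]

rate_at_1 = hit_rate_at_k(rankings, truths, 1)  # one of three queries hit
rate_at_3 = hit_rate_at_k(rankings, truths, 3)  # two of three queries hit
```

Because the metric depends only on the position of the true molecule in each ranking, it can compare any retrieval model that outputs a sorted candidate list, regardless of how the scores were produced.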

Code Repositories

pluskal-lab/massspecgym (official; PyTorch)

Benchmarks

Benchmark: de-novo-molecule-generation-from-ms-ms

| Method | Top-1 Accuracy | Top-1 MCES | Top-1 Tanimoto | Top-10 Accuracy | Top-10 MCES | Top-10 Tanimoto |
|---|---|---|---|---|---|---|
| Random chemical generation | 0.00 | 28.59 | 0.07 | 0.00 | 25.72 | 0.10 |
| SELFIES Transformer | 0.00 | 33.28 | 0.10 | 0.00 | 21.84 | 0.15 |
| SMILES Transformer | 0.00 | 53.80 | 0.07 | 0.00 | 21.97 | 0.17 |

Benchmark: de-novo-molecule-generation-from-ms-ms-1

| Method | Top-1 Accuracy | Top-1 MCES | Top-1 Tanimoto | Top-10 Accuracy | Top-10 MCES | Top-10 Tanimoto |
|---|---|---|---|---|---|---|
| SMILES Transformer | 0.00 | 79.39 | 0.03 | 0.00 | 52.13 | 0.10 |
| Random chemical generation | 0.00 | 21.11 | 0.08 | 0.00 | 18.25 | 0.11 |
| SELFIES Transformer | 0.00 | 38.88 | 0.08 | 0.00 | 26.87 | 0.13 |

Benchmark: molecule-retrieval-from-ms-ms-spectrum-bonus

| Method | Hit rate @ 1 | Hit rate @ 5 | Hit rate @ 20 | MCES @ 1 |
|---|---|---|---|---|
| DeepSets | 4.42 | 14.46 | 30.76 | 15.04 |
| MIST | 9.57 | 22.11 | 41.12 | 12.75 |
| Random | 3.06 | 11.35 | 27.74 | 13.87 |
| DeepSets + Fourier features | 6.56 | 16.46 | 33.46 | 14.14 |
| Fingerprint FFN | 5.09 | 14.69 | 31.97 | 14.94 |

Benchmark: molecule-retrieval-from-ms-ms-spectrum-on

| Method | Hit rate @ 1 | Hit rate @ 5 | Hit rate @ 20 | MCES @ 1 |
|---|---|---|---|---|
| DeepSets + Fourier features | 5.24 | 12.58 | 28.21 | 22.13 |
| Fingerprint FFN | 2.54 | 7.59 | 20.00 | 24.66 |
| MIST | 14.64 | 34.87 | 59.15 | 15.37 |
| DeepSets | 1.47 | 6.21 | 19.23 | 25.11 |
| Random | 0.37 | 2.01 | 8.22 | 30.81 |

Benchmark: ms-ms-spectrum-simulation-bonus-chemical

| Method | Hit Rate @ 1 | Hit Rate @ 5 | Hit Rate @ 20 |
|---|---|---|---|
| Precursor m/z | 2.09 | 8.52 | 22.65 |
| FFN Fingerprint | 7.62 | 22.70 | 44.12 |
| FraGNNet | 31.93 | 63.20 | 82.70 |
| GNN | 3.63 | 13.55 | 33.77 |

Benchmark: ms-ms-spectrum-simulation-on-massspecgym

| Method | Cosine Similarity | Jensen-Shannon Similarity | Hit Rate @ 1 | Hit Rate @ 5 | Hit Rate @ 20 |
|---|---|---|---|---|---|
| GNN | 0.19 | 0.20 | 3.95 | 11.92 | 26.27 |
| FFN Fingerprint | 0.25 | 0.24 | 8.44 | 21.43 | 38.57 |
| Precursor m/z | 0.15 | 0.15 | 0.38 | 1.72 | 7.17 |
| FraGNNet | 0.52 | 0.47 | 46.64 | 72.56 | 83.58 |
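
The spectrum simulation results above report cosine and Jensen-Shannon similarity between predicted and measured spectra. The sketch below shows one common way to compute such scores on binned spectra. It is an illustration under stated assumptions, not the benchmark's own implementation: the 1 Da bin width, the 1000 m/z upper limit, the definition of Jensen-Shannon similarity as one minus the base-2 Jensen-Shannon divergence, and all function names are choices made for this example.

```python
import math


def bin_spectrum(peaks, bin_width=1.0, max_mz=1000.0):
    """Sum peak intensities into fixed-width m/z bins (assumed binning scheme)."""
    n_bins = int(math.ceil(max_mz / bin_width))
    binned = [0.0] * n_bins
    for mz, intensity in peaks:
        if 0.0 <= mz < max_mz:
            binned[int(mz // bin_width)] += intensity
    return binned


def cosine_similarity(u, v):
    """Cosine of the angle between two intensity vectors; 0.0 if either is empty."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def jensen_shannon_similarity(u, v):
    """1 - Jensen-Shannon divergence (base-2 logs) of the normalised vectors."""
    total_u, total_v = sum(u), sum(v)
    if total_u == 0.0 or total_v == 0.0:
        return 0.0
    p = [a / total_u for a in u]
    q = [b / total_v for b in v]
    m = [(a + b) / 2.0 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing; y_i > 0 whenever x_i > 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0.0)

    return 1.0 - 0.5 * (kl(p, m) + kl(q, m))


# Toy spectra as (m/z, intensity) pairs; the values are invented.
predicted = bin_spectrum([(100.2, 50.0), (150.7, 100.0)])
measured = bin_spectrum([(100.4, 40.0), (150.1, 80.0), (200.0, 10.0)])

cos_score = cosine_similarity(predicted, measured)
js_score = jensen_shannon_similarity(predicted, measured)
```

With these definitions both scores lie between 0 and 1 for non-negative intensities: spectra with identical relative intensities score 1, and spectra with no shared bins score 0.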
