MassSpecGym: A benchmark for the discovery and identification of molecules
Roman Bushuiev; Anton Bushuiev; Niek F. de Jonge; Adamo Young; Fleming Kretschmer; Raman Samusevich; Janne Heirman; Fei Wang; Luke Zhang; Kai Dührkop; Marcus Ludwig; Nils A. Haupt; Apurva Kalia; Corinna Brungs; Robin Schmid; Russell Greiner; Bo Wang; David S. Wishart; Li-Ping Liu; Juho Rousu; Wout Bittremieux; Hannes Rost; Tytus D. Mak; Soha Hassoun; Florian Huber; Justin J.J. van der Hooft; Michael A. Stravs; Sebastian Böcker; Josef Sivic; Tomáš Pluskal

Abstract
The discovery and identification of molecules in biological and environmental samples is crucial for advancing biomedical and chemical sciences. Tandem mass spectrometry (MS/MS) is the leading technique for high-throughput elucidation of molecular structures. However, decoding a molecular structure from its mass spectrum is exceptionally challenging, even when performed by human experts. As a result, the vast majority of acquired MS/MS spectra remain uninterpreted, thereby limiting our understanding of the underlying (bio)chemical processes. Despite decades of progress in machine learning applications for predicting molecular structures from MS/MS spectra, the development of new methods is severely hindered by the lack of standard datasets and evaluation protocols. To address this problem, we propose MassSpecGym, the first comprehensive benchmark for the discovery and identification of molecules from MS/MS data. Our benchmark comprises the largest publicly available collection of high-quality labeled MS/MS spectra and defines three MS/MS annotation challenges: de novo molecular structure generation, molecule retrieval, and spectrum simulation. It includes new evaluation metrics and a generalization-demanding data split, thereby standardizing the MS/MS annotation tasks and making the problem accessible to the broad machine learning community. MassSpecGym is publicly available at https://github.com/pluskal-lab/MassSpecGym.
Benchmarks
| Benchmark | Methodology | Top-1 Accuracy | Top-1 MCES | Top-1 Tanimoto | Top-10 Accuracy | Top-10 MCES | Top-10 Tanimoto |
|---|---|---|---|---|---|---|---|
| de-novo-molecule-generation-from-ms-ms | Random chemical generation | 0.00 | 28.59 | 0.07 | 0.00 | 25.72 | 0.10 |
| de-novo-molecule-generation-from-ms-ms | SELFIES Transformer | 0.00 | 33.28 | 0.10 | 0.00 | 21.84 | 0.15 |
| de-novo-molecule-generation-from-ms-ms | SMILES Transformer | 0.00 | 53.80 | 0.07 | 0.00 | 21.97 | 0.17 |
| de-novo-molecule-generation-from-ms-ms-1 | SMILES Transformer | 0.00 | 79.39 | 0.03 | 0.00 | 52.13 | 0.10 |
| de-novo-molecule-generation-from-ms-ms-1 | Random chemical generation | 0.00 | 21.11 | 0.08 | 0.00 | 18.25 | 0.11 |
| de-novo-molecule-generation-from-ms-ms-1 | SELFIES Transformer | 0.00 | 38.88 | 0.08 | 0.00 | 26.87 | 0.13 |

| Benchmark | Methodology | Hit rate @ 1 | Hit rate @ 5 | Hit rate @ 20 | MCES @ 1 |
|---|---|---|---|---|---|
| molecule-retrieval-from-ms-ms-spectrum-bonus | DeepSets | 4.42 | 14.46 | 30.76 | 15.04 |
| molecule-retrieval-from-ms-ms-spectrum-bonus | MIST | 9.57 | 22.11 | 41.12 | 12.75 |
| molecule-retrieval-from-ms-ms-spectrum-bonus | Random | 3.06 | 11.35 | 27.74 | 13.87 |
| molecule-retrieval-from-ms-ms-spectrum-bonus | DeepSets + Fourier features | 6.56 | 16.46 | 33.46 | 14.14 |
| molecule-retrieval-from-ms-ms-spectrum-bonus | Fingerprint FFN | 5.09 | 14.69 | 31.97 | 14.94 |
| molecule-retrieval-from-ms-ms-spectrum-on | DeepSets + Fourier features | 5.24 | 12.58 | 28.21 | 22.13 |
| molecule-retrieval-from-ms-ms-spectrum-on | Fingerprint FFN | 2.54 | 7.59 | 20.00 | 24.66 |
| molecule-retrieval-from-ms-ms-spectrum-on | MIST | 14.64 | 34.87 | 59.15 | 15.37 |
| molecule-retrieval-from-ms-ms-spectrum-on | DeepSets | 1.47 | 6.21 | 19.23 | 25.11 |
| molecule-retrieval-from-ms-ms-spectrum-on | Random | 0.37 | 2.01 | 8.22 | 30.81 |

| Benchmark | Methodology | Hit rate @ 1 | Hit rate @ 5 | Hit rate @ 20 |
|---|---|---|---|---|
| ms-ms-spectrum-simulation-bonus-chemical | Precursor m/z | 2.09 | 8.52 | 22.65 |
| ms-ms-spectrum-simulation-bonus-chemical | FFN Fingerprint | 7.62 | 22.70 | 44.12 |
| ms-ms-spectrum-simulation-bonus-chemical | FraGNNet | 31.93 | 63.20 | 82.70 |
| ms-ms-spectrum-simulation-bonus-chemical | GNN | 3.63 | 13.55 | 33.77 |

| Benchmark | Methodology | Cosine Similarity | Jensen-Shannon Similarity | Hit rate @ 1 | Hit rate @ 5 | Hit rate @ 20 |
|---|---|---|---|---|---|---|
| ms-ms-spectrum-simulation-on-massspecgym | GNN | 0.19 | 0.20 | 3.95 | 11.92 | 26.27 |
| ms-ms-spectrum-simulation-on-massspecgym | FFN Fingerprint | 0.25 | 0.24 | 8.44 | 21.43 | 38.57 |
| ms-ms-spectrum-simulation-on-massspecgym | Precursor m/z | 0.15 | 0.15 | 0.38 | 1.72 | 7.17 |
| ms-ms-spectrum-simulation-on-massspecgym | FraGNNet | 0.52 | 0.47 | 46.64 | 72.56 | 83.58 |
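
To make the metric names in the table concrete, the following is a minimal Python sketch of three of them: hit rate @ k, Tanimoto similarity, and cosine similarity. It is not the MassSpecGym implementation. The function names, the set-of-on-bits fingerprint representation, the dictionary-based binned spectrum, and the choice to express hit rate as a percentage are all assumptions made for illustration. MCES and Jensen-Shannon similarity are not covered here.

```python
from math import sqrt


def hit_rate_at_k(ranked_candidates, true_ids, k):
    """Percentage of queries whose true molecule is among the top-k candidates.

    ranked_candidates: one list per query, best candidate first (hypothetical format).
    true_ids: the correct candidate identifier for each query.
    """
    if len(ranked_candidates) != len(true_ids):
        raise ValueError("need exactly one true id per query")
    if not true_ids:
        raise ValueError("need at least one query")
    hits = sum(
        1 for ranked, truth in zip(ranked_candidates, true_ids) if truth in ranked[:k]
    )
    return 100.0 * hits / len(true_ids)


def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity of two fingerprints given as sets of on-bit indices."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    if not union:
        return 0.0  # assumed convention when both fingerprints are empty
    return len(a & b) / len(union)


def cosine_similarity(spec_a, spec_b):
    """Cosine similarity of two spectra already binned to the same m/z grid.

    Each spectrum is a dict mapping a bin index to an intensity (assumed format).
    """
    dot = sum(v * spec_b.get(bin_idx, 0.0) for bin_idx, v in spec_a.items())
    norm_a = sqrt(sum(v * v for v in spec_a.values()))
    norm_b = sqrt(sum(v * v for v in spec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumed convention for an empty spectrum
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy data: four queries, three ranked candidates each.
    ranked = [["a", "b", "c"], ["x", "y", "z"], ["m", "n", "o"], ["p", "q", "r"]]
    truth = ["a", "z", "n", "k"]
    for k in (1, 2, 3):
        print(f"hit rate @ {k}: {hit_rate_at_k(ranked, truth, k):.2f}")
    print("tanimoto:", tanimoto({1, 2, 3}, {2, 3, 4}))
    print("cosine:", cosine_similarity({100: 1.0, 150: 0.5}, {100: 1.0, 150: 0.5}))
```

Read this way, a hit rate @ 20 of 83.58 would mean that the correct molecule appears among the 20 top-ranked candidates for about 84% of test queries, provided the table values are indeed percentages.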