
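Abstract
The discovery and identification of molecules in biological and environmental samples is crucial for advancing the biomedical and chemical sciences. Tandem mass spectrometry (MS/MS) is the leading technique for high-throughput elucidation of molecular structures. However, decoding a molecular structure from its mass spectrum is exceptionally difficult, even for human experts. As a result, the vast majority of acquired MS/MS spectra remain uninterpreted, which limits our understanding of the underlying (bio)chemical processes. Despite decades of progress in applying machine learning to predict molecular structures from MS/MS spectra, the development of new methods is severely hindered by the lack of standard datasets and evaluation protocols. To address this problem, we propose MassSpecGym, the first comprehensive benchmark for the discovery and identification of molecules from MS/MS data. The benchmark comprises the largest publicly available collection of high-quality labeled MS/MS spectra and defines three MS/MS annotation challenges: de novo molecular structure generation, molecule retrieval, and spectrum simulation. It introduces new evaluation metrics and a generalization-demanding data split, thereby standardizing the MS/MS annotation tasks and making the problem accessible to the broad machine learning community. MassSpecGym is publicly available at https://github.com/pluskal-lab/MassSpecGym.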
Code repository
pluskal-lab/massspecgym
Official
pytorch
Mentioned in GitHub
Benchmarks
| Benchmark | Method | Top-1 Accuracy | Top-1 MCES | Top-1 Tanimoto | Top-10 Accuracy | Top-10 MCES | Top-10 Tanimoto |
|---|---|---|---|---|---|---|---|
| de-novo-molecule-generation-from-ms-ms | Random chemical generation | 0.00 | 28.59 | 0.07 | 0.00 | 25.72 | 0.10 |
| de-novo-molecule-generation-from-ms-ms | SELFIES Transformer | 0.00 | 33.28 | 0.10 | 0.00 | 21.84 | 0.15 |
| de-novo-molecule-generation-from-ms-ms | SMILES Transformer | 0.00 | 53.80 | 0.07 | 0.00 | 21.97 | 0.17 |
| de-novo-molecule-generation-from-ms-ms-1 | SMILES Transformer | 0.00 | 79.39 | 0.03 | 0.00 | 52.13 | 0.10 |
| de-novo-molecule-generation-from-ms-ms-1 | Random chemical generation | 0.00 | 21.11 | 0.08 | 0.00 | 18.25 | 0.11 |
| de-novo-molecule-generation-from-ms-ms-1 | SELFIES Transformer | 0.00 | 38.88 | 0.08 | 0.00 | 26.87 | 0.13 |

| Benchmark | Method | Hit rate @ 1 | Hit rate @ 5 | Hit rate @ 20 | MCES @ 1 |
|---|---|---|---|---|---|
| molecule-retrieval-from-ms-ms-spectrum-bonus | DeepSets | 4.42 | 14.46 | 30.76 | 15.04 |
| molecule-retrieval-from-ms-ms-spectrum-bonus | MIST | 9.57 | 22.11 | 41.12 | 12.75 |
| molecule-retrieval-from-ms-ms-spectrum-bonus | Random | 3.06 | 11.35 | 27.74 | 13.87 |
| molecule-retrieval-from-ms-ms-spectrum-bonus | DeepSets + Fourier features | 6.56 | 16.46 | 33.46 | 14.14 |
| molecule-retrieval-from-ms-ms-spectrum-bonus | Fingerprint FFN | 5.09 | 14.69 | 31.97 | 14.94 |
| molecule-retrieval-from-ms-ms-spectrum-on | DeepSets + Fourier features | 5.24 | 12.58 | 28.21 | 22.13 |
| molecule-retrieval-from-ms-ms-spectrum-on | Fingerprint FFN | 2.54 | 7.59 | 20.00 | 24.66 |
| molecule-retrieval-from-ms-ms-spectrum-on | MIST | 14.64 | 34.87 | 59.15 | 15.37 |
| molecule-retrieval-from-ms-ms-spectrum-on | DeepSets | 1.47 | 6.21 | 19.23 | 25.11 |
| molecule-retrieval-from-ms-ms-spectrum-on | Random | 0.37 | 2.01 | 8.22 | 30.81 |

| Benchmark | Method | Hit Rate @ 1 | Hit Rate @ 5 | Hit Rate @ 20 |
|---|---|---|---|---|
| ms-ms-spectrum-simulation-bonus-chemical | Precursor m/z | 2.09 | 8.52 | 22.65 |
| ms-ms-spectrum-simulation-bonus-chemical | FFN Fingerprint | 7.62 | 22.70 | 44.12 |
| ms-ms-spectrum-simulation-bonus-chemical | FraGNNet | 31.93 | 63.20 | 82.70 |
| ms-ms-spectrum-simulation-bonus-chemical | GNN | 3.63 | 13.55 | 33.77 |

| Benchmark | Method | Cosine Similarity | Jensen-Shannon Similarity | Hit Rate @ 1 | Hit Rate @ 5 | Hit Rate @ 20 |
|---|---|---|---|---|---|---|
| ms-ms-spectrum-simulation-on-massspecgym | GNN | 0.19 | 0.20 | 3.95 | 11.92 | 26.27 |
| ms-ms-spectrum-simulation-on-massspecgym | FFN Fingerprint | 0.25 | 0.24 | 8.44 | 21.43 | 38.57 |
| ms-ms-spectrum-simulation-on-massspecgym | Precursor m/z | 0.15 | 0.15 | 0.38 | 1.72 | 7.17 |
| ms-ms-spectrum-simulation-on-massspecgym | FraGNNet | 0.52 | 0.47 | 46.64 | 72.56 | 83.58 |
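
To make the metric names in the benchmark results above more concrete, the following is a minimal Python sketch of how a hit rate at k, a cosine similarity, and a Jensen-Shannon similarity are commonly computed. It is an illustration under stated assumptions, not the MassSpecGym implementation: the function names are hypothetical, spectra are assumed to be already binned into equal-length intensity vectors, and the Jensen-Shannon similarity is assumed to be one minus the base-2 Jensen-Shannon divergence. The benchmark's own definitions may differ in binning, normalization, and scaling.

```python
import math


def hit_rate_at_k(ranked_candidates, true_ids, k):
    """Fraction of queries whose true molecule is among the top-k ranked candidates.

    ranked_candidates: one list of candidate identifiers per query, best first.
    true_ids: the correct identifier for each query (hypothetical input layout).
    """
    hits = sum(
        1 for ranked, true_id in zip(ranked_candidates, true_ids)
        if true_id in ranked[:k]
    )
    return hits / len(true_ids)


def cosine_similarity(p, q):
    """Cosine similarity between two binned intensity vectors of equal length."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm > 0 else 0.0


def _normalize(v):
    """Scale non-negative intensities so they sum to 1."""
    total = sum(v)
    if total <= 0:
        raise ValueError("spectrum has no intensity")
    return [x / total for x in v]


def jensen_shannon_similarity(p, q):
    """Assumed definition: 1 - Jensen-Shannon divergence (log base 2)."""
    p, q = _normalize(p), _normalize(q)
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the KL divergence.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 1.0 - (0.5 * kl(p, m) + 0.5 * kl(q, m))


# Toy usage with made-up data (not benchmark results).
ranked = [["a", "b", "c"], ["x", "y", "z"], ["m", "n", "o"]]
truth = ["a", "z", "q"]
top1 = hit_rate_at_k(ranked, truth, k=1)
top3 = hit_rate_at_k(ranked, truth, k=3)
```

In this toy example one of the three queries has its true molecule ranked first and two have it within the top three, so the hit rate rises with k. The tables above appear to report hit rates on a 0 to 100 scale, whereas this sketch returns a fraction between 0 and 1.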