4 months ago

MassSpecGym: A Benchmark for the Discovery and Identification of Molecules


Abstract

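The discovery and identification of molecules in biological and environmental samples is crucial for advancing biomedical and chemical sciences. Tandem mass spectrometry (MS/MS) is the leading technique for high-throughput elucidation of molecular structures. However, decoding a molecular structure from its mass spectrum is exceptionally challenging, even when performed by human experts. As a result, the vast majority of acquired MS/MS spectra remain uninterpreted, which limits our understanding of the underlying (bio)chemical processes. Despite decades of progress in machine learning methods for predicting molecular structures from MS/MS spectra, the development of new methods is severely hindered by the lack of standard datasets and evaluation protocols. To address this problem, we propose MassSpecGym, the first comprehensive benchmark for the discovery and identification of molecules from MS/MS data. The benchmark comprises the largest publicly available collection of high-quality labeled MS/MS spectra and defines three MS/MS annotation challenges: de novo molecular structure generation, molecule retrieval, and spectrum simulation. It introduces new evaluation metrics and a generalization-demanding data split, thereby standardizing the MS/MS annotation tasks and making the problem accessible to the broad machine learning community. MassSpecGym is publicly available at https://github.com/pluskal-lab/MassSpecGym.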

Code repository

pluskal-lab/massspecgym
Official
PyTorch
Mentioned in GitHub

Benchmarks

Benchmark: de-novo-molecule-generation-from-ms-ms

| Method | Top-1 Accuracy | Top-1 MCES | Top-1 Tanimoto | Top-10 Accuracy | Top-10 MCES | Top-10 Tanimoto |
|---|---|---|---|---|---|---|
| Random chemical generation | 0.00 | 28.59 | 0.07 | 0.00 | 25.72 | 0.10 |
| SELFIES Transformer | 0.00 | 33.28 | 0.10 | 0.00 | 21.84 | 0.15 |
| SMILES Transformer | 0.00 | 53.80 | 0.07 | 0.00 | 21.97 | 0.17 |

Benchmark: de-novo-molecule-generation-from-ms-ms-1

| Method | Top-1 Accuracy | Top-1 MCES | Top-1 Tanimoto | Top-10 Accuracy | Top-10 MCES | Top-10 Tanimoto |
|---|---|---|---|---|---|---|
| SMILES Transformer | 0.00 | 79.39 | 0.03 | 0.00 | 52.13 | 0.10 |
| Random chemical generation | 0.00 | 21.11 | 0.08 | 0.00 | 18.25 | 0.11 |
| SELFIES Transformer | 0.00 | 38.88 | 0.08 | 0.00 | 26.87 | 0.13 |
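
The Top-1 and Top-10 Tanimoto figures above compare generated molecules with the true one. The following is a minimal sketch of how such a score could be computed, assuming Tanimoto is the usual Jaccard index over fingerprint on-bits and that the top-k value is the best similarity among the first k candidates. Both are assumptions made for illustration, not the benchmark's confirmed definition. The function names and toy fingerprints are hypothetical; a real pipeline would derive fingerprints from molecular structures with a cheminformatics toolkit.

```python
from typing import Sequence, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of fingerprint on-bits."""
    union = a | b
    if not union:
        # Two empty fingerprints are treated as identical here
        # (a convention chosen for this sketch, not a benchmark rule).
        return 1.0
    return len(a & b) / len(union)


def top_k_tanimoto(true_bits: Set[int], candidates: Sequence[Set[int]], k: int) -> float:
    """Best Tanimoto similarity among the first k candidates (0.0 if there are none)."""
    return max((tanimoto(true_bits, c) for c in candidates[:k]), default=0.0)


# Toy fingerprints written as sets of on-bit indices (made-up values).
truth = {1, 2, 3, 4}
generated = [{1, 2}, {1, 2, 3, 5}, {7, 8}]
print(top_k_tanimoto(truth, generated, k=1))  # 0.5
print(top_k_tanimoto(truth, generated, k=3))  # 0.6
```

Under this best-of-k reading, a Top-10 score can never be lower than the corresponding Top-1 score, which matches the direction of the Tanimoto values listed above.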
Benchmark: molecule-retrieval-from-ms-ms-spectrum-bonus

| Method | Hit rate @ 1 | Hit rate @ 5 | Hit rate @ 20 | MCES @ 1 |
|---|---|---|---|---|
| DeepSets | 4.42 | 14.46 | 30.76 | 15.04 |
| MIST | 9.57 | 22.11 | 41.12 | 12.75 |
| Random | 3.06 | 11.35 | 27.74 | 13.87 |
| DeepSets + Fourier features | 6.56 | 16.46 | 33.46 | 14.14 |
| Fingerprint FFN | 5.09 | 14.69 | 31.97 | 14.94 |

Benchmark: molecule-retrieval-from-ms-ms-spectrum-on

| Method | Hit rate @ 1 | Hit rate @ 5 | Hit rate @ 20 | MCES @ 1 |
|---|---|---|---|---|
| DeepSets + Fourier features | 5.24 | 12.58 | 28.21 | 22.13 |
| Fingerprint FFN | 2.54 | 7.59 | 20.00 | 24.66 |
| MIST | 14.64 | 34.87 | 59.15 | 15.37 |
| DeepSets | 1.47 | 6.21 | 19.23 | 25.11 |
| Random | 0.37 | 2.01 | 8.22 | 30.81 |
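
The retrieval results above are reported as hit rate @ k. A minimal sketch of that metric follows, assuming it is the percentage of query spectra whose true molecule appears among the top k ranked candidates. The function name and the toy identifiers are hypothetical and are not taken from the MassSpecGym code base.

```python
from typing import Hashable, Sequence


def hit_rate_at_k(
    ranked_candidates: Sequence[Sequence[Hashable]],
    true_ids: Sequence[Hashable],
    k: int,
) -> float:
    """Percentage of queries whose true identifier is within the first k candidates."""
    if len(ranked_candidates) != len(true_ids):
        raise ValueError("one ranked candidate list is required per query")
    if not true_ids:
        raise ValueError("at least one query is required")
    hits = sum(
        true in ranked[:k] for ranked, true in zip(ranked_candidates, true_ids)
    )
    return 100.0 * hits / len(true_ids)


# Two toy queries; each inner list is ordered from best to worst score.
rankings = [["a", "b", "c"], ["x", "y", "z"]]
truth = ["b", "z"]
print(hit_rate_at_k(rankings, truth, k=1))  # 0.0
print(hit_rate_at_k(rankings, truth, k=2))  # 50.0
print(hit_rate_at_k(rankings, truth, k=3))  # 100.0
```

Because the first k candidates always contain the first k-1, the metric is non-decreasing in k, as in the @ 1, @ 5, and @ 20 values listed above.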
Benchmark: ms-ms-spectrum-simulation-bonus-chemical

| Method | Hit Rate @ 1 | Hit Rate @ 5 | Hit Rate @ 20 |
|---|---|---|---|
| Precursor m/z | 2.09 | 8.52 | 22.65 |
| FFN Fingerprint | 7.62 | 22.70 | 44.12 |
| FraGNNet | 31.93 | 63.20 | 82.70 |
| GNN | 3.63 | 13.55 | 33.77 |

Benchmark: ms-ms-spectrum-simulation-on-massspecgym

| Method | Cosine Similarity | Jensen-Shannon Similarity | Hit Rate @ 1 | Hit Rate @ 5 | Hit Rate @ 20 |
|---|---|---|---|---|---|
| GNN | 0.19 | 0.20 | 3.95 | 11.92 | 26.27 |
| FFN Fingerprint | 0.25 | 0.24 | 8.44 | 21.43 | 38.57 |
| Precursor m/z | 0.15 | 0.15 | 0.38 | 1.72 | 7.17 |
| FraGNNet | 0.52 | 0.47 | 46.64 | 72.56 | 83.58 |
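
The spectrum simulation results above include cosine similarity and Jensen-Shannon similarity between predicted and measured spectra. The sketch below illustrates both on spectra that have already been binned into equal-length intensity vectors. It assumes that Jensen-Shannon similarity means one minus the base-2 Jensen-Shannon divergence of the intensity-normalized spectra; that definition, the function names, and the toy vectors are assumptions for illustration, and the benchmark's exact binning and normalization may differ.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two binned intensity vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty spectrum shares no signal with anything
    return dot / (norm_u * norm_v)


def _normalize(x: Sequence[float]) -> List[float]:
    """Scale non-negative intensities so that they sum to 1."""
    total = sum(x)
    if total <= 0.0:
        raise ValueError("spectrum must have positive total intensity")
    return [a / total for a in x]


def js_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - Jensen-Shannon divergence (base 2), so the result lies in [0, 1]."""
    p, q = _normalize(u), _normalize(v)
    divergence = 0.0
    for pi, qi in zip(p, q):
        mi = 0.5 * (pi + qi)  # mixture distribution
        if pi > 0.0:
            divergence += 0.5 * pi * math.log2(pi / mi)
        if qi > 0.0:
            divergence += 0.5 * qi * math.log2(qi / mi)
    return 1.0 - divergence


# Toy binned spectra (made-up intensities over four m/z bins).
predicted = [0.0, 2.0, 1.0, 0.0]
measured = [0.0, 1.0, 1.0, 1.0]
print(cosine_similarity(predicted, measured))
print(js_similarity(predicted, measured))
```

With these definitions, identical spectra score close to 1 on both measures, and spectra with no shared bins score 0 on both.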
