
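Abstract
Recent applications of machine learning methods to time-series data collected in intensive care units (ICUs) have been notably successful, but they have also exposed the lack of standardized machine learning benchmarks for developing and comparing such methods. Although raw datasets such as MIMIC-IV or eICU are freely accessible on PhysioNet, the choice of tasks and pre-processing is often made ad hoc in each publication, which limits comparability across publications. In this work, we aim to improve this situation by providing a benchmark covering a broad range of ICU-related tasks. Using the HiRID dataset, we define multiple clinically relevant tasks in collaboration with clinicians. In addition, we provide a reproducible end-to-end pipeline for constructing both data and labels. Finally, we provide an in-depth analysis of current state-of-the-art sequence modeling methods, highlighting some limitations of deep learning approaches on this type of data. With this benchmark, we hope to give the research community the possibility of a fair comparison of their work.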
Code repository
ratschlab/HIRID-ICU-Benchmark (official, PyTorch)
Benchmarks
| Benchmark | Method | Metric |
|---|---|---|
| circulatory-failure-on-hirid | LSTM | AUPRC: 0.322±0.008 |
| circulatory-failure-on-hirid | LGBM | AUPRC: 0.389±0.003 |
| circulatory-failure-on-hirid | LGBM (+ hand-crafted features) | AUPRC: 0.388±0.002 |
| circulatory-failure-on-hirid | TCN | AUPRC: 0.358±0.006 |
| circulatory-failure-on-hirid | GRU | AUPRC: 0.368±0.005 |
| circulatory-failure-on-hirid | Transformer | AUPRC: 0.352±0.006 |
| circulatory-failure-on-hirid | Logistic Regression | AUPRC: 0.305±0.000 |
| icu-mortality-on-hirid | Logistic Regression | AUPRC: 0.581±0.000 |
| icu-mortality-on-hirid | Transformer | AUPRC: 0.610±0.008 |
| icu-mortality-on-hirid | GRU | AUPRC: 0.603±0.016 |
| icu-mortality-on-hirid | LSTM | AUPRC: 0.600±0.009 |
| icu-mortality-on-hirid | LGBM | AUPRC: 0.546±0.008 |
| icu-mortality-on-hirid | LGBM (+ hand-crafted features) | AUPRC: 0.626±0.000 |
| icu-mortality-on-hirid | TCN | AUPRC: 0.602±0.011 |
| kidney-function-on-hirid | LSTM | MAE: 0.50±0.01 |
| kidney-function-on-hirid | GRU | MAE: 0.49±0.02 |
| kidney-function-on-hirid | LGBM (+ hand-crafted features) | MAE: 0.45±0.00 |
| kidney-function-on-hirid | Transformer | MAE: 0.48±0.02 |
| kidney-function-on-hirid | TCN | MAE: 0.50±0.01 |
| kidney-function-on-hirid | LGBM | MAE: 0.45±0.00 |
| patient-phenotyping-on-hirid | TCN | Balanced Accuracy: 41.6±2.3 |
| patient-phenotyping-on-hirid | LGBM | Balanced Accuracy: 40.4±0.8 |
| patient-phenotyping-on-hirid | LGBM (+ hand-crafted features) | Balanced Accuracy: 45.8±2.0 |
| patient-phenotyping-on-hirid | Transformer | Balanced Accuracy: 42.7±1.4 |
| patient-phenotyping-on-hirid | GRU | Balanced Accuracy: 39.2±2.1 |
| patient-phenotyping-on-hirid | Logistic Regression | Balanced Accuracy: 39.1±0.0 |
| patient-phenotyping-on-hirid | LSTM | Balanced Accuracy: 39.5±1.2 |
| remaining-length-of-stay-on-hirid | LGBM (+ hand-crafted features) | MAE: 57.0±0.3 |
| remaining-length-of-stay-on-hirid | LGBM | MAE: 56.9±0.4 |
| remaining-length-of-stay-on-hirid | Transformer | MAE: 59.5±2.8 |
| remaining-length-of-stay-on-hirid | TCN | MAE: 59.8±2.8 |
| remaining-length-of-stay-on-hirid | LSTM | MAE: 60.7±1.6 |
| remaining-length-of-stay-on-hirid | GRU | MAE: 60.6±0.9 |
| respiratory-failure-on-hirid | LSTM | AUPRC: 0.569±0.003 |
| respiratory-failure-on-hirid | LGBM (+ hand-crafted features) | AUPRC: 0.604±0.002 |
| respiratory-failure-on-hirid | TCN | AUPRC: 0.589±0.003 |
| respiratory-failure-on-hirid | GRU | AUPRC: 0.592±0.003 |
| respiratory-failure-on-hirid | Logistic Regression | AUPRC: 0.530±0.000 |
| respiratory-failure-on-hirid | LGBM | AUPRC: 0.585±0.001 |
| respiratory-failure-on-hirid | Transformer | AUPRC: 0.594±0.003 |
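
The table reports three kinds of metric (AUPRC, MAE, balanced accuracy), each as a mean ± spread over repeated runs. As a reading aid, the following is a minimal standard-library sketch of how such numbers could be computed. It is not the benchmark's own evaluation code: the function names are hypothetical, AUPRC is approximated here by step-wise average precision, and the use of the sample standard deviation for the ± term is an assumption. Note also that the balanced accuracy values in the table appear to be percentages, whereas this sketch returns fractions.

```python
from statistics import mean, stdev


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall (as a fraction, not a percentage)."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return mean(recalls)


def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve for 0/1 labels.

    Simplification: tied scores are not grouped into a single threshold.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(y_true)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank  # precision at the rank of each positive
    return total / n_pos


def summarize(run_scores, digits=3):
    """Format scores from repeated runs as 'mean±std' (sample std assumed)."""
    return f"{mean(run_scores):.{digits}f}±{stdev(run_scores):.{digits}f}"


if __name__ == "__main__":
    # Toy inputs for illustration only; these are not benchmark results.
    print(average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))
    print(balanced_accuracy([0, 0, 1, 1], [0, 1, 1, 1]))
    print(mean_absolute_error([1, 2, 3], [2, 2, 5]))
    print(summarize([0.60, 0.62, 0.61]))
```

Balanced accuracy averages recall per class, so it is not dominated by the majority class, and AUPRC is generally more informative than accuracy when positive events are rare; both properties make these metrics natural choices for imbalanced clinical prediction tasks.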