
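Abstract
Modeling the probability distribution of rows in tabular data and generating realistic synthetic data is a non-trivial task. Tabular data usually contains a mix of discrete and continuous columns. Continuous columns may have multiple modes, whereas discrete columns are sometimes imbalanced, which makes modeling difficult. Existing statistical and deep neural network models often perform poorly on this type of data. We design TGAN (Tabular Generative Adversarial Network), which uses a conditional generative adversarial network to address these challenges. To enable a fair and thorough comparison, we design a benchmark with 7 simulated and 8 real datasets and select several Bayesian networks as baselines. TGAN outperforms the Bayesian methods on most of the real datasets, whereas the other deep learning methods do not.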
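
The two difficulties named in the abstract can be made concrete with a small simulation. The sketch below is only an illustration under assumed values, not part of the model: the column contents, the mode locations (10 and 50), and the 1% rare-category rate are all hypothetical. It shows that a single-Gaussian summary of a bimodal column centers on a region with almost no data, and that a rare category contributes very few rows when rows are drawn uniformly.

```python
import random
import statistics
from collections import Counter

random.seed(0)
N = 10_000

# Hypothetical continuous column with two modes, near 10 and near 50.
continuous = [
    random.gauss(10, 1) if random.random() < 0.5 else random.gauss(50, 1)
    for _ in range(N)
]

# A single-Gaussian summary places its mean between the two modes.
mean = statistics.fmean(continuous)

# Fraction of values within 5 units of that mean: almost none, so the
# "typical" value under a unimodal fit is not typical of the data at all.
near_mean = sum(abs(x - mean) < 5 for x in continuous) / N

# Hypothetical imbalanced discrete column: "rare" is about 1% of rows.
discrete = random.choices(["common", "rare"], weights=[0.99, 0.01], k=N)
rare_share = Counter(discrete)["rare"] / N

print(f"mean={mean:.1f}, share near mean={near_mean:.4f}, rare share={rare_share:.4f}")
```

A generator that only sees uniformly sampled rows would encounter the rare category in roughly one row out of a hundred here, which is one way to see why imbalanced discrete columns are hard to learn.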
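Code repositories
| Repository | Framework | Notes |
|---|---|---|
| oregonpillow/ctgan-server-cli | pytorch | Mentioned on GitHub |
| sdv-dev/CTGAN | pytorch | Mentioned on GitHub |
| glederrey/datgan | tf | Mentioned on GitHub |
| juliecious/ctgan | pytorch | Mentioned on GitHub |
| DAI-Lab/CTGAN | pytorch | Official; mentioned on GitHub |
| Diyago/GAN-for-tabular-data | pytorch | |
| lvyufeng/CTGAN-MindSpore | mindspore | |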
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| tabular-data-generation-on-adult-census | CopulaGAN | DT Accuracy: 76.29; LR Accuracy: 80.61; Parameters(M): 0.300; RF Accuracy: 80.46 |
| tabular-data-generation-on-adult-census | TVAE | DT Accuracy: 82.8; LR Accuracy: 80.53; Parameters(M): 0.053; RF Accuracy: 83.48 |
| tabular-data-generation-on-adult-census | CTGAN | DT Accuracy: 81.32; LR Accuracy: 83.2; Parameters(M): 0.302; RF Accuracy: 83.53 |
| tabular-data-generation-on-california-housing | CTGAN | DT Mean Squared Error: 0.82; LR Mean Squared Error: 0.61; Parameters(M): 0.197; RF Mean Squared Error: 0.62 |
| tabular-data-generation-on-california-housing | TVAE | DT Mean Squared Error: 0.45; LR Mean Squared Error: 0.65; Parameters(M): 0.045; RF Mean Squared Error: 0.35 |
| tabular-data-generation-on-california-housing | CopulaGAN | DT Mean Squared Error: 1.19; LR Mean Squared Error: 0.98; Parameters(M): 0.201; RF Mean Squared Error: 0.99 |
| tabular-data-generation-on-diabetes | CTGAN | DT Accuracy: 0.4973; LR Accuracy: 0.5093; Parameters(M): 9.6; RF Accuracy: 0.5223 |
| tabular-data-generation-on-diabetes | TVAE | DT Accuracy: 0.5330; LR Accuracy: 0.5634; Parameters(M): 0.359; RF Accuracy: 0.5517 |
| tabular-data-generation-on-diabetes | CopulaGAN | DT Accuracy: 0.385; LR Accuracy: 0.4027; Parameters(M): 9.4; RF Accuracy: 0.3759 |
| tabular-data-generation-on-heloc | CTGAN | DT Accuracy: 61.34; LR Accuracy: 57.72; Parameters(M): 0.277; RF Accuracy: 62.35 |
| tabular-data-generation-on-heloc | TVAE | DT Accuracy: 76.39; LR Accuracy: 71.04; Parameters(M): 62; RF Accuracy: 77.24 |
| tabular-data-generation-on-heloc | CopulaGAN | DT Accuracy: 42.36; LR Accuracy: 42.03; Parameters(M): 0.276; RF Accuracy: 42.35 |
| tabular-data-generation-on-sick | CopulaGAN | DT Accuracy: 93.77; LR Accuracy: 94.57; Parameters(M): 0.226; RF Accuracy: 94.57 |
| tabular-data-generation-on-sick | TVAE | DT Accuracy: 95.39; LR Accuracy: 94.7; Parameters(M): 0.046; RF Accuracy: 94.91 |
| tabular-data-generation-on-sick | CTGAN | DT Accuracy: 92.05; LR Accuracy: 94.44; Parameters(M): 0.222; RF Accuracy: 94.57 |
| tabular-data-generation-on-travel | CTGAN | DT Accuracy: 73.3; LR Accuracy: 73.3; Parameters(M): 0.155; RF Accuracy: 71.41 |
| tabular-data-generation-on-travel | TVAE | DT Accuracy: 81.68; LR Accuracy: 79.58; Parameters(M): 0.036; RF Accuracy: 81.68 |
| tabular-data-generation-on-travel | CopulaGAN | DT Accuracy: 73.61; LR Accuracy: 73.3; Parameters(M): 0.157; RF Accuracy: 73.3 |
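
Each metrics cell in the table packs several `name: value` pairs into one string, which makes side-by-side comparison awkward. The following is a minimal sketch of how such a cell could be split into a dictionary for comparison. The helper `parse_metrics` is a hypothetical name introduced here for illustration; the two input strings are copied from the Adult Census rows of the table.

```python
import re

# Two metrics cells copied from the Adult Census rows of the table.
cells = {
    "TVAE": "DT Accuracy: 82.8 LR Accuracy: 80.53 Parameters(M): 0.053 RF Accuracy: 83.48",
    "CTGAN": "DT Accuracy: 81.32 LR Accuracy: 83.2 Parameters(M): 0.302 RF Accuracy: 83.53",
}

def parse_metrics(cell: str) -> dict[str, float]:
    """Split a run of 'name: value' pairs into a dict (hypothetical helper).

    A separator such as ';' between pairs is tolerated but not required.
    """
    pairs = re.findall(r"([A-Za-z][A-Za-z() ]*?):\s*([0-9.]+)", cell)
    return {name: float(value) for name, value in pairs}

parsed = {method: parse_metrics(cell) for method, cell in cells.items()}

# Accuracy is higher-is-better; the California Housing rows report
# mean squared error instead, where min() would be the right choice.
best_rf = max(parsed, key=lambda method: parsed[method]["RF Accuracy"])
print(best_rf, parsed[best_rf]["RF Accuracy"])
```

Note that the table mixes scales (the diabetes rows report accuracy as a fraction, the others as a percentage), so values are only comparable within a single benchmark.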