
Abstract
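Tabular data is among the oldest and most widely used forms of data. However, generating synthetic samples that preserve the characteristics of the original data remains a significant challenge for tabular data. While many generative models from the computer vision domain, such as variational autoencoders or generative adversarial networks, have been adapted for tabular data generation, comparatively little research has addressed recent Transformer-based large language models (LLMs), which are also generative in nature. To this end, we propose GReaT (Generation of Realistic Tabular data), which uses an auto-regressive generative LLM to sample synthetic yet highly realistic tabular data that follows the original data distribution. Furthermore, GReaT can model tabular data distributions by conditioning on any subset of features; the remaining features are sampled without additional computational overhead. We demonstrate the effectiveness of the proposed approach in a series of experiments that quantify the validity and quality of the generated samples from multiple angles. The results show that GReaT maintains state-of-the-art performance across numerous real-world and synthetic data sets with heterogeneous feature types and varying sizes.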
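To make the conditioning idea concrete, the following is a minimal sketch of how a table row might be serialized into text for an auto-regressive language model and parsed back. It assumes the "feature is value" sentence format and random feature ordering described in the GReaT paper; the function names (`encode_row`, `conditional_prompt`, `decode_row`) are hypothetical and are not part of the `be_great` package, and no language model is actually called here.

```python
import random


def encode_row(row, rng=None):
    """Serialize one row as text, e.g. 'age is 39, education is Bachelors'.

    Features are shuffled so that, during training, the model sees every
    feature in every position. This is what lets any subset of features
    act as a prefix at sampling time (an assumption taken from the paper).
    """
    rng = rng or random.Random()
    items = list(row.items())
    rng.shuffle(items)
    return ", ".join(f"{name} is {value}" for name, value in items)


def conditional_prompt(known):
    """Build a text prefix from known features; a model would complete the rest."""
    return ", ".join(f"{name} is {value}" for name, value in known.items()) + ","


def decode_row(text, columns):
    """Parse generated text back into a row.

    Fragments that do not match 'name is value' or name an unknown column
    are ignored; columns that never appear stay None. Values come back as
    strings, and values containing commas are not handled by this sketch.
    """
    row = {column: None for column in columns}
    for part in text.split(","):
        name, sep, value = part.strip().partition(" is ")
        if sep and name in row:
            row[name] = value.strip()
    return row


if __name__ == "__main__":
    example = {"age": 39, "education": "Bachelors", "income": "<=50K"}
    sentence = encode_row(example, random.Random(0))
    print(sentence)
    print(conditional_prompt({"education": "Bachelors"}))
    print(decode_row(sentence, list(example)))
```

Because the prompt is only a text prefix, conditioning on a different feature subset requires no retraining and no extra model; the model simply continues the sentence, which is consistent with the abstract's claim that the remaining features are sampled without additional overhead.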
Code repositories
kathrinse/be_great (official implementation, PyTorch; mentioned in GitHub)
Benchmarks
| Benchmark | Method | Parameters (M) | Metric | DT | LR | RF |
|---|---|---|---|---|---|---|
| tabular-data-generation-on-adult-census | Distill-GReaT | 82 | Accuracy | 84.49 | 84.65 | 85.25 |
| tabular-data-generation-on-adult-census | GReaT | 355 | Accuracy | 84.81 | 84.77 | 85.42 |
| tabular-data-generation-on-california-housing | Distill-GReaT | 82 | Mean Squared Error | 0.43 | 0.57 | 0.32 |
| tabular-data-generation-on-california-housing | GReaT | 355 | Mean Squared Error | 0.39 | 0.34 | 0.28 |
| tabular-data-generation-on-diabetes | Distill-GReaT | 82 | Accuracy | 0.541 | 0.5733 | 0.5803 |
| tabular-data-generation-on-diabetes | GReaT | 355 | Accuracy | 0.5523 | 0.5734 | 0.5834 |
| tabular-data-generation-on-heloc | GReaT | 355 | Accuracy | 79.1 | 71.9 | 80.93 |
| tabular-data-generation-on-heloc | Distill-GReaT | 82 | Accuracy | 81.4 | 70.58 | 82.14 |
| tabular-data-generation-on-sick | GReaT | 355 | Accuracy | 97.72 | 97.72 | 98.3 |
| tabular-data-generation-on-sick | Distill-GReaT | 82 | Accuracy | 95.39 | 96.56 | 97.72 |
| tabular-data-generation-on-travel | GReaT | 355 | Accuracy | 83.56 | 80.1 | 84.3 |
| tabular-data-generation-on-travel | Distill-GReaT | 82 | Accuracy | 77.38 | 78.53 | 79.5 |

DT, LR, and RF denote decision tree, linear or logistic regression, and random forest models. The diabetes accuracies are reported as fractions rather than percentages.