3 months ago

Language Models are Realistic Tabular Data Generators

Vadim Borisov, Kathrin Seßler, Tobias Leemann, Martin Pawelczyk, Gjergji Kasneci

Abstract

Tabular data is among the oldest and most ubiquitous forms of data. However, the generation of synthetic samples with the original data's characteristics remains a significant challenge for tabular data. While many generative models from the computer vision domain, such as variational autoencoders or generative adversarial networks, have been adapted for tabular data generation, less research has been directed towards recent transformer-based large language models (LLMs), which are also generative in nature. To this end, we propose GReaT (Generation of Realistic Tabular data), which exploits an auto-regressive generative LLM to sample synthetic and yet highly realistic tabular data. Furthermore, GReaT can model tabular data distributions by conditioning on any subset of features; the remaining features are sampled without additional overhead. We demonstrate the effectiveness of the proposed approach in a series of experiments that quantify the validity and quality of the produced data samples from multiple angles. We find that GReaT maintains state-of-the-art performance across numerous real-world and synthetic data sets with heterogeneous feature types coming in various sizes.
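The abstract does not spell out how a table row becomes input for a language model, so the following is only a minimal sketch under stated assumptions: each row is serialized as a comma-separated list of "column is value" clauses in random feature order, and conditioning on a subset of features amounts to starting generation from a prefix built from the known features. The function names (encode_row, conditioning_prompt, decode_row) and the example columns are hypothetical and are not taken from the be_great package.

```python
import random


def encode_row(row, rng):
    """Serialize one row as text, visiting the features in a random order.

    Assumption: a "<column> is <value>" clause per feature, joined by commas.
    """
    items = list(row.items())
    rng.shuffle(items)
    return ", ".join(f"{name} is {value}" for name, value in items)


def conditioning_prompt(known):
    """Build a text prefix from the features whose values are fixed in advance.

    A language model would be asked to continue this prefix, which fills in
    the remaining features. An empty dict yields an empty prompt.
    """
    if not known:
        return ""
    return ", ".join(f"{name} is {value}" for name, value in known.items()) + ","


def decode_row(text, columns):
    """Parse generated text back into a row.

    Columns that never appear stay None, unknown names are ignored, and all
    values come back as strings. Values containing commas are not supported
    by this simple parser.
    """
    row = dict.fromkeys(columns)
    for clause in text.split(","):
        name, sep, value = clause.strip().partition(" is ")
        if sep and name in row:
            row[name] = value.strip()
    return row


# Hypothetical example row, used only to exercise the round trip.
example = {"age": 39, "education": "Bachelors", "income": "<=50K"}
encoded = encode_row(example, random.Random(0))
decoded = decode_row(encoded, list(example))
prompt = conditioning_prompt({"education": "Bachelors"})
```

Shuffling the feature order during encoding is what would let a prefix start from any subset of features at sampling time, since the model never sees a single fixed column order. The actual training and sampling steps (fine-tuning an auto-regressive LLM on such strings) are left out here.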

Code Repositories

kathrinse/be_great (official, PyTorch)

Benchmarks

Each benchmark is named tabular-data-generation-on-<dataset>. DT, LR and RF give the score per downstream model, reported as accuracy or mean squared error (MSE). Values are reproduced as listed; the diabetes accuracies appear as fractions rather than percentages.

Dataset             Methodology    Parameters (M)  Metric    DT      LR      RF
adult-census        Distill-GReaT  82              Accuracy  84.49   84.65   85.25
adult-census        GReaT          355             Accuracy  84.81   84.77   85.42
california-housing  Distill-GReaT  82              MSE       0.43    0.57    0.32
california-housing  GReaT          355             MSE       0.39    0.34    0.28
diabetes            Distill-GReaT  82              Accuracy  0.541   0.5733  0.5803
diabetes            GReaT          355             Accuracy  0.5523  0.5734  0.5834
heloc               GReaT          355             Accuracy  79.1    71.9    80.93
heloc               Distill-GReaT  82              Accuracy  81.4    70.58   82.14
sick                GReaT          355             Accuracy  97.72   97.72   98.3
sick                Distill-GReaT  82              Accuracy  95.39   96.56   97.72
travel              GReaT          355             Accuracy  83.56   80.1    84.3
travel              Distill-GReaT  82              Accuracy  77.38   78.53   79.5
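
To make the benchmark figures listed above easier to compare, the following sketch picks, per dataset, whichever of the two models has the better mean score across the DT, LR and RF columns. Averaging over the three downstream models is an assumption made here for illustration only, not a summary used by the paper or the listing; the numbers themselves are copied from the listing.

```python
# Scores copied from the benchmark listing, ordered as (DT, LR, RF).
RESULTS = {
    "adult-census": ("accuracy", {
        "GReaT": (84.81, 84.77, 85.42),
        "Distill-GReaT": (84.49, 84.65, 85.25),
    }),
    "california-housing": ("mse", {
        "GReaT": (0.39, 0.34, 0.28),
        "Distill-GReaT": (0.43, 0.57, 0.32),
    }),
    "diabetes": ("accuracy", {
        "GReaT": (0.5523, 0.5734, 0.5834),
        "Distill-GReaT": (0.541, 0.5733, 0.5803),
    }),
    "heloc": ("accuracy", {
        "GReaT": (79.1, 71.9, 80.93),
        "Distill-GReaT": (81.4, 70.58, 82.14),
    }),
    "sick": ("accuracy", {
        "GReaT": (97.72, 97.72, 98.3),
        "Distill-GReaT": (95.39, 96.56, 97.72),
    }),
    "travel": ("accuracy", {
        "GReaT": (83.56, 80.1, 84.3),
        "Distill-GReaT": (77.38, 78.53, 79.5),
    }),
}


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def better_model(dataset):
    """Return the model with the better mean score on one dataset.

    Accuracy is higher-is-better; mean squared error is lower-is-better.
    """
    metric, scores = RESULTS[dataset]
    pick = max if metric == "accuracy" else min
    return pick(scores, key=lambda model: mean(scores[model]))


winners = {dataset: better_model(dataset) for dataset in RESULTS}
```

Under this particular summary, the 82M-parameter Distill-GReaT comes out ahead only on heloc, while the 355M-parameter GReaT leads on the other five datasets.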
