I wanted to reach out to see if anyone has tried using GANs to generate tabular data. Many organizations hold tabular data in the form of database records, but suffer from imbalanced data. For images, you can use augmentation to generate new image data. Is there a way to generate new tabular data?
There are certain risks and challenges with “synthesizing new tabular data” using any technique:
- Image transforms are easy to understand and “interpret”; structured data transforms usually are not
- In the case of time series, some trend insights might be lost by GANs or statistical transforms
For instance, let’s take e-commerce sales time series data for a particular SKU. This item might be subject to category influences (most babies are born in certain months, most electronics are bought on holidays) as well as competitor pricing, distribution problems, delivery, and changes in user-perceived quality, among other things. The synthesized data might not capture these “trend” insights.
Handling imbalanced class distribution:
GANs for Data Augmentation of Structured Data:
- This blogpost explains different types of GAN being used for structured (categorical?) data augmentation.
e-Commerce-Conditional GAN from Amazon Machine Learning (ICLR 2018) is a great read on applying GANs to a specific challenge or dataset.
Thanks @nirantk, I really appreciate your response and the links, especially the paper.
Yes, I have, and it is a very powerful method. You need to be careful about overfitting to outliers in the data. I have used it with astronomy data (3 to 10 columns) with as few as 300 data points and got remarkable results. The most effective recipe I have seen is WGAN-GP trained with optimistic mirror descent (Adam). I am currently trying to identify a criterion for when the system has converged to the optimal value (any ideas?). Visual inspection is not good enough as a criterion for research. It is not an easy one: the WGAN loss takes forever to stabilize, and if I reduce the learning rate, things move again. The other performance metrics I’ve tested are rather noisy. Getting there …
edit: Hot off the press, Section 3.2, WGAN-GP for data augmentation on tabular data (actually, positions and projected line-of-sight velocity (x,y,v)).
Can you share the architecture you used?
The architecture is documented in the paper; this will get you started. It is an implementation of Optimistic Mirror Descent Adam, which has given the most reliable training (WGAN-GP) in all of my tests so far. I still haven’t had time to test it on images. Getting there …
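To give a feel for why the optimistic update helps in adversarial training, here is a minimal sketch on a toy two-player game. It is not the implementation referred to above and it is not the Adam variant: it is the plain optimistic gradient step (use 2 × current gradient minus the previous gradient), and the game f(x, y) = x · y, the learning rate and the step count are all assumptions picked for illustration.

```python
import math

def grad(x, y):
    # Toy bilinear game f(x, y) = x * y: x minimises f, y maximises f.
    # Returns (df/dx, df/dy). The only equilibrium is (0, 0).
    return y, x

def plain_gda(x, y, lr, steps):
    # Simultaneous gradient descent (x) / ascent (y).
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y + lr * gy
    return x, y

def optimistic_gda(x, y, lr, steps):
    # Optimistic update: step along 2 * g_t - g_{t-1} instead of g_t.
    prev_gx, prev_gy = grad(x, y)  # makes the first step a plain step
    for _ in range(steps):
        gx, gy = grad(x, y)
        x = x - lr * (2 * gx - prev_gx)
        y = y + lr * (2 * gy - prev_gy)
        prev_gx, prev_gy = gx, gy
    return x, y

x0, y0 = 1.0, 1.0
gda_norm = math.hypot(*plain_gda(x0, y0, lr=0.1, steps=2000))
ogda_norm = math.hypot(*optimistic_gda(x0, y0, lr=0.1, steps=2000))
print(f"plain GDA distance from equilibrium:      {gda_norm:.3g}")
print(f"optimistic GDA distance from equilibrium: {ogda_norm:.3g}")
```

On this toy game the plain updates spiral away from the equilibrium while the optimistic ones spiral in. Whether that carries over to a given WGAN-GP setup is something to check empirically.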
I’ve recently been looking at a lot of tabular GAN papers. If you have mixed data (continuous, discrete and categorical), tableGAN is a valuable source to look at. They use convolutional neural networks to generate fake data, where categorical and discrete data are converted to a number between 0 and nr_classes, which is then MinMaxScaled. My results haven’t been fantastic yet, but I only just chucked my data in to see how it went, so many improvements can still be made. The tableGAN code is an absolute horror and written in TF, so I’m trying to get something workable in PyTorch from it, possibly using the OMDadam optimizer.
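The encoding step described above (category to integer code, then min-max scaling) can be sketched roughly as follows. This is not tableGAN’s code: the helper names `fit_encoder`, `encode` and `decode` are hypothetical, the target range [0, 1] is an assumption, and the decode step (snapping a generated value to the nearest category) is just one plausible way to map outputs back.

```python
def fit_encoder(values):
    """Map each distinct category to an integer code 0..n_classes-1."""
    classes = sorted(set(values))
    return {c: i for i, c in enumerate(classes)}

def encode(values, codes):
    """Integer-code the column, then min-max scale the codes into [0, 1]."""
    top = max(len(codes) - 1, 1)  # avoid dividing by zero for a single class
    return [codes[v] / top for v in values]

def decode(scaled, codes):
    """Snap (possibly generated) values back to the nearest valid category."""
    classes = sorted(codes, key=codes.get)
    top = max(len(codes) - 1, 1)
    out = []
    for s in scaled:
        clipped = min(max(s, 0.0), 1.0)  # generator output may leave [0, 1]
        out.append(classes[round(clipped * top)])
    return out

colours = ["red", "green", "blue", "green"]
codes = fit_encoder(colours)
scaled = encode(colours, codes)
print(scaled)
print(decode([0.07, 0.48, 0.93], codes))
```

One thing to keep in mind with this kind of encoding is that it imposes an arbitrary order on the categories (here alphabetical), so values that are close after scaling are not necessarily similar categories.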
If you have no continuous data but only categorical and discrete, I think medGAN is the easiest to try. They use an autoencoder between the generator and discriminator to transform the original data into a continuous representation, and get pretty good results.
Nice to see others working on this. Could you elaborate on some of the things you wrote about?
- Why did you choose to do a second round of adversarial training between t_train+t_synth and t_test? Assuming train_data and test_data have a similar distribution, just using synt_data as target 0 and t_train as target 1 should give you faster results, right?
- Is the extend_from_original basically just sampling the original data? Or is it doing something more?
- The second round of adversarial training uses CatBoost, which is clearly not a neural model. Does introducing a non-neural model have interesting effects when training a GAN? Why shouldn’t you use a neural-net-based model for this second round of training?
- How do you compare t_train and t_synth to t_test in step 5/6? How do the new fake datapoints correspond to t_test?
Thanks for your response. The tasks I have been trying to solve are mostly work-related. There we have some kind of time-based model, so train and test definitely don’t always have the same distribution. The goal is therefore to make the train set more similar to the test set. The last table in the article shows that the GAN performed better when the distributions weren’t the same.
- I thought the train set is useful on its own, so why not use it? I also tried the method you proposed, and it gave less stable results.
- extend_from_original: it samples from the train set using the same adversarial training.
- Not CatBoost but LightGBM with a CatBoost target encoder, but you’re right. Historically, boosters show really good scores, usually outperforming neural nets. I had an idea to train GANs with an additional loss showing how well a neural net performs on train/test, but that’s out of scope and I don’t have the time.
- I have checked the accuracy: it was between 60% and 95%, so sometimes good, sometimes not so much.
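For anyone unfamiliar with the idea of making train look more like test with a classifier (often called adversarial validation), here is a rough sketch of the general technique. It is not the pipeline discussed above: a hand-rolled 1-D logistic regression stands in for LightGBM, the data is a synthetic toy with an assumed shift between train and test, and all names (`fit_logistic`, `most_test_like`, etc.) are made up for illustration.

```python
import math
import random

def predict(x, w, b):
    """Probability that a row comes from the test set."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def fit_logistic(xs, ys, lr=0.1, epochs=500):
    """1-D logistic regression fitted by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            err = predict(x, w, b) - y
            gw += err * x
            gb += err
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b

rng = random.Random(0)
train = [rng.gauss(0.0, 1.0) for _ in range(200)]  # toy "train" feature
test = [rng.gauss(1.5, 1.0) for _ in range(200)]   # toy "test" feature, shifted

# Label train rows 0 and test rows 1, then try to tell them apart.
xs = train + test
ys = [0] * len(train) + [1] * len(test)
w, b = fit_logistic(xs, ys)

# In-sample accuracy for brevity; use a held-out split in practice.
# Around 0.5 means train and test are hard to distinguish.
hits = sum((predict(x, w, b) > 0.5) == bool(y) for x, y in zip(xs, ys))
accuracy = hits / len(xs)
print(f"train-vs-test accuracy: {accuracy:.2f}")

# Keep the train rows the classifier finds most test-like.
ranked = sorted(train, key=lambda x: predict(x, w, b), reverse=True)
most_test_like = ranked[:50]
print(f"mean of selected rows: {sum(most_test_like) / len(most_test_like):.2f}")
```

The same scoring could be applied to synthetic rows instead of train rows to keep only the generated samples that look most like the test distribution.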