Using GAN for generating new tabular data

(Amrit) #1

Hi all,

I wanted to reach out to see if anyone has tried using GANs to generate tabular data. Many establishments own tabular data, structured as records in a database, but suffer from imbalanced data. For images, you can use augmentation to generate new image data. Is there a way to generate new tabular data?


(nirant) #2

Hey @amritv,

There are certain risks and challenges with “synthesizing new tabular data” using any technique:

  • Image transforms are easy to understand and “interpret”; structured data transforms usually are not
  • In case of time series, some trend insights might be lost via GANs or statistical transforms

For instance, let’s take e-commerce sales time series data for a particular SKU. This item might be subject to category influences (most babies are born in certain months, most electronics are bought on holidays) as well as competitor pricing, distribution pains, delivery, and changes in user-perceived quality, among other things. The synthesized data might not capture these “trend” insights.

Handling imbalanced class distribution:

GANs for Data Augmentation of Structured Data:

  • This blogpost explains different types of GAN being used for structured (categorical?) data augmentation.
  • e-Commerce-Conditional GAN from Amazon Machine Learning (ICLR 2018) is a great read on applying GANs to any specific challenge or data

(Amrit) #3

Thanks @nirantk, I really appreciate your response and the links, especially the paper.

(Foivos Diakogiannis) #4

Hi @amritv,

Yes, I have; it is a super powerful method. You need to be careful about overfitting to the outliers in the data. I have used it with astronomy data (from 3 to 10 columns), with as few as 300 data points, and got remarkable results. The most effective recipe I have seen is WGAN-GP with optimistic mirror descent (Adam). I am currently trying to identify a criterion for when the system has converged to the optimal value (any ideas?). Visual inspection is not a good enough criterion for research. It's not an easy one, as the WGAN loss takes forever to stabilize, and if I reduce the learning rate, things move again. Other performance metrics I’ve tested are kinda noisy. Getting there …

edit: Hot off the press, Section 3.2, WGAN-GP for data augmentation on tabular data (actually, positions and projected line-of-sight velocity (x,y,v)).
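
For readers unfamiliar with WGAN-GP, the gradient penalty it adds to the critic loss can be sketched as below. This is only an illustrative sketch, not the setup used in the post above: the toy critic D(x) = v · tanh(Wx + b), its hand-written gradient, and every name here (critic_grad, gradient_penalty, W, b, v) are assumptions made so that the example runs with the standard library alone. The penalty weight lam=10 is the default suggested in the WGAN-GP paper. In practice the critic would be a neural network and the gradient would come from an autodiff framework.

```python
import math
import random

def critic_grad(x, W, b, v):
    """Gradient of the toy critic D(x) = v . tanh(W x + b) w.r.t. one row x."""
    # d/dx [v . tanh(Wx + b)] = W^T (v * (1 - tanh(Wx + b)^2))
    scale = []
    for row, b_i, v_i in zip(W, b, v):
        h = math.tanh(sum(w * x_j for w, x_j in zip(row, x)) + b_i)
        scale.append(v_i * (1.0 - h * h))
    return [sum(s * row[j] for s, row in zip(scale, W)) for j in range(len(x))]

def gradient_penalty(real_rows, fake_rows, W, b, v, lam=10.0, rng=random):
    """WGAN-GP penalty: lam * mean((||grad_x D(x_hat)||_2 - 1)^2)."""
    total = 0.0
    for real, fake in zip(real_rows, fake_rows):
        eps = rng.random()  # one mixing weight per real/fake pair
        # x_hat lies on the straight line between a real and a generated row
        x_hat = [eps * r + (1.0 - eps) * f for r, f in zip(real, fake)]
        grad = critic_grad(x_hat, W, b, v)
        norm = math.sqrt(sum(g * g for g in grad))
        total += (norm - 1.0) ** 2
    return lam * total / len(real_rows)

# Hypothetical 3-column table, e.g. rows of (x, y, v); values are random noise.
rng = random.Random(0)
W = [[0.5, -1.0, 0.3], [1.5, 0.2, -0.7]]
b = [0.1, -0.3]
v = [1.0, -2.0]
real = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(8)]
fake = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(8)]
print(gradient_penalty(real, fake, W, b, v, rng=rng))
```

The penalty is added to the critic loss at every critic update. It pushes the norm of the critic's gradient towards 1 at points between real and generated rows, which is how WGAN-GP enforces the Lipschitz constraint softly instead of clipping weights.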