Tabular Autoencoder?

muellerzr · March 15, 2019, 3:21pm

I am trying to make an autoencoder for a tabular dataset using fast.AI v1. I have seen people using them for images and I’m trying to follow, but I am not sure how to implement that for tabular data.

https://alanbertl.com/autoencoder-with-fast-ai/

That is what I am trying to follow. My data has 74 columns that I am trying to recreate using the autoencoder so I can extract the important relations. Any help or advice on where to start looking at how to implement the above would be greatly appreciated. Thanks,

Zach

jeremyeast · March 17, 2019, 4:18am

Interested to learn more as well.

Even · March 17, 2019, 4:45am

I’ve implemented one before in v0.7 and I’m working on porting it to v1.0. There’s a great discussion of this on Kaggle, where it was used to generate features that won the Porto Seguro Safe Driver competition.
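As a rough illustration of what "generating features" with an autoencoder can look like, here is a minimal plain-PyTorch sketch. The layer sizes, the 55-column input, and the names `encoder`, `decoder`, and `features` are all assumptions made for the example; the actual Kaggle solution's architecture is not described in this thread, and the training step is left out.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_features = 55  # hypothetical number of input columns

# Hypothetical split into an encoder and a decoder so the hidden
# activations can be reused after training.
encoder = nn.Sequential(nn.Linear(n_features, 128), nn.ReLU())
decoder = nn.Linear(128, n_features)
autoencoder = nn.Sequential(encoder, decoder)

# (Training the autoencoder to reconstruct its input is omitted here.)

x = torch.randn(32, n_features)  # stand-in for a batch of tabular rows
autoencoder.eval()
with torch.no_grad():
    reconstruction = autoencoder(x)  # same shape as the input
    features = encoder(x)            # hidden activations used as new features
```

The `features` tensor has one row per input row and could then be passed to a separate downstream model.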

The two trickiest parts are the shuffling of data, which I’ve got a nice trick for partly because it would have been so complex to integrate into fastai, and the dataloader, which I think should be easier in V1 using label_by_func but I haven’t implemented it.

For data shuffling I originally implemented it in the dataloader, but it’s not as efficient or easy to integrate as my new solution, which is to swap within the batch as a module. Here’s some code to get you started:

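import torch
import torch.nn as nn

class BatchSwapNoise(nn.Module):
    """Swap Noise module"""

    def __init__(self, p):
        super().__init__()
        self.p = p  # probability of swapping each cell

    def forward(self, x):
        if self.training:
            # Cells to swap; random tensors are created on x's device so this also runs on GPU
            mask = torch.rand(x.size(), device=x.device) > (1 - self.p)
            # Random row offset per cell; multiplying by the row width keeps the column fixed
            offset = torch.floor(torch.rand(x.size(), device=x.device) * x.size(0)).long()
            idx = torch.arange(x.nelement(), device=x.device) + (offset * mask.long() * x.size(1)).view(-1)
            # Wrap indices that run past the end of the flattened batch
            idx[idx >= x.nelement()] = idx[idx >= x.nelement()] - x.nelement()
            return x.view(-1)[idx].view(x.size())
        else:
            return x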

There’s more discussion on the forum here:

muellerzr · March 17, 2019, 5:09am

I’m about to take part 2 of the course, so I’m trying to pick up the added complexity as quickly as I can. When I create my model, this is what I get:

TabularModel(
  (embeds): ModuleList()
  (emb_drop): Dropout(p=0.0)
  (bn_cont): BatchNorm1d(55, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (layers): Sequential(
    (0): Linear(in_features=55, out_features=221, bias=True)
    (1): ReLU(inplace)
    (2): BatchNorm1d(221, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (3): Linear(in_features=221, out_features=1500, bias=True)
    (4): ReLU(inplace)
    (5): BatchNorm1d(1500, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (6): Linear(in_features=1500, out_features=1500, bias=True)
    (7): ReLU(inplace)
    (8): BatchNorm1d(1500, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (9): Linear(in_features=1500, out_features=1500, bias=True)
    (10): ReLU(inplace)
    (11): BatchNorm1d(1500, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (12): Linear(in_features=1500, out_features=221, bias=True)
    (13): ReLU(inplace)
    (14): BatchNorm1d(221, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (15): Linear(in_features=221, out_features=55, bias=True)
  )
)

To implement BatchSwapNoise into said generated model, would I need to do a model.layers.add_module()? Also what is the p in the parameters?

Thanks!

Even · March 18, 2019, 3:10am

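It should be your first layer, so you can’t add it to the existing model that way: add_module() appends to the end, and the noise has to be applied at the start, before any other layer sees the input.

The p works the same as in dropout: it is the probability of a swap, between 0 and 1.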
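One way to put the noise in front of an existing model is to wrap both in a new nn.Sequential. The sketch below is only an illustration under assumptions: `SwapNoise` is a simplified variant written for this example (not the BatchSwapNoise code above), the small `autoencoder` is a hypothetical stand-in for the 55-feature model printed earlier in the thread, and `p=0.15` is an arbitrary choice.

```python
import torch
import torch.nn as nn

class SwapNoise(nn.Module):
    """Simplified swap noise: with probability p, replace a cell with the
    value from the same column of a randomly chosen row in the batch."""

    def __init__(self, p):
        super().__init__()
        self.p = p

    def forward(self, x):
        if not self.training:
            return x  # no noise at inference time
        swap = torch.rand_like(x) < self.p                      # cells to swap
        rows = torch.randint(0, x.size(0), x.shape, device=x.device)
        # gather along dim 0 picks x[rows[i, j], j], so the column is preserved
        return torch.where(swap, x.gather(0, rows), x)

# Hypothetical stand-in for the existing model.
autoencoder = nn.Sequential(nn.Linear(55, 221), nn.ReLU(), nn.Linear(221, 55))

# The noise module goes first, so only the raw input is corrupted.
model = nn.Sequential(SwapNoise(p=0.15), autoencoder)

torch.manual_seed(0)
x = torch.randn(8, 55)
model.train()
noisy = model[0](x)   # some cells swapped within their column
model.eval()
clean = model[0](x)   # passed through unchanged
```

Because the swap only happens in training mode, calling `model.eval()` is enough to switch the noise off when generating reconstructions or features.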

muellerzr · March 18, 2019, 6:02pm

@Even What would a proper loss function be? I believe I have everything set up how it needs to be, but when I run my model it says accuracy() missing 1 required positional argument: 'targs'

I have tried setting it up as learn = Learner(data, model) and learn = Learner(data, model, accuracy)
Do I need a custom loss function?

Thanks!

Seb · March 18, 2019, 6:17pm

Were any of you able to replicate the Porto Seguro results? I tried months ago but failed… I also tried to look up successful replications and didn’t find any.

Maybe it’s time for me to dig up my implementation.

Even · March 18, 2019, 10:26pm

It’s typically trained on MSE, although I’ve played around with that a bit as well.

You can’t use accuracy because there are multiple categorical and continuous variables.
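To make the MSE setup concrete, here is a minimal plain-PyTorch sketch in which the target is the input itself. It assumes all columns are continuous and already normalized; the random data, the layer sizes, the learning rate, and the number of steps are arbitrary choices for illustration, and no fastai calls are used.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

n_features = 55                     # hypothetical number of continuous columns
x = torch.randn(256, n_features)    # stand-in for normalized tabular data

model = nn.Sequential(
    nn.Linear(n_features, 16), nn.ReLU(),  # encoder
    nn.Linear(16, n_features),             # decoder
)
loss_func = nn.MSELoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

losses = []
for step in range(100):
    opt.zero_grad()
    loss = loss_func(model(x), x)   # reconstruction error against the input
    loss.backward()
    opt.step()
    losses.append(loss.item())
```

There is no separate label here, which is why a classification metric such as accuracy has nothing meaningful to compare against; the reconstruction loss itself is the quantity to monitor.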

Even · March 18, 2019, 10:27pm

I was able to get it working, yeah. It’s pretty tricky for sure.

muellerzr · March 18, 2019, 10:37pm

Gotcha. What if I were to have them all continuous values? Also I receive TypeError: mean_squared_error() missing 1 required positional argument: 'targ'

Even · March 19, 2019, 3:12am

Accuracy doesn’t make sense in a continuous context. Not sure about the type error. I’d try on the fastai users forum if you can’t figure it out from googling and trying your own debugging.

kachun1017 · March 25, 2019, 6:40pm

Hi, I have created a denoising autoencoder replicating the Porto Seguro Safe Driver Prediction solution on Kaggle.

Here is the kernel.
I will keep extending it.

mindtrinket · April 12, 2019, 1:26pm

Hey @kachun1017 can you change the link to remove the edit? It prevents those without access to your account from getting there!

kachun1017 · April 12, 2019, 5:27pm

Just did it. Thanks!

abhikjha · August 9, 2019, 6:41am

Hi Zach, did you finally figure out how to use fastai tabular with an autoencoder? I would be very interested to see some notebooks on this.

muellerzr · August 12, 2019, 5:47pm

Hey @abhikjha, I tried adding a DAE, and I did see slightly better results, but the improvement was negligible.

abhikjha · August 13, 2019, 10:51am

Thanks Zach. If you don’t mind, can you share the notebook in which you implemented the DAE? I could build a simple autoencoder in plain PyTorch (i.e. without using fastai) for a binary classification problem, but I really want to take advantage of fastai’s features such as fit_one_cycle.

Moreover, I am trying to build an autoencoder to solve a multi-class classification problem on tabular data. I can’t seem to find any reference for implementing this. Have you come across any such implementation?

muellerzr · August 13, 2019, 11:14am

I’m afraid I cannot, as I’m using it in research right now. As a hint, though: essentially I had to change tabular_learner’s source code to generate this model, and then took inspiration from the Porto Seguro model.

For the second problem, I have not but I also haven’t done extensive research into multi-class in general. Apologies!

ytian · September 25, 2019, 2:28pm

I recently used fastai V1 and autoencoder to analyze high-dimensional biological data that are in a tabular format:

takotab · October 2, 2019, 7:29am

Does anyone know if there are results using this approach with variational autoencoders?