Tabular Autoencoder?

I am trying to build an autoencoder for a tabular dataset using fastai v1. I have seen people use them for images and I'm trying to follow along, but I am not sure how to implement one for tabular data.

https://alanbertl.com/autoencoder-with-fast-ai/

That is what I am trying to follow. My data has 74 columns that I am trying to reconstruct using the autoencoder so I can extract the important relationships. Any help or advice on where to start would be greatly appreciated. Thanks,

Zach


Interested to learn more as well.

I've implemented one before in v0.7 and I'm working on porting it to v1.0. There's a great discussion of this on Kaggle, where it was used to generate the features that won the Porto Seguro Safe Driver Prediction competition.

The two trickiest parts are the shuffling of the data, for which I've got a nice trick (partly because it would have been so complex to integrate into fastai), and the dataloader, which I think should be easier in v1 using label_by_func, though I haven't implemented it yet.

I originally implemented the data shuffling in the dataloader, but that wasn't as efficient or as easy to integrate as my new solution, which is to do the swapping within the batch as a module. Here's some code to get you started:

import torch
import torch.nn as nn

class BatchSwapNoise(nn.Module):
    """Swap Noise module: with probability p, replace each element with
    the value from the same column in a random row of the batch."""

    def __init__(self, p):
        super().__init__()
        self.p = p

    def forward(self, x):
        if self.training:
            # Select each element for swapping with probability p
            mask = torch.rand(x.size()) > (1 - self.p)
            # Offset the flat index of each masked element by a random
            # number of whole rows, so it reads another row's value for
            # the same column
            idx = torch.add(torch.arange(x.nelement()),
                            (torch.floor(torch.rand(x.size()) * x.size(0)).type(torch.LongTensor) *
                             (mask.type(torch.LongTensor) * x.size(1))).view(-1))
            # Wrap indices that fall past the end of the batch
            idx[idx >= x.nelement()] = idx[idx >= x.nelement()] - x.nelement()
            return x.view(-1)[idx].view(x.size())
        else:
            return x
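
To see what it does, here's a quick illustrative check (the toy tensor and the p value are just made up for the demo):

    # Quick sanity check of BatchSwapNoise (illustrative values only)
    noise = BatchSwapNoise(p=0.15)
    noise.train()                      # swapping only happens in training mode
    x = torch.arange(20.).view(4, 5)   # toy batch: 4 rows, 5 columns
    print(noise(x))                    # ~15% of entries now come from other rows
    noise.eval()
    assert torch.equal(noise(x), x)    # identity at inference time

Each swapped element is replaced by the value from the same column of a random row in the batch, so the marginal distribution of every column is preserved.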

There’s more discussion on the forum here:


I'm about to take part 2 of the course, so this jump in complexity is something I'm trying to pick up rather quickly, as best I can. When I create my model, this is what I get:

TabularModel(
  (embeds): ModuleList()
  (emb_drop): Dropout(p=0.0)
  (bn_cont): BatchNorm1d(55, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (layers): Sequential(
    (0): Linear(in_features=55, out_features=221, bias=True)
    (1): ReLU(inplace)
    (2): BatchNorm1d(221, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (3): Linear(in_features=221, out_features=1500, bias=True)
    (4): ReLU(inplace)
    (5): BatchNorm1d(1500, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (6): Linear(in_features=1500, out_features=1500, bias=True)
    (7): ReLU(inplace)
    (8): BatchNorm1d(1500, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (9): Linear(in_features=1500, out_features=1500, bias=True)
    (10): ReLU(inplace)
    (11): BatchNorm1d(1500, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (12): Linear(in_features=1500, out_features=221, bias=True)
    (13): ReLU(inplace)
    (14): BatchNorm1d(221, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (15): Linear(in_features=221, out_features=55, bias=True)
  )
)

To add BatchSwapNoise to this generated model, would I need to call model.layers.add_module()? Also, what is the p parameter?

Thanks!

It should be your first layer, so you can't add it to the existing model that way; you need to put it at the start.

The p is the same as for dropout: the probability of a swap, in [0, 1].
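
For a plain module that takes a single input tensor, one way to put it at the start (just a sketch with a made-up p, not tested on your model) is to wrap both in an nn.Sequential. Note that fastai's TabularModel.forward takes separate categorical and continuous tensors, so for that you'd need to subclass it or edit its source instead:

    import torch.nn as nn

    # Hypothetical sketch: prepend swap noise to a single-input network.
    # The noise only fires in training mode, so eval() is unaffected.
    autoencoder = nn.Sequential(
        BatchSwapNoise(p=0.15),  # p is a placeholder; tune it like dropout
        base_net,                # a single-input network, not the
    )                            # two-input TabularModel printed above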


@Even What would a proper loss function be? I believe I have everything set up the way it needs to be, but when I run my model it says accuracy() missing 1 required positional argument: 'targs'.

I have tried setting it up as learn = Learner(data, model) and as learn = Learner(data, model, accuracy).
Do I need a custom loss function?

Thanks!

Were any of you able to replicate the Porto Seguro results? I tried months ago but failed… I also tried to look up successful replications and didn’t find any.

Maybe it’s time for me to dig up my implementation.

It’s typically trained on MSE, although I’ve played around with that a bit as well.

You can’t use accuracy because there are multiple categorical and continuous variables.

I was able to get it working, yeah. It's pretty tricky for sure. 🙂
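
For reference, a rough sketch of the setup (assuming your data and model already exist; the epoch count and learning rate are placeholders): in fastai v1 the third positional argument of Learner is opt_func, which is likely why passing accuracy positionally raises the missing 'targs' error. Pass the loss by keyword instead:

    from fastai.tabular import *  # fastai v1

    # Train the autoencoder to reconstruct its inputs with MSE
    learn = Learner(data, model, loss_func=MSELossFlat())
    learn.fit_one_cycle(5, 1e-3)

Metric-style functions would go in the metrics keyword argument, though as noted above, accuracy doesn't apply here.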


Gotcha. What if I made them all continuous values? Also, I receive TypeError: mean_squared_error() missing 1 required positional argument: 'targ'.

Accuracy doesn't make sense in a continuous context. Not sure about the type error. I'd ask on the fastai users forum if you can't figure it out from googling and your own debugging.

Hi, I have created a denoising autoencoder replicating the Porto Seguro Safe Driver Prediction solution on Kaggle.

Here is the kernel. I will extend it further.


Hey @kachun1017 can you change the link to remove the edit? It prevents those without access to your account from getting there!

Just did it. Thanks!

Hi Zach, did you finally figure out how to use fastai tabular with autoencoder methods? I would be very interested to see some notebooks on this.

Hey @abhikjha, I tried adding a DAE, and I did see slightly better results, but the improvement was negligible.

Thanks Zach. If you don't mind, can you share the notebook in which you implemented the DAE? I was able to build a simple autoencoder in plain PyTorch (i.e. without fastai) for a binary classification problem, but I really want to take advantage of fastai's features such as fit_one_cycle, etc.

Moreover, I am trying to build an autoencoder to solve a multi-class classification problem on tabular data. I can't seem to find any reference for implementing this. Have you come across any such implementation?

I'm afraid I cannot, as I'm using it in research right now. As a hint, though: essentially I had to change tabular_learner's source code to generate this model, and then took inspiration from the Porto model.

For the second problem, I have not, but I also haven't done extensive research into multi-class problems in general. Apologies!
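
As a very rough sketch of the overall shape (made-up layer sizes, not my actual research model): for all-continuous features you can build the autoencoder as a plain symmetric network with BatchSwapNoise at the front, then hand it to a fastai Learner with MSE as above:

    import torch.nn as nn

    def lin_block(n_in, n_out):
        # Linear -> ReLU -> BatchNorm, mirroring the TabularModel layers above
        return nn.Sequential(nn.Linear(n_in, n_out), nn.ReLU(inplace=True),
                             nn.BatchNorm1d(n_out))

    # Illustrative sizes only: 55 continuous inputs mapped through wider
    # hidden layers and reconstructed
    autoencoder = nn.Sequential(
        BatchSwapNoise(0.15),    # corrupt inputs during training only
        lin_block(55, 221),      # encoder
        lin_block(221, 1500),
        lin_block(1500, 221),    # decoder
        nn.Linear(221, 55),      # reconstruct the original 55 columns
    )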

I recently used fastai v1 and an autoencoder to analyze high-dimensional biological data in a tabular format:


Does anyone know if there are results using this approach with variational autoencoders?