How to use categorical data in autoencoder in PyTorch?

kayn · May 22, 2020, 10:41am

Hi everyone. I’m new to deep learning and my question is about using autoencoders in PyTorch. I have a tabular dataset with a categorical feature that has 10 different categories. The names of these categories are quite different - some consist of one word, some of two or three words - but all in all I have 10 unique category names. What I’m trying to do is create an autoencoder that encodes the names of these categories. For example, if I have a category named 'Medium size class', I want to see whether it is possible to train an autoencoder to encode this name as something like 'mdmsc'. The goal is to find out which data points are hard to encode or not typical. I tried to adapt autoencoder architectures from various online tutorials, but nothing seems to work for me, or I simply do not know how to use them, as they are all about images. Does anyone have an idea how this type of autoencoder could be built, if it is possible at all?
Here’s the model I have so far (I just tried to adapt some architectures I found online):

import torch.nn as nn

class Autoencoder(nn.Module):

    def __init__(self, input_shape, encoding_dim):
        super(Autoencoder, self).__init__()

        # Encoder: compress the input down to encoding_dim values
        self.encode = nn.Sequential(
            nn.Linear(input_shape, 128),
            nn.ReLU(True),
            nn.Linear(128, 64),
            nn.ReLU(True),
            nn.Linear(64, encoding_dim),
        )

        # Decoder: expand the code back to the input size
        self.decode = nn.Sequential(
            nn.Linear(encoding_dim, 64),
            nn.ReLU(True),
            nn.Linear(64, 128),
            nn.ReLU(True),
            nn.Linear(128, input_shape)
        )

    def forward(self, x):
        x = self.encode(x)
        x = self.decode(x)
        return x

model = Autoencoder(input_shape=10, encoding_dim=5)

I also use LabelEncoder() and then OneHotEncoder() to give the categories I mentioned a numerical form. However, after training, the output is the same as the input (no change to the category name), and when I try to use only the encoder part I’m unable to apply LabelEncoder() and OneHotEncoder() to its output because of dimension issues. I feel like maybe I should do something differently at the beginning, when I give those features a numerical form, but I’m not sure what.

I guess there might be a simple solution, but I feel stuck. Thank you in advance for your help!

etremblay · May 28, 2020, 5:50pm

You might want to look at the code of dfencoder, or simply use it directly. It takes a pandas DataFrame as input, autoencodes the dataset, and lets you get back the encoded representations. It handles categorical data too. It is written in PyTorch.
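
For anyone who would rather keep a hand-written model, here is a minimal sketch of the one-hot route described in the question. It is not taken from dfencoder. The category names, the 200-row column, the layer sizes and the code size of 3 are all made up for illustration, and cross-entropy is my choice of reconstruction loss since the question does not say which loss was used. Note that the encoder output is a vector of numbers (here 3 per row), not a string such as 'mdmsc' and not a one-hot vector, so it cannot be passed back through the one-hot or label encoders. Only the decoder output, which has one value per category, can be mapped back to a name.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical category names standing in for the 10 real ones.
categories = [f"category {i}" for i in range(10)]
name_to_index = {name: i for i, name in enumerate(categories)}

# Hypothetical column of 200 rows drawn from those categories.
column = [categories[i % 10] for i in range(200)]
indices = torch.tensor([name_to_index[name] for name in column])

# One-hot encode: each row becomes a vector of length 10.
one_hot = F.one_hot(indices, num_classes=len(categories)).float()

# Small encoder/decoder pair; the sizes are arbitrary assumptions.
encoder = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 3))
decoder = nn.Sequential(nn.Linear(3, 16), nn.ReLU(), nn.Linear(16, 10))
params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=1e-2)

for epoch in range(300):
    optimizer.zero_grad()
    logits = decoder(encoder(one_hot))
    # The decoder emits one score per category; cross-entropy compares
    # those scores with the index of the true category.
    loss = F.cross_entropy(logits, indices)
    loss.backward()
    optimizer.step()

with torch.no_grad():
    codes = encoder(one_hot)    # encoder-only output, shape (200, 3)
    logits = decoder(codes)     # reconstruction scores, shape (200, 10)
    # Per-row reconstruction error: larger means harder to encode.
    per_row_error = F.cross_entropy(logits, indices, reduction="none")
    # Map the decoder output back to category names.
    decoded_names = [categories[i] for i in logits.argmax(dim=1).tolist()]
```

The rows with the largest per_row_error are the ones the model reconstructs worst. With a single categorical feature as the only input, every row of the same category gets the same error, so this score only becomes informative for spotting untypical data points once the other columns of the table are fed in as well.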