Porto Seguro Winning Solution -- Representation learning

Very interesting info Ben, thank you for sharing :slight_smile:

By much better results, do you mean the ability to reproduce MNIST images from your validation set? What cost function does one normally use for that - some distance measure on pixel values?

In general, would we use the ability of an autoencoder to reproduce the inputs as a way of gauging the quality of the middle-layer activations?

I saw much better results than with just adding noise to the data, and better than adding L1 regularization (which is what the tutorials I followed, from the Keras blog, used). Note: I did not try batch norm.

Error: I used the same as in the tutorials, just MSE on the reconstructions. And yes, 'better results' meant lower error (perhaps subjective, but the reconstructions also looked better visually at similar error levels).
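
To make that concrete, here is a minimal sketch of the training step I mean, in PyTorch rather than the Keras code from the blog tutorials; the sizes and the additive-noise scheme are just placeholders (the Porto solution itself used swap noise). The key point is that the loss compares the reconstruction to the clean input.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, for illustration only.
n_features, n_hidden = 784, 128

autoencoder = nn.Sequential(
    nn.Linear(n_features, n_hidden), nn.ReLU(),   # encoder
    nn.Linear(n_hidden, n_features),              # decoder
)
criterion = nn.MSELoss()

x = torch.rand(64, n_features)                    # a batch of clean inputs
x_noisy = x + 0.1 * torch.randn_like(x)           # simple additive noise, just for illustration

recon = autoencoder(x_noisy)
loss = criterion(recon, x)   # compare against the *clean* input, not the noisy one
loss.backward()
```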

My interpretation is that the ability to closely reproduce the inputs implies that the AE has been able to learn/extract latent structure in the data (which should correlate with the potential usefulness of the encodings).

As a side note, for a lot of deep learning based solutions it makes sense that AEs have not retained popularity: you might expect the network you are using to extract this structure anyway (I think Jeremy commented or alluded to this elsewhere). In the structured data case we're discussing here, perhaps we can look at the AE as doing something analogous to embeddings: extracting a rich feature representation.

2 Likes

Still experimenting (sample data is used in the results below):

  1. CV Gini scores with an ordinary MLP using only OHE and raw features:

[0.25504768039310477,
0.23858611474450264,
0.24050216487307599,
0.25277974470086056,
0.2294581250655717]

  2. My second attempt was to use a relu-activated MLP as the autoencoder, but it failed badly (even worse than above).

  3. When I used a linear activation in the bottleneck and relu for the other layers, these are the CV Gini scores (a sketch of this architecture follows the scores):

[0.25896930901067799,
0.258179266788859,
0.28943407733722043,
0.27238671631311695,
0.25446615547298773]
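
For reference, this is roughly the shape I mean in point 3: relu everywhere except a linear bottleneck, plus a linear output for MSE reconstruction. The layer sizes below are placeholders, not the ones I actually used:

```python
import torch.nn as nn

# Placeholder sizes -- not from my experiments or from Jahrer's write-up.
n_inputs, n_hidden, n_bottleneck = 200, 1000, 1000

dae = nn.Sequential(
    nn.Linear(n_inputs, n_hidden), nn.ReLU(),
    nn.Linear(n_hidden, n_bottleneck),              # bottleneck: linear, no activation
    nn.Linear(n_bottleneck, n_hidden), nn.ReLU(),
    nn.Linear(n_hidden, n_inputs),                  # linear output, trained with MSE
)
```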

1 Like

Should someone find this thread sometime down the road… here is a really neat paper that provides a nice overview of autoencoders and their applicability.

Extracting and Composing Robust Features with Denoising Autoencoders by Vincent et al.

6 Likes

I'm taking a look at your GitHub as this is of interest to me right now, and one thing I did notice was that the winning solution wasn't entirely linear; only the middle layer was.

I'm curious about how the embedding model worked. That was my first instinct as well, especially since some of my categorical variables have very high cardinality.

Are you going to be in Part 2 this spring? If so, maybe we can work on this together.

1 Like

Hi Even,

I will be taking Part 2 starting next week as well. The DAE should have relu activations except for the middle layer, which has a linear activation; I don't know what the main motivation behind this is. If that is not the case in my class definition, I should check it. The embedding model as it is didn't work much better than the XGB models shared on Kaggle.

1 Like

I think I have a working version of Porto that uses the structured data embeddings. The only component missing is how to weight the categorical loss against the continuous one.

I'm using cross entropy per categorical variable and MSE for the continuous ones. As a starting point I've tried weighting them by the per-variable loss, so that MSE/continuous = CE/categorical, but that feels like a naive approach and is unlikely to work since the two functions are so different.
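
For context, this is roughly the kind of composite loss I mean; the function and argument names are made up, and the per-variable weights are exposed so that different balancing schemes can be plugged in:

```python
import torch.nn.functional as F

def mixed_reconstruction_loss(cont_pred, cont_target, cat_logits, cat_targets,
                              cont_weight=1.0, cat_weights=None):
    """cont_pred/cont_target: (bs, n_cont) tensors.
    cat_logits: list of (bs, n_classes_i) tensors, one per categorical variable.
    cat_targets: list of (bs,) LongTensors with the true category indices."""
    loss = cont_weight * F.mse_loss(cont_pred, cont_target)
    if cat_weights is None:
        cat_weights = [1.0] * len(cat_logits)   # plain sum by default
    for w, logits, target in zip(cat_weights, cat_logits, cat_targets):
        loss = loss + w * F.cross_entropy(logits, target)
    return loss
```

Setting all the weights to 1 recovers a plain sum; the open question is what the weights should actually be.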

@jeremy do you have any ideas / experience with this? I'm wondering if @rachel or one of the other math wizards can provide a theoretical framework. The only discussion online about this seems to be about comparing apples to oranges and how you shouldn't do it.

I'm not aware of any basis for this other than trying things and seeing what works.

I've played around with a few, but the one I'm settling on for now is MSE + MCE (mean cross entropy), which is at least consistent across models of varying sizes.
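
In code, the MSE + MCE variant just averages the cross-entropy terms over the categorical heads instead of summing them (again only a sketch, reusing the hypothetical names from above):

```python
import torch.nn.functional as F

def mse_plus_mce(cont_pred, cont_target, cat_logits, cat_targets):
    """MSE over the continuous block plus the *mean* cross entropy over the categorical heads."""
    mse = F.mse_loss(cont_pred, cont_target)
    mce = sum(F.cross_entropy(l, t) for l, t in zip(cat_logits, cat_targets)) / len(cat_logits)
    return mse + mce
```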

I tried some other metrics based on balancing the error between the continuous and categorical elements, but it was hard to interpret the loss.

So far I'm able to train the model to a validation loss of 0.51 for normally distributed continuous variables with a stdev of 1, which I think is good, but I'll need to compare the outputs.

It's still underfitting slightly, which may be the result of using both swap-column data augmentation and dropout, so I'll have to explore lowering the dropout. Eventually I need to do an ablation study.

The other thing I'd like to do is compare it to the original DAE, which outputs one-hot encodings and uses MSE for the entire output vector. I think to do so I just need to modify the loss so that the categoricals are output in that form. I'm curious to see whether explicit category embeddings and cross-entropy loss help make a better-fitting model.

@kcturgutlu

Kerem,

It's probably been a while since you touched DAEs…

I found your github useful to get started. Is your latest version in DAE.py or in PortoSeguro.ipynb?

Did you end up using the activations from the DAE to train a Porto Seguro model?

I am considering a slightly different approach to applying the technique with fastai:

  1. ColumnarModelData and ColumnarDataset can easily be modified to accept tfms
  2. inputSwapNoise is turned into a class with parameter p and a __call__ method (sketched below, after this list)
  3. Next I was thinking that, instead of saving activations, we could save the first few layers of the autoencoder (up to the activations we are interested in) and reuse them in our final model after freezing them.
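
Here is roughly what I have in mind for point 2. The class name and default p are my own placeholders, and it operates on a whole array of rows at once, since swap noise needs other rows to draw replacement values from:

```python
import numpy as np

class InputSwapNoise:
    "With probability p, replace each cell with the value from a random row of the same column."
    def __init__(self, p=0.15):
        self.p = p

    def __call__(self, x):
        x = np.asarray(x, dtype=np.float32)
        n_rows, n_cols = x.shape
        swap = np.random.rand(n_rows, n_cols) < self.p               # which cells to corrupt
        donor_rows = np.random.randint(0, n_rows, (n_rows, n_cols))  # rows to steal values from
        return np.where(swap, x[donor_rows, np.arange(n_cols)], x)
```

The batch (or the whole training array) goes in and the corrupted copy comes out, while the clean version is kept as the reconstruction target. For point 3, the idea would then be to keep the trained encoder layers, set requires_grad = False on their parameters, and put the classifier head on top.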

This is all functionality that's available in fastai for image data, but apparently not for columnar data.

Thanks in advance for any insights.

1 Like

Even though this result is quite old, I thought I might try to implement it (I can see that I'm not the first). Naturally I have a lot of questions.

Michael Jahrer did a great write-up about this, but I need some help to understand the details.

I have three questions, and I hope some of you more experienced people can offer answers.

The idea seems to be to train a model (a denoising autoencoder) and then use the activations of its hidden layers as features. This is very similar to using an autoencoder to reduce dimensionality, except that here the hidden layers are big, so the dimensionality goes up.
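
In code, I picture it roughly like this (PyTorch, with made-up layer sizes): train the DAE, then run clean data through it and collect the hidden-layer activations as the feature matrix for the supervised model.

```python
import torch
import torch.nn as nn

# Made-up sizes; untrained here, but in practice this DAE would be trained first.
dae = nn.Sequential(
    nn.Linear(200, 1000), nn.ReLU(),
    nn.Linear(1000, 1000), nn.ReLU(),
    nn.Linear(1000, 1000), nn.ReLU(),
    nn.Linear(1000, 200),
)

activations = []
def grab(module, inp, out):
    activations.append(out.detach())

# Hook every relu so each hidden layer's activations get captured on the forward pass.
for m in dae.modules():
    if isinstance(m, nn.ReLU):
        m.register_forward_hook(grab)

x = torch.rand(32, 200)                      # a clean batch at feature-extraction time
dae(x)
features = torch.cat(activations, dim=1)     # 32 x 3000: the new feature matrix
print(features.shape)
```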

Q1
Step 1 is to use data augmentation to get noisy inputs by replacing p percent of the values in each column with a value from another row. Now, if a given column has only a few distinct values and they are not equally frequent, there is a pretty good chance the replacement value is the same as the original. Is p the fraction of values that get swapped, or the fraction that actually end up changed?
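
To illustrate what I mean, here is a quick numpy check on a made-up, heavily skewed categorical column: roughly p of the cells get selected for a swap, but noticeably fewer actually end up with a different value.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.15

# A made-up, heavily skewed categorical column.
col = rng.choice([0, 1, 2], size=100_000, p=[0.90, 0.07, 0.03])

swap = rng.random(col.shape) < p                    # cells selected for a swap (~15%)
donors = rng.integers(0, len(col), size=col.shape)  # random rows to draw replacements from
noisy = np.where(swap, col[donors], col)

print("selected for swap:", swap.mean())            # ~0.15
print("actually changed: ", (noisy != col).mean())  # noticeably lower: many draws match the original
```
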
Q2
Normalization: Michael Jahrer normalized the data using something called RankGauss. This does not just move the mean to 0 and the standard deviation to 1; it also reshapes the distribution to be more bell-shaped. I have not seen any papers talking about that. Why is this a good idea, and when should you use it?
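My rough understanding of the transform, sketched below (this is my own reconstruction, not Jahrer's code): rank the values, scale the ranks into (-1, 1), and push them through the inverse error function so the result looks roughly Gaussian.

```python
import numpy as np
from scipy.special import erfinv
from scipy.stats import rankdata

def rank_gauss(x, eps=1e-6):
    """Map a 1-D array to an approximately Gaussian shape via its ranks."""
    ranks = rankdata(x, method="average")                  # 1..n
    scaled = (ranks - 1) / (len(x) - 1)                    # 0..1
    scaled = np.clip(scaled * 2 - 1, -1 + eps, 1 - eps)    # (-1, 1), avoiding +/-inf at the ends
    return erfinv(scaled)

x = np.random.exponential(size=1000)                       # something clearly non-Gaussian
z = rank_gauss(x)
print(z.mean(), z.std())                                   # mean ~0; std ~0.7, rescale if you need exactly 1
```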

Q3
Finally, Jahrer uses 5-fold CV, which means training the model 5 times, each time using 80% of the data for training and 20% for validation, and then averaging the 5 models' predictions.
I didn't see anything about this in the fastai class, and as best as I understand cross-validation, you need the models to be good at different things. How do you know when using several models, each trained on a subset of the data, will be better than a single model trained on the entire data set?
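
Just to make sure I have the mechanics right, this is what I understand 5-fold CV plus averaging to mean (scikit-learn KFold, with a fake data set and a stand-in model):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression      # stand-in for the real model

X, y = np.random.rand(1000, 20), np.random.randint(0, 2, 1000)   # fake data
X_test = np.random.rand(200, 20)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
test_preds = np.zeros(len(X_test))

for train_idx, valid_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])                 # train on 80% of the rows
    print("fold validation score:", model.score(X[valid_idx], y[valid_idx]))
    test_preds += model.predict_proba(X_test)[:, 1] / kf.get_n_splits()   # average over folds
```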

Hope these questions are not so simple that I should have figured them out myself.