Porto Seguro Winning Solution -- Representation learning


#41

Very interesting info Ben, thank you for sharing :slight_smile:

By “much better results”, do you mean the ability to reproduce MNIST images from your validation set? What cost function does one normally use for that - some distance measure on pixel values?

In general, would we use the ability of an autoencoder to reproduce its inputs as a way of gauging the quality of the middle-layer activations?


(Ben Eacrett) #42

I saw much better results than with just adding noise to the data, and better than with adding L1 regularization (the approaches used by the tutorials I followed -> from the Keras blog). Note - I did not try batch norm.

Error - I used the same as in the tutorials -> just MSE on the reconstructions. And yes, “better results” meant less error (perhaps subjective, but the reconstructions also looked better visually at similar error levels).

My interpretation is that the ability to closely reproduce inputs implies that the autoencoder has been able to learn / extract latent structure in the data (which should correlate with the potential usefulness of the encodings).

As a side note, for a lot of deep learning based solutions it makes sense that AEs have not retained popularity - you might expect the network you are using to extract this structure anyway (I think Jeremy commented or alluded to this elsewhere). In the structured data case we’re discussing here, perhaps we can look at the AE as doing something analogous to embeddings -> extracting a rich feature representation.
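
To make the setup concrete, here is a minimal sketch of the kind of denoising training step being described (in PyTorch rather than the Keras code from the blog tutorials; the layer sizes and noise level are made-up placeholders, not anything from an actual solution):

```python
import torch
import torch.nn as nn

# Placeholder sizes -- purely illustrative.
n_features, n_hidden, n_code = 200, 128, 32

encoder = nn.Sequential(nn.Linear(n_features, n_hidden), nn.ReLU(),
                        nn.Linear(n_hidden, n_code))
decoder = nn.Sequential(nn.Linear(n_code, n_hidden), nn.ReLU(),
                        nn.Linear(n_hidden, n_features))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
mse = nn.MSELoss()

def train_step(x_clean):
    # Corrupt the input (Gaussian noise here), but reconstruct the *clean*
    # input and score with plain MSE on the reconstruction, as in the tutorials.
    x_noisy = x_clean + 0.1 * torch.randn_like(x_clean)
    recon = decoder(encoder(x_noisy))
    loss = mse(recon, x_clean)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

The quality check is then just how low the reconstruction MSE gets on held-out rows (plus an eyeball check when the inputs are images).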


(Kerem Turgutlu) #43

Still experimenting (sample data is used in the results below):

  1. CV gini scores with an ordinary MLP using only OHE and raw features:

[0.25504768039310477,
0.23858611474450264,
0.24050216487307599,
0.25277974470086056,
0.2294581250655717]

  2. My second attempt was to use a relu-activated MLP as an autoencoder, but it failed badly (even worse than above).

  3. When I used a linear activation in the bottleneck and relu for the other layers, these are the gini CV scores (see the sketch after these results for the gini metric):

[0.25896930901067799,
0.258179266788859,
0.28943407733722043,
0.27238671631311695,
0.25446615547298773]
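
For reference, the gini scores here are the competition metric, normalized gini; a common shortcut for a binary target is that normalized gini = 2 * AUC - 1, e.g. something like:

```python
from sklearn.metrics import roc_auc_score

def gini_normalized(y_true, y_pred):
    # For a binary target, normalized Gini == 2 * ROC AUC - 1.
    return 2 * roc_auc_score(y_true, y_pred) - 1
```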


#44

Should someone find this thread sometime down the road… here is a really neat paper that provides a nice overview of autoencoders and their applicability.

Extracting and Composing Robust Features with Denoising Autoencoders by Vincent et al.


(Even Oldridge) #45

I’m taking a look at your GitHub, as this is of interest to me right now, and one thing I did notice is that the winning solution wasn’t entirely linear - only the middle layer was.

I’m curious about how the embedding model worked. That was my first instinct as well, especially since some of my categorical variables have huge cardinalities.

Are you going to be in part II this spring? If so maybe we can work on this together.


(Kerem Turgutlu) #46

Hi Even,

I will be taking Part 2 starting next week as well. The DAE should have relu activations everywhere except the middle layer, which has a linear activation. I don’t know what the main motivation behind this is. If that’s not how my class definition is set up, I should check it. The embedding model as it is didn’t work much better than the XGBoost models shared on Kaggle.
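
Roughly what I mean, as a sketch (layer widths are placeholders, not the actual winning configuration): relu on the hidden layers, no activation on the middle layer, and the middle-layer activations are what get used as features downstream.

```python
import torch.nn as nn

class DAE(nn.Module):
    def __init__(self, n_in, n_hidden=256, n_code=64):   # placeholder sizes
        super().__init__()
        self.enc = nn.Linear(n_in, n_hidden)
        self.bottleneck = nn.Linear(n_hidden, n_code)     # linear: no activation here
        self.dec = nn.Linear(n_code, n_hidden)
        self.out = nn.Linear(n_hidden, n_in)
        self.relu = nn.ReLU()

    def forward(self, x):
        h = self.relu(self.enc(x))
        z = self.bottleneck(h)        # linear middle layer -> these are the encodings
        h = self.relu(self.dec(z))
        return self.out(h), z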


(Even Oldridge) #47

I think I have a working version of Porto that uses the structured data embeddings. The only component missing is how to weight the categorical loss against the continuous one.

I’m using cross_entropy per categorical and MSE for the continuous variables. As a starting point I’ve tried weighting them by the per-variable loss, so that the MSE per continuous variable equals the cross-entropy per categorical, but that feels like a naive approach and is unlikely to work since the two loss functions are so different.
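
For concreteness, here is a rough sketch of the kind of combined loss I mean (the names and the weighting knob are just illustrative, not a settled implementation): one cross-entropy term per categorical output head plus one MSE term over the continuous block.

```python
import torch
import torch.nn.functional as F

def mixed_loss(cont_pred, cont_true, cat_logits, cat_true, cat_weight=1.0):
    """cont_pred / cont_true: (batch, n_cont) floats.
    cat_logits: list of (batch, n_classes_i) tensors, one per categorical.
    cat_true: (batch, n_cats) integer class indices.
    cat_weight is the knob in question."""
    mse = F.mse_loss(cont_pred, cont_true)
    # Mean cross-entropy across the categorical heads.
    ce = torch.stack([F.cross_entropy(logits, cat_true[:, i])
                      for i, logits in enumerate(cat_logits)]).mean()
    return mse + cat_weight * ce
```

With cat_weight = 1 this reduces to MSE plus the mean cross-entropy over the categorical heads.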

@jeremy do you have any ideas / experience with this? I’m wondering if @rachel or one of the other math wizards can provide a theoretical framework. The only discussion online about this seems to be about comparing apples to oranges and how you shouldn’t do it.


(Jeremy Howard) #48

I’m not aware of any basis for this other than trying things and seeing what works.


(Even Oldridge) #49

I’ve played around with a few, but the one I’m settling on for now is MSE + MCE (Mean Cross Entropy), which is at least consistent across models of varying sizes.

I tried some other metrics based on balancing the error between the continuous and categorical elements, but it was hard to interpret the loss.

So far I’m able to train the model to a validation loss of 0.51 for normally distributed continuous variables with a stdev of 1, which I think is good, but I’ll need to compare the outputs.

It’s still underfitting slightly, which may be the result of using both swap-column data augmentation and dropout, so I’ll have to explore lowering the dropout. Eventually I need to do an ablation study.
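
For anyone curious, the swap-column augmentation I mean looks roughly like this (the swap probability is a placeholder): replace a random subset of each row’s values with values taken from the same column of other rows, so the corrupted values are still realistic for each column.

```python
import torch

def swap_noise(x, p=0.15):
    """Replace a fraction p of each row's entries with the value of the same
    column taken from another (random) row in the batch."""
    swap_mask = torch.rand_like(x) < p                 # which cells to swap
    shuffled_rows = x[torch.randperm(x.size(0))]       # donor values, same columns
    return torch.where(swap_mask, shuffled_rows, x)
```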

The other thing I’d like to do is compare it to the original DAE, which outputs one-hot encodings and uses MSE for the entire output vector. I think to do so I just need to modify the loss so that the categoricals are output in that form. I’m curious to see whether explicit category embeddings and cross-entropy loss help make a better-fitting model.